Description
Using SQL "INSERT INTO EXTERN( s3()) AS CSV" as prefix part of deep-storage MSQ query, currently the query output is indeed written to S3 location, but it comes with a cost of needing assembling into one file because the MSQ output are distributed into multiple partitions/parts in the S3 location.
It would be super-useful if there is an option to have the deep-storage MSQ to output into one final CSV file instead.
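For reference, a minimal sketch of the kind of export query under discussion, using the EXTERN/S3 export syntax the request refers to; the bucket, prefix, and table names are hypothetical:

```sql
INSERT INTO
  EXTERN(S3(bucket => 'example-bucket', prefix => 'druid-exports/my-query'))
AS CSV
SELECT *
FROM "wikipedia"
WHERE __time >= TIMESTAMP '2016-06-27'
```

Today the rows under that prefix arrive as multiple partition files, typically one per partition of the final stage, rather than as a single CSV.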
Motivation
A detailed description of the intended use case, if applicable
A single CSV output file stored on S3 can be accessed conveniently, without knowledge of how the multiple smaller partition sub-files are assembled.
Rationale for why the desired feature/change would be beneficial
The deep-storage MSQ query is an important contribution to Druid as an alternative way to query cold/backup data on a non-urgent timeline. Currently, getting the query result back is challenging for the user community, because each of the following methods has drawbacks:
1. Getting the result through the Druid endpoint "druid/v2/sql/statements/{queryId}/results?page={page}&resultFormat=csv" is sequential and very time-consuming (in one test it took 30 minutes to retrieve 1 GB of query results, using a curl HTTPS call running in the same AWS region as the S3 bucket). A rough sketch of this paged retrieval follows this list.
2. As described in the feature request above, using "INSERT INTO EXTERN(S3()) AS CSV" distributes the output across many smaller files. The deep-storage query caller needs to know the implementation details of how these files are named and sequenced in order to assemble them back into one complete query result. This is neither error-proof nor productive for the Druid user community.
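To make the first drawback concrete, a paged retrieval might look like the sketch below. This is a rough illustration only: the host, query ID, and page count are hypothetical placeholders, and it assumes each page's CSV payload can simply be concatenated.

```sh
# Hypothetical host, query ID, and page count.
DRUID="https://druid.example.com"
QUERY_ID="query-abc123"
PAGES=100

# Pages must be fetched one at a time, which is why retrieval is sequential.
for PAGE in $(seq 0 $((PAGES - 1))); do
  curl -s "${DRUID}/druid/v2/sql/statements/${QUERY_ID}/results?page=${PAGE}&resultFormat=csv" \
    >> result.csv
done
```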
> Getting the result through the Druid endpoint "druid/v2/sql/statements/{queryId}/results?page={page}&resultFormat=csv" is sequential and very time-consuming (in one test it took 30 minutes to retrieve 1 GB of query results, using a curl HTTPS call running in the same AWS region as the S3 bucket).
If you could share the flamegraph of the broker while running this, I can debug the cause of slowness.
Regarding 2: the processing engine distributes work across multiple tasks. Doing what you describe would effectively mean having a single task in the final stage, since ordering would have to be preserved across tasks, and that is not a scalable design.
For your use case, if you want to do this, add a LIMIT to the query. It will force the last stage to run as a single task, which will result in a single CSV file on S3.
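For illustration, a sketch of that workaround under the same assumptions as the earlier example (bucket, prefix, table, and limit value are placeholders):

```sql
INSERT INTO
  EXTERN(S3(bucket => 'example-bucket', prefix => 'druid-exports/my-query'))
AS CSV
SELECT *
FROM "wikipedia"
-- Per the suggestion above, a LIMIT forces the final stage into a single
-- task, which should yield a single output file.
LIMIT 100000000
```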
Hi @cryptoe, I tested on Druid 29.0.1 with "INSERT INTO EXTERN(S3()) AS CSV" and a SQL statement ending in a very large LIMIT. So far, when the MSQ deep-storage query finishes, I only see one small CSV file, for partition 0.
I'm not sure in which Druid version writing to one single final CSV file in S3 (with LIMIT set in the SQL) is supported. Thanks for your time.