Description
Using SQL "INSERT INTO EXTERN( s3()) AS CSV" as prefix part of deep-storage MSQ query, currently the query output is indeed written to S3 location, but it comes with a cost of needing assembling into one file because the MSQ output are distributed into multiple partitions/parts in the S3 location.
It would be super-useful if there is an option to have the deep-storage MSQ to output into one final CSV file instead.
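For reference, a minimal sketch of the kind of export query under discussion, using the EXTERN/S3 export syntax the request refers to; the bucket, prefix, and table names are hypothetical:

```sql
INSERT INTO
  EXTERN(S3(bucket => 'example-bucket', prefix => 'druid-exports/my-query'))
AS CSV
SELECT *
FROM "wikipedia"
WHERE __time >= TIMESTAMP '2016-06-27'
```

Today the rows under that prefix arrive as multiple partition files, typically one per partition of the final stage, rather than as a single CSV.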
Motivation
A detailed description of the intended use case, if applicable
A single CSV output file stored on S3 can be accessed conveniently, without knowledge of how the multiple smaller partition sub-files are assembled.
Rationale for why the desired feature/change would be beneficial
The deep-storage MSQ query is an important contribution to Druid as an alternative way to query cold/backup data on a non-urgent timeline. Currently, getting the query result back is challenging for the user community, because each of the following methods has drawbacks:
1. Getting the result through the Druid endpoint "druid/v2/sql/statements/{queryId}/results?page={page}&resultFormat=csv" is sequential and very time-consuming (in one test it took 30 minutes to retrieve 1 GB of query results, using a curl HTTPS call running in the same AWS region as the S3 bucket). A rough sketch of this paged retrieval follows this list.
2. As described in the feature request above, using "INSERT INTO EXTERN(S3()) AS CSV" distributes the output across many smaller files. The deep-storage query caller needs to know the implementation details of how these files are named and sequenced in order to assemble them back into one complete query result. This is neither error-proof nor productive for the Druid user community.
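To make the first drawback concrete, a paged retrieval might look like the sketch below. This is a rough illustration only: the host, query ID, and page count are hypothetical placeholders, and it assumes each page's CSV payload can simply be concatenated.

```sh
# Hypothetical host, query ID, and page count.
DRUID="https://druid.example.com"
QUERY_ID="query-abc123"
PAGES=100

# Pages must be fetched one at a time, which is why retrieval is sequential.
for PAGE in $(seq 0 $((PAGES - 1))); do
  curl -s "${DRUID}/druid/v2/sql/statements/${QUERY_ID}/results?page=${PAGE}&resultFormat=csv" \
    >> result.csv
done
```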
> Getting the result through the Druid endpoint "druid/v2/sql/statements/{queryId}/results?page={page}&resultFormat=csv" is sequential and very time-consuming (in one test it took 30 minutes to retrieve 1 GB of query results, using a curl HTTPS call running in the same AWS region as the S3 bucket).
If you could share the flamegraph of the broker while running this, I can debug the cause of slowness.
Regarding 2: the processing engine distributes work across multiple tasks. Doing what you describe would effectively mean having a single task in the final stage, since ordering would have to be preserved across tasks, and that is not a scalable design.
For your use case, if you want to do this, add a LIMIT to the query. It will force the last stage to run as a single task, which will result in a single CSV file on S3.
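For illustration, a sketch of that workaround under the same assumptions as the earlier example (bucket, prefix, table, and limit value are placeholders):

```sql
INSERT INTO
  EXTERN(S3(bucket => 'example-bucket', prefix => 'druid-exports/my-query'))
AS CSV
SELECT *
FROM "wikipedia"
-- Per the suggestion above, a LIMIT forces the final stage into a single
-- task, which should yield a single output file.
LIMIT 100000000
```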
Hi @cryptoe, I tested on Druid 29.0.1 with "INSERT INTO EXTERN(S3()) AS CSV" and a SQL statement ending in a very large LIMIT. So far, when the MSQ deep-storage query finishes, I only see one small CSV file, for partition 0.
I'm not sure in which Druid version writing to one single final CSV file in S3 (with LIMIT set in the SQL) is supported. Thanks for your time.