Add GZIP support for JSON in S3 sink connector #1028
Conversation
The changes look good to me. Thank you very much for the contribution 👍 Build is not currently passing; I think you just need to run the formatter. It would be nice if there were some automated tests covering this code.
@davidsloan Sounds good, thanks for the quick response. Ran it locally.
The CI failures are due to the missing codec implicit in the tests; fixing them shortly.
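For context, a "missing codec implicit" failure looks roughly like the sketch below; the names here are illustrative, not the PR's actual types:

```scala
// Illustrative only: a function that takes the codec implicitly fails to
// compile in any test file that does not have one in scope.
final case class CompressionCodec(name: String)

def writeRecord(data: String)(implicit codec: CompressionCodec): String =
  s"${codec.name}: $data"

// Supplying the implicit resolves the "could not find implicit value" error.
implicit val gzip: CompressionCodec = CompressionCodec("gzip")
writeRecord("""{"key":"value"}""")
```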
Force-pushed from dd593c1 to b1723e8
Fixed, verified with local tests.
Force-pushed from ba0b727 to cb4d268
@davidsloan Fixed CI & added two tests covering the compressed output stream writes. Let me know if there's anything else required here, or if you have any remaining questions/concerns.
Made a small mistake when updating the function signature to include the codec type -- didn't add the import; fixed it in the latest commit & verified with local tests.
@brandon-powers I couldn't see any tests around file naming with gzip; I think that would be useful. Apologies if I've missed them. Otherwise it looks good to me, are you happy for me to merge it into master?
@davidsloan Good call, I broke out the file extension naming into its own namer and added tests around it. Going to run a final test locally, and as long as everything looks good + CI is green, 👍 to merge. I'll comment again here when it's all set.
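As a sketch of the kind of file-naming coverage discussed here (class and helper names are hypothetical, not the PR's actual API):

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class FileExtensionNamerTest extends AnyFlatSpec with Matchers {

  // Hypothetical helper mirroring the idea under test: JSON + GZIP
  // changes the file extension, anything else keeps the format's own.
  private def extensionFor(format: String, gzip: Boolean): String =
    if (gzip && format == "json") "gz" else format

  "the namer" should "use .gz for GZIP-compressed JSON" in {
    extensionFor("json", gzip = true) shouldBe "gz"
  }

  it should "keep .json when compression is disabled" in {
    extensionFor("json", gzip = false) shouldBe "json"
  }
}
```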
Force-pushed from fc3b9ba to 5482c40
Last CI trigger failed due to a formatting issue in the new namer file; fixed it & amended the last commit.
@davidsloan 👍 to merge, so long as you have no further concerns. For context, I ran the following local verification test:
$ aws s3 ls s3://test-bucket/
PRE .indexes/
PRE compressed/
PRE sink-topic-a/
$ aws s3 ls s3://test-bucket/sink-topic-a/1/
2024-02-20 15:16:07 100538 000000000059.json
2024-02-20 15:16:09 100442 000000000120.json
...
$ aws s3 ls s3://test-bucket/compressed/sink-topic-a/1/
2024-02-20 15:17:20 89902 000000004999.gz
$ aws s3 cp s3://test-bucket/compressed/sink-topic-a/1/000000004999.gz .
$ gzip -d 000000004999.gz
$ cat 000000004999 | wc -l
5000
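For reference, a rough Scala equivalent of the gzip -d / wc -l check above, using only the JDK; it assumes the object has already been copied locally as in the session:

```scala
import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// Decompress the downloaded object and count its lines; the session above
// expects 5000 JSON records, one per line.
val in    = new GZIPInputStream(new FileInputStream("000000004999.gz"))
val lines = Source.fromInputStream(in, "UTF-8").getLines().size
println(lines) // 5000
in.close()
```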
Solves #875 for the AWS S3 sink connector; planning on adding support for the source connector in a follow-up PR.
This PR makes the following changes:
- Moved `CompressionCodecSettings` from `S3SinkConfigDefBuilder` to `CloudSinkConfigDefBuilder` so that it can be used in `CloudSinkBucketOptions` to update the file extension in the `FileNamer` if the specified format and compression configuration require it (e.g. JSON and GZIP).
- Moved data conversion from `JsonFormatWriter` to `ToJsonDataConverter`.
- Updated the `CompressionCodec` class with an `extension` field for format + compression combinations that require a file extension change. Propagated the changes.
- Used the `GzipCompressorOutputStream` class in `JsonFormatWriter`; this enables a configurable compression level & remains consistent with existing connector configuration fields (see the sketch below).
- Fixed `TextFormatStreamReaderTest` naming.

Note: theoretically, this should also add support for GZIP/JSON compression in GCP + Azure, though I've only tested it with S3.
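For illustration, here is a minimal Scala sketch of the two ideas above: a codec carrying an optional extension override, and writing gzipped JSON through Apache Commons Compress. All names besides `GzipCompressorOutputStream` and `GzipParameters` (which are real Commons Compress classes) are hypothetical, not the connector's actual API.

```scala
import java.io.ByteArrayOutputStream
import org.apache.commons.compress.compressors.gzip.{GzipCompressorOutputStream, GzipParameters}

// Hypothetical shape of a codec with an extension override for
// format + compression combinations that need one (e.g. JSON + GZIP).
final case class CompressionCodec(name: String, level: Int, extension: Option[String] = None)

// Hypothetical namer helper: fall back to the format's own extension
// when the codec does not require an override.
def fileExtension(formatExtension: String, codec: CompressionCodec): String =
  codec.extension.getOrElse(formatExtension)

val gzipJson = CompressionCodec("gzip", level = 9, extension = Some("gz"))
assert(fileExtension("json", gzipJson) == "gz")

// Writing a JSON line through GzipCompressorOutputStream with a
// configurable compression level.
val params = new GzipParameters()
params.setCompressionLevel(gzipJson.level) // java.util.zip.Deflater levels 0-9

val buffer = new ByteArrayOutputStream()
val out    = new GzipCompressorOutputStream(buffer, params)
out.write("""{"key":"value"}""".getBytes("UTF-8"))
out.write('\n')
out.close() // flushes the gzip trailer; buffer now holds the compressed bytes
```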
Modules affected:
kafka-connect-cloud-common
kafka-connect-aws-s3
kafka-connect-azure-datalake
kafka-connect-gcp-storage