Gzip CSV for Benchmark Cmd Line Results #177

Closed
stanbrub opened this issue Oct 4, 2023 · 1 comment · Fixed by #218
Assignees: stanbrub
Labels: enhancement (New feature or request)

Comments
@stanbrub (Collaborator)

stanbrub commented Oct 4, 2023

The CSV files generated for benchmark runs are meant to be readable rather than small, so they contain a lot of redundancy; in the metrics CSV, for example, the benchmark name is repeated on every row. This is not a problem while writing tests and doing local runs, but the redundant data adds up once the files are uploaded to GCloud. Furthermore, with the increasing use of query snippets that download data for queries, transfer size can become an impediment.

Rather than turning the Benchmark results into a database with references to keep the data smaller, save the CSV files as csv.gz when doing command-line runs. A new property to turn this on and off may or may not be necessary: when a test is run in an IDE, don't compress; when it's run from the command line, compress.
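If the Java route is chosen, the compression step could be as simple as streaming each finished result file through the JDK's built-in GZIPOutputStream. A minimal sketch (class name, method, and file paths are illustrative, not the actual Benchmark code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class GzipCsv {

    /** Compress a CSV file to a sibling file.csv.gz using the JDK's gzip stream. */
    public static Path gzip(Path csv) throws IOException {
        Path gz = csv.resolveSibling(csv.getFileName() + ".gz");
        try (InputStream in = Files.newInputStream(csv);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            in.transferTo(out);  // stream copy; avoids loading the CSV into memory
        }
        return gz;
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temp file standing in for a benchmark result CSV
        Path csv = Files.createTempFile("benchmark-metrics", ".csv");
        Files.writeString(csv, "benchmark,metric,value\nJoinTables,rate,123\n");
        Path gz = gzip(csv);
        System.out.println("wrote " + gz.getFileName());
    }
}
```

Doing it in Java keeps the behavior identical whether the run happens in CI or on a developer's machine, whereas gzipping in the GitHub workflow keeps the test code untouched.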

  • Use Java to compress the files, or just run gzip in the GitHub workflow?
  • Compress existing files stored in the GCloud bucket?
    • Can file dates be preserved?
  • Update the Demo rsync workflow step not to use the -J option for compression
    • An added bonus is eliminating the annoying gsutil message about how it's more efficient to compress the files in GCloud than to do it on-the-fly for every download
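Alternatively, the gzip-in-workflow option from the first bullet could look something like the step below; the results directory and the commented-out bucket name are placeholders, not the actual Demo workflow values:

```shell
set -euo pipefail

# Placeholder for the benchmark output directory
RESULTS_DIR="${RESULTS_DIR:-results}"
mkdir -p "$RESULTS_DIR"

# Compress every result CSV in place, producing file.csv.gz
find "$RESULTS_DIR" -name '*.csv' -exec gzip -f {} +

# Then sync without -J, since the files are already compressed at rest, e.g.:
#   gsutil rsync -r "$RESULTS_DIR" gs://example-benchmarks
```

Compressing before upload means gsutil serves the stored bytes as-is instead of re-compressing on every download.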

NOTES

  • read_csv does not handle downloading http(s) files that are compressed
    • In fact, if you specify a local file as "file:///data/file.csv.gz", it won't handle that either
    • If you specify the local file as "/data/file.csv.gz", it does work
  • On the demo server, reading csv.gz vs csv does not improve demo performance
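For a client that fetches the compressed files itself (rather than relying on read_csv's URL handling), decompressing a local .csv.gz is straightforward with the Python standard library. A minimal sketch, with an illustrative function name and path:

```python
import csv
import gzip


def read_gzipped_csv(path):
    """Read a local .csv.gz into a list of rows using only the stdlib.

    gzip.open in text mode ("rt") decompresses on the fly, so csv.reader
    sees plain CSV text; newline="" is the csv module's recommended setting.
    """
    with gzip.open(path, mode="rt", newline="") as f:
        return list(csv.reader(f))
```

This works on a plain local path like "/data/file.csv.gz", matching the form that read_csv also accepts per the note above.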
@stanbrub stanbrub added the enhancement New feature or request label Oct 4, 2023
@stanbrub stanbrub self-assigned this Nov 1, 2023
@stanbrub (Collaborator, Author)

stanbrub commented Nov 6, 2023

Wrote a script that uses Python asyncio to download GCloud files to a cache.
download.py.txt
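The general shape of such a downloader might look like the sketch below; the function names, concurrency limit, and use of urllib are assumptions for illustration, not taken from the attached download.py.txt:

```python
import asyncio
import os
import urllib.request


async def download_one(url: str, dest: str, sem: asyncio.Semaphore) -> str:
    """Fetch one file, skipping files already present in the cache."""
    if os.path.exists(dest):
        return dest
    async with sem:
        loop = asyncio.get_running_loop()
        # urllib is blocking, so run the fetch off the event loop
        await loop.run_in_executor(None, urllib.request.urlretrieve, url, dest)
    return dest


async def download_all(urls, cache_dir, max_concurrent=8):
    """Download all URLs into cache_dir with bounded concurrency."""
    os.makedirs(cache_dir, exist_ok=True)
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [
        download_one(u, os.path.join(cache_dir, os.path.basename(u)), sem)
        for u in urls
    ]
    return await asyncio.gather(*tasks)
```

The semaphore bounds how many transfers run at once, and the cache check makes repeated runs cheap since already-downloaded files are skipped.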

@stanbrub stanbrub linked a pull request Nov 14, 2023 that will close this issue