fix: s5cmd sync logic #110


vkt1414
Collaborator

@vkt1414 vkt1414 commented Aug 7, 2024

completes #100

  • fixes and improves the s5cmd sync download logic by using the validation results

  • picks the 5 smallest series from previous IDC versions

  • rigorously tests data from prior versions with all combinations of optional parameters

It may be better to create a new issue to support #100 (comment). So far I do not know an 'elegant' way to support it.

vkt1414 added 5 commits August 7, 2024 18:16
when dirTemplate is None, the hierarchy is constructed
using a SQL CONCAT statement, because using downloadDir
directly caused issues with DuckDB parsing the query
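
A minimal sketch of the CONCAT approach described in this commit message. The PR text does not show the actual query, so the helper name `sql_concat_path` and the column names used here are hypothetical; the sketch only builds the expression text and does not run it against DuckDB.

```python
def sql_concat_path(download_dir, columns, sep="/"):
    """Build a SQL CONCAT(...) expression joining a literal directory with column values.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Escape single quotes so the directory is a valid SQL string literal.
    literal = "'" + download_dir.replace("'", "''") + "'"
    parts = [literal]
    for col in columns:
        parts.append("'" + sep + "'")  # separator as a SQL string literal
        parts.append(col)              # column reference, left unquoted
    return "CONCAT(" + ", ".join(parts) + ")"


# Hypothetical column names; the real hierarchy may differ.
expr = sql_concat_path("/data/idc", ["collection_id", "PatientID", "SeriesInstanceUID"])
print(expr)
```

Passing the directory as an escaped string literal inside CONCAT keeps the rest of the query independent of whatever characters the directory name contains.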

index_crdc_series_uuid column is added to the validation
query, as this UUID helps calculate the required
download when using s5cmd sync

improve efficiency by reducing the number of columns read
when using prior index

remove the dedicated query to build paths and instead
construct download paths directly in the validation query

return one additional output, a dataframe, from the validation
function; it is passed to s5cmd_run. This df is
helpful when s5cmd sync is used and the download size needs to
be calculated, and it eliminates redundancy.
The dataframe contains the "index_crdc_series_uuid", "s5cmd_cmd",
"series_size_MB", and "path" columns

The above df is passed to the _s5cmd_run method as
s5cmd_sync_helper_df

simplify the _parse_s5cmd_sync_output_and_generate_synced_manifest
method by using s5cmd_sync_helper_df
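
A rough sketch of how a helper table with the columns named in these commits could be used to estimate the remaining download size for a sync. This is not the PR's code: plain dicts stand in for the dataframe, the function names are hypothetical, and it assumes that the dry-run output consists of `cp <s3 url> <destination>` lines in which the first path component of the S3 key is the series UUID.

```python
from urllib.parse import urlparse

# Hypothetical rows mirroring the helper dataframe's columns.
helper_rows = [
    {"index_crdc_series_uuid": "aaaa-1111", "s5cmd_cmd": "sync s3://example-bucket/aaaa-1111/* /data/a",
     "series_size_MB": 1.5, "path": "/data/a"},
    {"index_crdc_series_uuid": "bbbb-2222", "s5cmd_cmd": "sync s3://example-bucket/bbbb-2222/* /data/b",
     "series_size_MB": 2.0, "path": "/data/b"},
    {"index_crdc_series_uuid": "cccc-3333", "s5cmd_cmd": "sync s3://example-bucket/cccc-3333/* /data/c",
     "series_size_MB": 4.25, "path": "/data/c"},
]


def series_uuids_from_dry_run(stdout):
    """Collect series UUIDs from 'cp s3://bucket/<uuid>/<file> <dest>' lines (assumed format)."""
    uuids = set()
    for line in stdout.splitlines():
        tokens = line.split()
        if len(tokens) < 3 or tokens[0] != "cp":
            continue  # ignore anything that is not a copy operation
        src = next((t for t in tokens[1:] if t.startswith("s3://")), None)
        if src is None:
            continue
        key = urlparse(src).path.lstrip("/")  # key without the bucket name
        uuids.add(key.split("/")[0])          # first path component = series UUID (assumed)
    return uuids


def pending_download_mb(rows, pending_uuids):
    """Sum the sizes of series that still have at least one file to copy."""
    return sum(r["series_size_MB"] for r in rows if r["index_crdc_series_uuid"] in pending_uuids)


dry_run_output = """\
cp s3://example-bucket/aaaa-1111/1.dcm /data/a/1.dcm
cp s3://example-bucket/cccc-3333/1.dcm /data/c/1.dcm
"""
pending = series_uuids_from_dry_run(dry_run_output)
print(sorted(pending), pending_download_mb(helper_rows, pending))
```

The sketch counts a series at its full size as soon as one of its files is pending, so it gives an upper bound rather than an exact byte count.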

test manifest is edited to pick the 5 smallest series from
previous index
@vkt1414 vkt1414 force-pushed the feat-add-metadata-for-series-not-in-latest-index-for-idc-2 branch from da9e95a to 95d555f Compare August 8, 2024 02:06
@vkt1414 vkt1414 marked this pull request as ready for review August 8, 2024 02:44
@vkt1414 vkt1414 requested a review from fedorov August 8, 2024 03:10
@fedorov fedorov merged commit 9f7615b into ImagingDataCommons:main Sep 29, 2024
10 checks passed
2 participants