
Commit 87ca73c

Iterate through metadata in chunks
Instead of loading all metadata into memory at once, iterate through fixed-size chunks of data, applying filters to each chunk and either grouping data into priority queues or streaming output to disk, as needed. This approach generally follows the original pseudo-code solution [1]. Users can override the default chunk size with the new `--metadata-chunksize` argument to tune the amount of memory used by a given execution of the filter command; a larger chunk size uses more memory but may also run slightly faster.

One side effect of this implementation is that it allows us to log the reason why each strain was filtered or force-included, via a new `--output-log` argument. One of the output columns of the log file is a `kwargs` column that tracks the arguments passed to a given filter. This column is structured text in JSON format, which allows for more sophisticated reporting by specific keys. Along with this change, we apply the include/exclude logic from files on a per-file basis, so we can track which specific file was responsible for including or filtering each strain.

Note that we do not use a context manager for CSV reading here. In version 1.2, pandas.read_csv was updated to act as a context manager when `chunksize` is passed, but that same version raised the minimum supported Python version to 3.7. As a result, pandas for Python 3.6 does not support the context manager (`with`) usage. Here, we always iterate through the `TextFileReader` object instead of using the context manager, an approach that works in all Python versions.

Finally, this commit changes the subsample seed argument type to an `int` instead of a string or int, to match numpy's requirement for its random generator seed [2]. We do not pass a seed value to the numpy random generator prior to Poisson sampling, or the generator will always sample the same values for a given mean (i.e., all queues will have the same size). Instead, the random seed is used to generate random priorities when none are provided by the user.
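A minimal sketch of the chunked-iteration pattern described above. The function name, column names, and the `exclude` filter here are illustrative placeholders, not Augur's actual API; the point is iterating the `TextFileReader` directly rather than using it as a context manager:

```python
import pandas as pd

def filter_metadata(metadata_path, exclude=(), chunk_size=100_000):
    """Yield strains that pass a simple exclusion filter, reading the
    metadata file in fixed-size chunks instead of all at once."""
    reader = pd.read_csv(metadata_path, sep="\t", chunksize=chunk_size)
    # Iterate through the TextFileReader directly instead of `with reader:`;
    # pandas.read_csv only became a context manager in pandas 1.2, which
    # requires Python 3.7+, so plain iteration works on all supported versions.
    for chunk in reader:
        passed = chunk[~chunk["strain"].isin(exclude)]
        for strain in passed["strain"]:
            yield strain
```

Because the function is a generator, downstream code can stream results to disk or into priority queues without ever holding the full metadata table in memory.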
Fixes #424

[1] #699 (comment)
[2] https://numpy.org/doc/stable/reference/random/generator.html
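The seeding caveat can be demonstrated directly with numpy's `Generator` API (the seed of 42 and mean of 10 are arbitrary example values):

```python
import numpy as np

# Re-seeding a fresh generator before every Poisson draw reproduces the
# same value each time, so every queue would get an identical size:
reseeded = [np.random.default_rng(42).poisson(10) for _ in range(3)]

# Seeding once and reusing the generator keeps runs reproducible while
# still drawing varied sizes across queues:
rng = np.random.default_rng(42)
shared = [rng.poisson(10) for _ in range(3)]
```

This is why the seed is applied once (to random priorities) rather than re-applied before each Poisson sample.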
1 parent 4877de0 commit 87ca73c


4 files changed: +772, −372 lines

