Commit 87ca73c
Iterate through metadata in chunks
Instead of loading all metadata into memory at once, iterate through
fixed-size chunks of data, applying filters to these data and either
grouping data into priority queues or streaming output to disk, as
needed. This approach generally follows the original pseudo-code
solution [1]. Users can override the default chunk size with the new
`--metadata-chunksize` argument to tune the amount of memory used by a
given execution of the filter command. A larger chunk size uses more
memory but may also run slightly faster.
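The chunked approach described above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library rather than pandas, and it is not the code from this commit: the helper names `iter_chunks` and `top_strains`, the column names, and the "drop rows without a date" filter are all hypothetical. It shows how only one chunk plus a bounded priority queue needs to be in memory at a time.

```python
import csv
import heapq
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def top_strains(handle, chunk_size, max_size):
    """Keep the max_size highest-priority strains, reading one chunk at a time."""
    reader = csv.DictReader(handle, delimiter="\t")
    queue = []  # min-heap of (priority, strain), never longer than max_size
    for chunk in iter_chunks(reader, chunk_size):
        # Hypothetical filter: drop rows with an empty date.
        kept = [row for row in chunk if row["date"]]
        for row in kept:
            item = (float(row["priority"]), row["strain"])
            if len(queue) < max_size:
                heapq.heappush(queue, item)
            else:
                # Add the new item, then discard the lowest-priority one.
                heapq.heappushpop(queue, item)
    return sorted(strain for _, strain in queue)


# Hypothetical metadata, held in memory only to keep the example runnable.
metadata = io.StringIO(
    "strain\tdate\tpriority\n"
    "A\t2020-01-01\t0.2\n"
    "B\t\t0.9\n"
    "C\t2020-02-01\t0.7\n"
    "D\t2020-03-01\t0.5\n"
    "E\t2020-04-01\t0.1\n"
)
selected = top_strains(metadata, chunk_size=2, max_size=2)
print(selected)
```

A larger `chunk_size` means fewer passes through the outer loop at the cost of holding more rows in memory at once, which is the trade-off the `--metadata-chunksize` argument exposes.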
One side-effect of this implementation is that it allows us to log the
reason why each strain was filtered or force-included to a file named by
the new `--output-log` argument. One of the output columns of the log file is a
kwargs column that tracks the argument passed to a given filter. This
column is structured text in JSON format which allows for more
sophisticated reporting by specific keys.
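As an illustration of why a JSON-formatted kwargs column is useful, here is a small sketch that writes and then re-reads such a log. Only the existence of a kwargs column comes from the description above; the other column names (`strain`, `filter`), the filter names, the key names, and the tab-delimited layout are assumptions made for this example.

```python
import csv
import io
import json

# Write a hypothetical log with one JSON-encoded kwargs value per row.
log = io.StringIO()
writer = csv.writer(log, delimiter="\t", lineterminator="\n")
writer.writerow(["strain", "filter", "kwargs"])
writer.writerow(["A", "filter_by_date", json.dumps({"min_date": "2020-01-01"})])
writer.writerow(["B", "exclude_from_file", json.dumps({"exclude_file": "exclude.txt"})])

# Read the log back and report strains by a specific key inside kwargs.
log.seek(0)
by_file = {}
for record in csv.DictReader(log, delimiter="\t"):
    kwargs = json.loads(record["kwargs"])
    if "exclude_file" in kwargs:
        by_file.setdefault(kwargs["exclude_file"], []).append(record["strain"])
print(by_file)
```

Because the column parses into a dictionary, a report can select on individual keys instead of matching against free text.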
Along with this change, we apply the include/exclude logic from files
per file so we can track which specific file was responsible for
including or filtering each strain.
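The difference between merging all include files up front and applying them one at a time can be sketched like this. The function name `apply_include_files` and the file names are hypothetical, and file contents are passed in as lists so the example stays self-contained.

```python
def apply_include_files(include_files):
    """Map each force-included strain to the first file that listed it.

    include_files maps a file name to the strain names found in that file.
    """
    included_by = {}
    for file_name, strains in include_files.items():
        for strain in strains:
            # Keep the first file responsible for including this strain.
            included_by.setdefault(strain, file_name)
    return included_by


included_by = apply_include_files(
    {
        "include_a.txt": ["A", "B"],
        "include_b.txt": ["B", "C"],
    }
)
print(included_by)
```

Taking the union of all strains first would give the same set of included strains but would lose the file name needed for the log.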
Note that we don't use a context manager for CSV reading here. In version
1.2, pandas.read_csv was updated to return a context manager when
`chunksize` is passed but this same version increased the minimum Python
version supported to 3.7. As a result, pandas for Python 3.6 does not
support the context manager `with` usage. Here, we always iterate
through the `TextFileReader` object instead of using the context
manager, an approach that works in all Python versions.
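A minimal sketch of the iteration pattern described above, assuming pandas is installed and using made-up metadata held in a string buffer. Iterating directly over the object returned by `read_csv` works whether or not the installed pandas version supports the `with` form.

```python
import io

import pandas as pd

# Hypothetical tab-delimited metadata.
metadata = io.StringIO(
    "strain\tdate\n"
    "A\t2020-01-01\n"
    "B\t2020-02-01\n"
    "C\t2020-03-01\n"
)

# With chunksize set, read_csv returns a TextFileReader instead of a DataFrame.
reader = pd.read_csv(metadata, sep="\t", chunksize=2)

chunk_sizes = []
strains = []
for chunk in reader:  # each chunk is a DataFrame with at most 2 rows
    chunk_sizes.append(len(chunk))
    strains.extend(chunk["strain"])
print(chunk_sizes, strains)
```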
Finally, this commit changes the subsample seed argument type to an
`int` instead of string or int to match numpy's requirement for its
random generator seed [2]. We do not pass a seed value to numpy's random
generator prior to Poisson sampling; otherwise the generator would always
sample the same values for a given mean (i.e., all queues would have the
same size). Instead, we use the random seed for generating random
priorities when none are provided by the user.
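The seeding behavior described above can be demonstrated with a short sketch, assuming numpy is installed. The seed value, mean, and number of queues are arbitrary, and the structure (one generator for queue sizes, another for priorities) is an illustration of the idea rather than the code from this commit.

```python
import numpy as np

seed = 314159  # arbitrary integer seed, standing in for the user's value
mean_size = 5.0
n_queues = 4

# Creating a freshly seeded generator before each Poisson draw repeats the
# same draw, so every queue would get the same size.
seeded_sizes = [
    int(np.random.default_rng(seed).poisson(mean_size)) for _ in range(n_queues)
]

# An unseeded generator draws a separate size for each queue.
size_rng = np.random.default_rng()
queue_sizes = [int(size_rng.poisson(mean_size)) for _ in range(n_queues)]

# The seed is still useful for reproducible random priorities.
priority_rng = np.random.default_rng(seed)
priorities = priority_rng.random(3)

print(seeded_sizes, queue_sizes, priorities)
```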
Fixes #424
[1] #699 (comment)
[2] https://numpy.org/doc/stable/reference/random/generator.html
1 parent 4877de0, commit 87ca73c
4 files changed (+772, -372 lines) in: augur, tests/functional, filter