-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use io.read_metadata during export #909
Conversation
Replaces a call to the older `utils.read_metadata` function with the newer `io.read_metadata` function while processing metadata for export to an Auspice JSON. This new function returns a pandas DataFrame indexed by the first viable strain name column found in the metadata file (removing this column from the data itself), while the original function returns a dictionary indexed by strain name (keeping the original named column like `strain` or `name` in the data). To avoid changing the downstream code that consumes the metadata, this commit converts the pandas DataFrame to a dictionary that matches the output of the original function. The main advantage here is that the calling code does not need to know what the id column is named, since `io.read_metadata` handles this and indexed the data frame by that column. This commit also adds functional tests for the expected behavior of export v2 with metadata inputs. Fixes #905
Splits up the first load of the metadata (where we check for available id columns) into separate variables for metadata and the first chunk and then explicitly closes the metadata file after the read. This close call is required to prevent seemingly random "unclosed file" warnings from appearing in stderr.
Codecov Report
@@ Coverage Diff @@
## master #909 +/- ##
==========================================
+ Coverage 34.63% 36.81% +2.18%
==========================================
Files 42 42
Lines 6006 6886 +880
Branches 1538 1959 +421
==========================================
+ Hits 2080 2535 +455
- Misses 3853 4221 +368
- Partials 73 130 +57
Continue to review full report at Codecov.
|
Adds a functional test for how we'd like augur export v2 to behave when the user provides metadata without a valid id and then updates the try/except block in export_v2 to catch and print the error from `read_metadata` when it occurs. While we're at it, this commit also prints the error for missing metadata file to stderr instead of stdout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
For our internal code it seems clear that we should move to the augur.io
function. Currently export v1, refine, frequencies, titers and traits still use read_metadata. Each of them look simple to update using a similar pattern to what you've used here, e.g.:
# untested
metadata_df = read_metadata(args.metadata)
columns = metadata_df.columns
meta_tsv = metadata_df.to_dict(orient="index")
for strain in meta_tsv.keys():
meta_tsv[strain]["strain"] = strain
We could then add a deprecation warning to the util function. In a future major release we can then remove:
- augur.utils -> read_metadata
- all of augur.util_support.metadata_file
- tests/util_support/test_metadata_file.py
If this seems sensible we can add the above to a new issue to tackle.
sys.exit(2) | ||
except Exception as error: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add this to a list (or somesuch) as this shouldn't be caught once #903 is implemented. Maybe tagging the issue here is enough as it'll now show up on that issue...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I exercised restraint by not just coding up #903 in this PR 😄 My issue with the implementation here is that the exception raised by read_metadata
is an "expected" exception, but it isn't specific enough to catch by itself here without catching all other exceptions. We should really raise a custom IOException
or similar instead.
Description of proposed changes
Replaces a call to the older
utils.read_metadata
function with the newerio.read_metadata
function while processing metadata for export to an Auspice JSON. This new function returns a pandas DataFrame indexed by the first viable strain name column found in the metadata file (removing this column from the data itself), while the original function returns a dictionary indexed by strain name (keeping the original named column likestrain
orname
in the data).To avoid changing the downstream code that consumes the metadata, this commit converts the pandas DataFrame to a dictionary that matches the output of the original function. The main advantage here is that the calling code does not need to know what the id column is named, since
io.read_metadata
handles this and indexed the data frame by that column.Importantly, the
io.read_metadata
function defines a list of accepted id columns to look for and will raise an exception with a helpful error message when the input metadata has no valid id columns. Commit 45df6bd adds a check for this exception in the augur export v2 command such that the user will get a helpful error message on the command line instead of an uncaught exception. An example from a new functional test demonstrates this behavior:I preferred this implementation over the check for valid id columns in #906 because:
read_metadata
function with a maintained version,read_metadata
),Related issue(s)
Fixes #905
Testing