ENH: Provide helpful error message when metadata file doesn't contain "strain" column #905

corneliusroemer · 2022-04-26T14:29:19Z

A lot of users seem to get the following type of error:

Job 3: Exporting data files for for auspice


    augur export v2 \
        --tree results/global/tree.nwk \
        --metadata data/metadata.tsv \
        --node-data results/global/branch_lengths.json results/global/nt_muts.json results/global/aa_muts.json results/global/subclades.json results/global/clades.json results/global/recency.json results/global/traits.json \
        --auspice-config my_profiles/covid/my_auspice_config.json \
        --include-root-sequence \
        --colors results/global/colors.tsv \
        --lat-longs defaults/lat_longs.tsv \
        --title 'Genomic epidemiology of novel coronavirus - Global subsampling' \
        --description my_profiles/covid/my_description.md \
        --output results/global/ncov_with_accessions.json 2>&1 | tee logs/export_global.txt

    Validating schema of 'results/global/aa_muts.json'...
    Traceback (most recent call last):
      File "/home/charbel/miniconda3/envs/nextstrain/bin/augur", line 10, in <module>
        sys.exit(main())
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
        return augur.run( argv[1:] )
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
        return args.__command__.run(args)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export.py", line 22, in run
        return run_v2(args)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export_v2.py", line 903, in run_v2
        node_data, node_attrs, node_data_names, metadata_names = parse_node_data_and_metadata(T, args.node_data, args.metadata)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export_v2.py", line 863, in parse_node_data_and_metadata
        if node["strain"] in node_attrs: # i.e. this node name is in the tree
    KeyError: 'strain'

https://discussion.nextstrain.org/t/error-in-job-3-exporting-data-files-for-for-auspice/493/4

It's a common discussion topic on our forum and also in the emails we receive.

I think it would help users a lot if we raised a more informative error so that users know directly how to fix it.

Also, we don't seem to have documented the requirement that the metadata must contain a column called `strain` holding the strain names.

Both should be addressed.
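As a rough illustration of the kind of check being asked for, here is a minimal sketch that validates the metadata header up front and fails with a readable message instead of a bare `KeyError`. The names `MissingIdColumnError` and `find_id_column` are hypothetical and not part of Augur; the default of requiring only `strain` mirrors the behaviour described above.

```python
class MissingIdColumnError(Exception):
    """Hypothetical error type for this sketch; not an Augur class."""


def find_id_column(columns, id_columns=("strain",), path="metadata file"):
    """Return the first accepted id column present in `columns`.

    Raises MissingIdColumnError with an actionable message when none is found.
    """
    for candidate in id_columns:
        if candidate in columns:
            return candidate

    expected = ", ".join(repr(name) for name in id_columns)
    found = ", ".join(repr(name) for name in columns) or "(none)"
    raise MissingIdColumnError(
        f"Metadata in {path!r} has no strain name column. "
        f"Expected a column named one of: {expected}. "
        f"Found columns: {found}."
    )


# Example: a header that uses `name` instead of `strain`.
header = "name\tdate\tregion".split("\t")
try:
    find_id_column(header, path="data/metadata.tsv")
except MissingIdColumnError as error:
    message = str(error)
    print(message)
```

Checking the header once, before iterating over rows, means the user sees the file path and the columns that were actually found, which is usually enough to spot a misnamed column.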


corneliusroemer · 2022-04-26T15:05:23Z

Interestingly, when reading in a metadata file, we seem to be OK with either `name` or `strain`, but then in export we suddenly don't accept `name` anymore. That's strange.

Should we remove support for `name`, or make export accept `name` to be in line with metadata_file.py? See:

augur/augur/util_support/metadata_file.py

Lines 6 to 12 in 4b71e7d

    class MetadataFile:
        """
        Represents a CSV or TSV file containing metadata

        The file must contain exactly one of a column named `strain` or `name`,
        which is used to match metadata with samples.
        """

huddlej · 2022-04-26T20:31:05Z

We do support searching for multiple arbitrary strain ids when reading in metadata with the read_metadata function in the io module. This function returns a data frame indexed by the first requested id column that exists in the input. As a result, the calling code can consume the data frame without needing to know what the name of the id column is.
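To make the indexing behaviour concrete, here is a small pandas sketch of the idea. It is not the `io.read_metadata` implementation: the function name `read_metadata_sketch`, its parameters, and the choice to read every value as a string are assumptions made for illustration.

```python
import io

import pandas as pd


def read_metadata_sketch(source, id_columns=("strain", "name"), sep="\t"):
    """Read tabular metadata and index it by the first id column that exists.

    Illustrative only; the names and defaults here are assumptions.
    """
    # Read everything as strings so ids such as "001" are not coerced to numbers.
    metadata = pd.read_csv(source, sep=sep, dtype=str, keep_default_na=False)

    for candidate in id_columns:
        if candidate in metadata.columns:
            # set_index moves the column into the index, removing it from the data.
            return metadata.set_index(candidate)

    raise ValueError(
        f"None of the id columns {id_columns!r} were found in the metadata; "
        f"found columns: {list(metadata.columns)!r}"
    )


# Example: the file uses `name`, yet the caller never has to mention it.
tsv = "name\tdate\nA\t2020-01-01\nB\t2020-02-01\n"
metadata = read_metadata_sketch(io.StringIO(tsv))
for strain_id, row in metadata.iterrows():
    print(strain_id, row["date"])
```

Because the id ends up in the index, downstream code can iterate over the rows or look up a strain with `metadata.loc[...]` regardless of what the column was called in the file.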

An alternate solution to #906 is to use io.read_metadata in the export module instead of the current call to utils.read_metadata. We could cast the data frame to a dict to avoid changing other code in the module or we could update the logic in parse_node_data_and_metadata to use the data frame. We should really deprecate the utils.read_metadata function, anyway, since io.read_metadata was written to replace it eventually.

Replaces a call to the older `utils.read_metadata` function with the newer `io.read_metadata` function while processing metadata for export to an Auspice JSON. This new function returns a pandas DataFrame indexed by the first viable strain name column found in the metadata file (removing this column from the data itself), while the original function returns a dictionary indexed by strain name (keeping the original named column like `strain` or `name` in the data). To avoid changing the downstream code that consumes the metadata, this commit converts the pandas DataFrame to a dictionary that matches the output of the original function. The main advantage here is that the calling code does not need to know what the id column is named, since `io.read_metadata` handles this and indexes the data frame by that column. This commit also adds functional tests for the expected behavior of export v2 with metadata inputs. Fixes #905
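The DataFrame-to-dictionary conversion described in the commit message could look roughly like the sketch below. This is not the code from the commit: the helper name `metadata_frame_to_dict` is hypothetical, and storing the id back under a `strain` key is an assumption about what the downstream code expects.

```python
import pandas as pd


def metadata_frame_to_dict(metadata, id_key="strain"):
    """Convert an id-indexed DataFrame into {id: {column: value, ...}}.

    The id is written back into each record under `id_key` (an assumption),
    so code that reads record["strain"] keeps working.
    """
    # orient="index" yields one inner dict per row, keyed by the index value.
    records = metadata.to_dict(orient="index")
    for strain_id, record in records.items():
        record[id_key] = strain_id
    return records


# Example: a frame indexed by a column that was called `name` in the file.
frame = pd.DataFrame(
    {"name": ["A", "B"], "date": ["2020-01-01", "2020-02-01"]}
).set_index("name")

records = metadata_frame_to_dict(frame)
print(records["A"])
```

Keeping the conversion in one small helper leaves the rest of the export logic untouched, at the cost of holding the metadata in memory twice for a moment.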

corneliusroemer added the enhancement, help wanted, documentation, good first issue, and easy problem labels Apr 26, 2022

corneliusroemer added this to Nextstrain planning (archived) Apr 26, 2022

corneliusroemer moved this to New in Nextstrain planning (archived) Apr 26, 2022

corneliusroemer mentioned this issue Apr 26, 2022

fix: accept "name" column for strain names in export #906

Closed

victorlin moved this from New to Prioritized in Nextstrain planning (archived) Apr 27, 2022

victorlin moved this from Prioritized to In Progress in Nextstrain planning (archived) Apr 27, 2022

huddlej mentioned this issue Apr 28, 2022

Use io.read_metadata during export #909

Merged

2 tasks

huddlej closed this as completed in #909 May 5, 2022

Repository owner moved this from In Progress to Done in Nextstrain planning (archived) May 5, 2022
