Add host taxonomic categories #8

Merged
merged 5 commits into from
Aug 20, 2024

kimandrews
Contributor

Description of proposed changes

Assigns host taxa to taxonomic groupings that are relevant to rabies for coloring in auspice through the following steps:

  • Extract the Host Taxonomic ID numbers from the metadata output by NCBI Datasets
  • Obtain detailed taxonomic info for each host taxon by inputting the Host Taxonomic ID numbers into the NCBI Datasets Taxonomy Package using the datasets download taxonomy function
  • Add the detailed taxonomic info to the metadata
  • Use a custom python script to assign host taxa to taxonomic categories that are relevant for rabies
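
The join in the third step above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the column names (`host_tax_id`, `tax_id`, `host_group`, `host_family`, `host_genus`) and the toy rows are assumptions made for the example, and the call to `datasets download taxonomy` is left out entirely.

```python
import csv
import io


def unique_host_tax_ids(metadata_rows, id_field="host_tax_id"):
    """Collect the distinct, non-empty host taxon IDs in first-seen order."""
    seen = {}
    for row in metadata_rows:
        tax_id = row.get(id_field, "").strip()
        if tax_id:
            seen.setdefault(tax_id, None)
    return list(seen)


def add_taxonomy(metadata_rows, taxonomy_rows, id_field="host_tax_id",
                 tax_key="tax_id",
                 columns=("host_group", "host_family", "host_genus")):
    """Left-join taxonomy columns onto metadata rows by host taxon ID.

    Rows whose ID is missing from the taxonomy table get empty strings.
    """
    lookup = {row[tax_key]: row for row in taxonomy_rows}
    merged = []
    for row in metadata_rows:
        tax = lookup.get(row.get(id_field, ""), {})
        merged.append({**row, **{col: tax.get(col, "") for col in columns}})
    return merged


# Toy inputs standing in for the NCBI Datasets outputs (hypothetical columns).
metadata = list(csv.DictReader(io.StringIO(
    "accession\thost_tax_id\nA1\t9615\nA2\t9606\nA3\t\n"), delimiter="\t"))
taxonomy = list(csv.DictReader(io.StringIO(
    "tax_id\thost_group\thost_family\thost_genus\n"
    "9615\tcarnivores\tCanidae\tCanis\n"
    "9606\tprimates\tHominidae\tHomo\n"), delimiter="\t"))

merged = add_taxonomy(metadata, taxonomy)
```

Deduplicating the IDs first keeps the taxonomy request small, since many records share the same host.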

NOTE: Testing this PR requires running the ingest workflow and then copy/pasting the output from ingest/results to phylogenetic/data. This is because the ingest output that is updated daily on S3 does not include the modifications that are made by the ingest workflow and used in the phylogenetic workflow in this PR.

An example tree generated with these changes can be viewed here

Related issue(s)

Checklist

  • Checks pass

Contributor

@joverlee521 joverlee521 left a comment

I left some comments, but really glad to see this work out! It'll definitely pave the path for generalized host curation in other pathogen repos.

My only requested change is the use of the log files, everything else is non-blocking.

Comment on lines 37 to 53
if record[args.hostlatin_field] == "Canis lupus familiaris":
    return "Domestic Dog"
elif record[args.hostlatin_field] == "Homo sapiens":
    return "Human"
elif record[args.hostlatin_field] == "Bos taurus":
    return "Cattle"
elif record[args.hostlatin_field] in ["Didelphis albiventris", "Elephas maximus", "Dasypus novemcinctus"]:
    return "Other Mammal"
elif record[args.hostfamily_field] == "Mephitidae":
    return "Skunk"
elif record[args.hostfamily_field] == "Canidae" and record[args.hostgenus_field] == "Vulpes":
    return "Fox (Vulpes sp.)"
elif record[args.hostfamily_field] == "Procyonidae" and record[args.hostgenus_field] == "Procyon":
    return "Raccoon"
else:
    host_group = record[args.hostgroup_field].lower()
    return replacements.get(host_group, record[args.hostgroup_field])
Contributor

code style nitpick

This long if/elif/else can be a little hard to follow. Thoughts on pulling out these individual checks into other dicts like replacements?

(pseudo-code)

latin_replacements = { "Canis lupus familiaris": "Domestic Dog", ... }
family_replacements = { "Mephitidae": "Skunk", ... }
group_replacements = { "odd-toed ungulates": "Other Ungulate", ... } 

latin_field = record[args.hostlatin_field]
family_field = record[args.hostfamily_field]
group_field = record[args.hostgroup_field]

if latin_field in latin_replacements:
    return latin_replacements[latin_field]
elif family_field in family_replacements:
    return family_replacements[family_field]
elif group_field in group_replacements:
    return group_replacements[group_field]
else:
    return group_field

Contributor Author

Done in 618bbd5. Current version still uses elif for situations in which both the family and genus need to match certain values. Happy to incorporate further suggestions to make this more clear.
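
One way the combined family-and-genus checks could also become dictionary lookups is to key a dict on a `(family, genus)` tuple. The sketch below is only an illustration of that idea, not the code in 618bbd5; the function name `host_category` and the constant names are hypothetical, while the mapping values are taken from the snippets quoted in this thread.

```python
# Lookup tables, checked from most specific to least specific.
LATIN = {
    "Canis lupus familiaris": "Domestic Dog",
    "Homo sapiens": "Human",
    "Bos taurus": "Cattle",
    "Didelphis albiventris": "Other Mammal",
    "Elephas maximus": "Other Mammal",
    "Dasypus novemcinctus": "Other Mammal",
}
# Tuple keys cover the cases where family AND genus must both match.
FAMILY_GENUS = {
    ("Canidae", "Vulpes"): "Fox (Vulpes sp.)",
    ("Procyonidae", "Procyon"): "Raccoon",
}
FAMILY = {"Mephitidae": "Skunk"}
GROUP = {
    "odd-toed ungulates": "Other Ungulate",
    "even-toed ungulates & whales": "Other Ungulate",
    "carnivores": "Other Carnivore",
    "bats": "Bat",
    "birds": "Bird",
    "primates": "Other Mammal",
    "rodents": "Other Mammal",
    "mammals": "Other Mammal",
}


def host_category(latin, family, genus, group):
    """Return the host category, falling back to the raw group name."""
    if latin in LATIN:
        return LATIN[latin]
    if (family, genus) in FAMILY_GENUS:
        return FAMILY_GENUS[(family, genus)]
    if family in FAMILY:
        return FAMILY[family]
    # Group names are matched case-insensitively, as in the original script.
    return GROUP.get(group.lower(), group)
```

For example, `host_category("Vulpes vulpes", "Canidae", "Vulpes", "carnivores")` gives the fox label, while a canid of another genus falls through to the group-level category.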

Contributor

Ah, thanks for pointing out the situations that match both family and genus. I totally missed that the first time!
Happy to leave as-is. This did get me thinking that we might be able to generalize this by mimicking the hierarchical geo-location rules 🤔

Contributor

(Not asking for changes in this PR, just throwing out the idea to see if it makes sense!)

The host rules can be formatted like <group>/<family>/<genus>/<species>\t<new_host_label>
Picking a couple examples from your script:

host_hierarchy                  new_host_label
odd-toed ungulates/*/*/*        Other Ungulate
*/Mephitidae/*/*                Skunk
*/Canidae/Vulpes/*              Fox (Vulpes sp.)
*/Procyonidae/Procyon/*         Raccoon
*/*/*/Canis lupus familiaris    Domestic Dog

Then the generalized script would match starting from group down to species.
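
A rough sketch of how such a matcher might work is below. The matching semantics are an assumption, since the comment above does not pin them down: here `*` matches any value at its level, and when several rules match, the one whose deepest non-wildcard level is furthest down the hierarchy wins (ties go to the earlier rule). The function names are hypothetical, and the `carnivores/*/*/*` rule is added only to demonstrate the fallback.

```python
WILDCARD = "*"

# Tab-separated rules in <group>/<family>/<genus>/<species>\t<label> form.
RULES_TEXT = "\n".join([
    "odd-toed ungulates/*/*/*\tOther Ungulate",
    "carnivores/*/*/*\tOther Carnivore",  # not in the thread; added for the demo
    "*/Mephitidae/*/*\tSkunk",
    "*/Canidae/Vulpes/*\tFox (Vulpes sp.)",
    "*/Procyonidae/Procyon/*\tRaccoon",
    "*/*/*/Canis lupus familiaris\tDomestic Dog",
])


def parse_rules(text):
    """Parse rule lines into (levels, label) pairs, skipping blank lines."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy, label = line.split("\t", 1)
        levels = tuple(hierarchy.split("/"))
        if len(levels) != 4:
            raise ValueError(f"expected 4 levels, got {len(levels)}: {line!r}")
        rules.append((levels, label.strip()))
    return rules


def specificity(levels):
    """Depth (1-4) of the deepest non-wildcard level; 0 if all wildcards."""
    return max((i + 1 for i, lvl in enumerate(levels) if lvl != WILDCARD),
               default=0)


def match_host(rules, group, family, genus, species, default=None):
    """Return the label of the most specific matching rule, else default."""
    record = (group, family, genus, species)
    best_label, best_depth = default, -1
    for levels, label in rules:
        if all(p == WILDCARD or p == v for p, v in zip(levels, record)):
            depth = specificity(levels)
            if depth > best_depth:  # strict: earlier rule wins ties
                best_label, best_depth = label, depth
    return best_label


rules = parse_rules(RULES_TEXT)
```

With these rules, a domestic dog record matches both the group-level carnivore rule and the species-level rule, and the species-level label is returned.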

Contributor

Created an official proposal for augur curate format-host in nextstrain/augur#1586

@kimandrews kimandrews force-pushed the add-host-categories branch from 44f842b to 2b09644 Compare August 20, 2024 00:56
* Extract the Host Taxonomic ID numbers from the metadata output by NCBI Datasets
* Obtain detailed taxonomic info for each host taxon by inputting the Host Taxonomic ID numbers into the NCBI Datasets Taxonomy Package using the `datasets download taxonomy` function
* Add the detailed taxonomic info to the metadata
Assign host taxa to taxonomic categories that are relevant for rabies using a custom script during the `curate` workflow
Color by the new host category, host latin name, and host common name
Comment on lines +26 to +43
latin_replacements = {
    "Canis lupus familiaris": "Domestic Dog",
    "Homo sapiens": "Human",
    "Bos taurus": "Cattle",
    "Didelphis albiventris": "Other Mammal",
    "Elephas maximus": "Other Mammal",
    "Dasypus novemcinctus": "Other Mammal"}
family_replacements = {"Mephitidae": "Skunk"}
group_replacements = {
    "odd-toed ungulates": "Other Ungulate",
    "even-toed ungulates & whales": "Other Ungulate",
    "carnivores": "Other Carnivore",
    "bats": "Bat",
    "birds": "Bird",
    "primates": "Other Mammal",
    "rodents": "Other Mammal",
    "mammals": "Other Mammal"
}

non-blocking, but this would end up being more generic (and hence easier to ultimately move into augur or re-use in other repos) if these were provided via config files passed as CLI args.
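
As a minimal sketch of what that could look like, the mappings might live in a JSON file whose path is passed on the command line. Everything named here is an assumption for illustration: the `--host-map` flag, the `latin`/`family`/`group` keys, and the helper names do not come from this PR or from augur.

```python
import argparse
import json
import os
import tempfile


def build_parser():
    """CLI with a hypothetical --host-map option pointing at a JSON file."""
    parser = argparse.ArgumentParser(
        description="Assign host categories from a config file.")
    parser.add_argument("--host-map", required=True,
                        help="JSON file with 'latin', 'family', 'group' maps")
    return parser


def load_host_map(path):
    """Load the replacement maps; missing sections default to empty dicts."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    return {key: config.get(key, {}) for key in ("latin", "family", "group")}


def categorize(host_map, latin, family, group):
    """Check latin name, then family, then group; fall back to the group."""
    if latin in host_map["latin"]:
        return host_map["latin"][latin]
    if family in host_map["family"]:
        return host_map["family"][family]
    return host_map["group"].get(group.lower(), group)


# Demo: write a small config to a temporary file and load it via the CLI parser.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "host_map.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump({
            "latin": {"Homo sapiens": "Human"},
            "family": {"Mephitidae": "Skunk"},
            "group": {"bats": "Bat"},
        }, handle)
    args = build_parser().parse_args(["--host-map", config_path])
    host_map = load_host_map(args.host_map)

skunk = categorize(host_map, "Mephitis mephitis", "Mephitidae", "carnivores")
bat = categorize(host_map, "Desmodus rotundus", "Phyllostomidae", "Bats")
```

Keeping the pathogen-specific labels in a file means the script itself carries no rabies-specific names, which is the property that would make it reusable in other repos.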

Contributor

For sure! I wasn't sure how to make the config file generalized, but it just occurred to me that we can borrow the format of the hierarchical geolocation rules.

Contributor

Let's discuss the generic version in nextstrain/augur#1586

@kimandrews kimandrews merged commit 67b7941 into main Aug 20, 2024
5 checks passed
@kimandrews kimandrews deleted the add-host-categories branch August 20, 2024 21:13