Add host taxonomic categories #8
I left some comments, but really glad to see this work out! It'll definitely pave the path for generalized host curation in other pathogen repos.
My only requested change is the use of the log files; everything else is non-blocking.
if record[args.hostlatin_field] == "Canis lupus familiaris":
    return "Domestic Dog"
elif record[args.hostlatin_field] == "Homo sapiens":
    return "Human"
elif record[args.hostlatin_field] == "Bos taurus":
    return "Cattle"
elif record[args.hostlatin_field] in ["Didelphis albiventris", "Elephas maximus", "Dasypus novemcinctus"]:
    return "Other Mammal"
elif record[args.hostfamily_field] == "Mephitidae":
    return "Skunk"
elif record[args.hostfamily_field] == "Canidae" and record[args.hostgenus_field] == "Vulpes":
    return "Fox (Vulpes sp.)"
elif record[args.hostfamily_field] == "Procyonidae" and record[args.hostgenus_field] == "Procyon":
    return "Raccoon"
else:
    host_group = record[args.hostgroup_field].lower()
    return replacements.get(host_group, record[args.hostgroup_field])
code style nitpick
This long `if/elif/else` can be a little hard to follow. Thoughts on pulling out these individual checks into other dicts like `replacements`?
(pseudo-code)
latin_replacements = { "Canis lupus familiaris": "Domestic Dog", ... }
family_replacements = { "Mephitidae": "Skunk", ... }
group_replacements = { "odd-toed ungulates": "Other Ungulate", ... }

latin_field = record[args.hostlatin_field]
family_field = record[args.hostfamily_field]
group_field = record[args.hostgroup_field]

if latin_field in latin_replacements:
    return latin_replacements[latin_field]
elif family_field in family_replacements:
    return family_replacements[family_field]
elif group_field in group_replacements:
    return group_replacements[group_field]
else:
    return group_field
Done in 618bbd5. The current version still uses `elif` for situations in which both the family and genus need to match certain values. Happy to incorporate further suggestions to make this clearer.
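One possible way to fold the family-plus-genus cases into the same dict pattern is to key a dict on a `(family, genus)` tuple. This is only a minimal sketch: the record keys (`host_latin`, `host_family`, `host_genus`, `host_group`) stand in for the script's `args.*_field` values, the dicts are abbreviated, and `family_genus_replacements` and `categorize_host` are hypothetical names, not what 618bbd5 does.

```python
# Abbreviated lookup tables; the real script defines more entries.
latin_replacements = {
    "Canis lupus familiaris": "Domestic Dog",
    "Homo sapiens": "Human",
}
# Hypothetical table keyed on (family, genus) for rules that need both to match.
family_genus_replacements = {
    ("Canidae", "Vulpes"): "Fox (Vulpes sp.)",
    ("Procyonidae", "Procyon"): "Raccoon",
}
family_replacements = {"Mephitidae": "Skunk"}
group_replacements = {"carnivores": "Other Carnivore", "bats": "Bat"}


def categorize_host(record):
    """Return a host category, checking the most specific level first."""
    latin = record["host_latin"]
    family = record["host_family"]
    genus = record["host_genus"]
    group = record["host_group"]

    if latin in latin_replacements:
        return latin_replacements[latin]
    if (family, genus) in family_genus_replacements:
        return family_genus_replacements[(family, genus)]
    if family in family_replacements:
        return family_replacements[family]
    # Group names are matched case-insensitively; unknown groups pass through.
    return group_replacements.get(group.lower(), group)
```

Checking the tuple before the family-only dict keeps a more specific rule from being shadowed if a family-wide rule for the same family is added later.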
Ah, thanks for pointing out the situations that match both family and genus. I totally missed that the first time!
Happy to leave as-is. This did get me thinking that we might be able to generalize this by mimicking the hierarchical geolocation rules 🤔
(Not asking for changes in this PR, just throwing out the idea to see if it makes sense!)
The host rules can be formatted like <group>/<family>/<genus>/<species>\t<new_host_label>
Picking a couple examples from your script:
| host_hierarchy | new_host_label |
|---|---|
| odd-toed ungulates/\*/\*/\* | Other Ungulate |
| \*/Mephitidae/\*/\* | Skunk |
| \*/Canidae/Vulpes/\* | Fox (Vulpes sp.) |
| \*/Procyonidae/Procyon/\* | Raccoon |
| \*/\*/\*/Canis lupus familiaris | Domestic Dog |
Then the generalized script would match starting from group down to species.
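The rule format above could be matched roughly as follows. This is a hedged sketch, not augur's implementation: `parse_rules`, `specificity`, and `format_host` are hypothetical names, values are compared exactly (no case folding), and the tie-breaking policy (the rule whose deepest non-wildcard level is furthest down the hierarchy wins) is an assumption about what "group down to species" should mean.

```python
WILDCARD = "*"


def parse_rules(lines):
    """Parse '<group>/<family>/<genus>/<species>\\t<new_host_label>' lines."""
    rules = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        hierarchy, label = line.split("\t")
        levels = tuple(hierarchy.split("/"))
        if len(levels) != 4:
            raise ValueError(f"Expected 4 levels in rule: {line!r}")
        rules.append((levels, label))
    return rules


def specificity(levels):
    """1-based position of the deepest non-wildcard level (0 if all wildcards)."""
    return max((i + 1 for i, lvl in enumerate(levels) if lvl != WILDCARD), default=0)


def format_host(host, rules, default=None):
    """Return the label of the most specific rule matching (group, family, genus, species)."""
    best = None
    for levels, label in rules:
        matches = all(p == WILDCARD or p == v for p, v in zip(levels, host))
        if matches and (best is None or specificity(levels) > specificity(best[0])):
            best = (levels, label)
    return best[1] if best else default


RULES = parse_rules([
    "odd-toed ungulates/*/*/*\tOther Ungulate",
    "*/Mephitidae/*/*\tSkunk",
    "*/Canidae/Vulpes/*\tFox (Vulpes sp.)",
    "*/Procyonidae/Procyon/*\tRaccoon",
    "*/*/*/Canis lupus familiaris\tDomestic Dog",
])
```

With these rules, `format_host(("carnivores", "Canidae", "Vulpes", "Vulpes vulpes"), RULES)` picks the genus-level fox rule, and a host that matches nothing falls back to `default`.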
Created an official proposal for `augur curate format-host` in nextstrain/augur#1586
44f842b to 2b09644
* Extract the Host Taxonomic ID numbers from the metadata output by NCBI Datasets
* Obtain detailed taxonomic info for each host taxon by inputting the Host Taxonomic ID numbers into the NCBI Datasets Taxonomy Package using the `datasets download taxonomy` function
* Add the detailed taxonomic info to the metadata
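The first and last of these steps amount to collecting unique IDs and then joining on them. A minimal sketch, assuming the metadata and taxonomy tables are already loaded as lists of dicts; the column names `host_tax_id`, `tax_id`, and `family`, and the function names, are hypothetical, and the `datasets download taxonomy` call itself is not shown.

```python
import csv
import io


def host_tax_ids(metadata_rows, id_field="host_tax_id"):
    """Collect the unique, non-empty host taxon IDs found in the metadata."""
    return sorted({row[id_field] for row in metadata_rows if row.get(id_field)})


def add_taxonomy(metadata_rows, taxonomy_rows, id_field="host_tax_id", tax_id_field="tax_id"):
    """Left-join taxonomy columns onto each metadata row by taxon ID."""
    taxonomy_by_id = {row[tax_id_field]: row for row in taxonomy_rows}
    extra_fields = [f for f in (taxonomy_rows[0] if taxonomy_rows else {}) if f != tax_id_field]
    merged = []
    for row in metadata_rows:
        taxon = taxonomy_by_id.get(row.get(id_field, ""), {})
        # Rows without a matching taxon get empty strings so columns stay consistent.
        merged.append({**row, **{f: taxon.get(f, "") for f in extra_fields}})
    return merged


# Tiny in-memory example standing in for the real TSV files.
metadata = list(csv.DictReader(io.StringIO(
    "accession\thost_tax_id\nA1\t9615\nA2\t9606\nA3\t\n"), delimiter="\t"))
taxonomy = list(csv.DictReader(io.StringIO(
    "tax_id\tfamily\n9615\tCanidae\n9606\tHominidae\n"), delimiter="\t"))
merged = add_taxonomy(metadata, taxonomy)
```

Here `host_tax_ids(metadata)` yields the IDs to look up, and `merged` carries a `family` column for every record, left empty where no host ID was recorded.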
Assign host taxa to taxonomic categories that are relevant for rabies using a custom script during the `curate` workflow
Color by the new host category, host latin name, and host common name
2b09644 to a245d95
latin_replacements = {
    "Canis lupus familiaris": "Domestic Dog",
    "Homo sapiens": "Human",
    "Bos taurus": "Cattle",
    "Didelphis albiventris": "Other Mammal",
    "Elephas maximus": "Other Mammal",
    "Dasypus novemcinctus": "Other Mammal",
}
family_replacements = {"Mephitidae": "Skunk"}
group_replacements = {
    "odd-toed ungulates": "Other Ungulate",
    "even-toed ungulates & whales": "Other Ungulate",
    "carnivores": "Other Carnivore",
    "bats": "Bat",
    "birds": "Bird",
    "primates": "Other Mammal",
    "rodents": "Other Mammal",
    "mammals": "Other Mammal",
}
non-blocking, but this would end up being more generic (and hence easier to ultimately move into augur or re-use in other repos) if these were provided via config files passed as CLI args.
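As a rough illustration of that suggestion, the replacement dicts could be read from two-column TSV files named on the command line. This is a sketch under stated assumptions: the option names (`--latin-replacements`, `--family-replacements`, `--group-replacements`), the TSV layout, and the helper names are all hypothetical and not part of this PR or of augur.

```python
import argparse
import csv


def load_replacements(path):
    """Read a two-column TSV (original value, new label) into a dict."""
    replacements = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines, malformed rows, and '#' comment lines.
            if len(row) < 2 or row[0].startswith("#"):
                continue
            replacements[row[0]] = row[1]
    return replacements


def build_parser():
    """Build a parser with one optional config file per taxonomic level."""
    parser = argparse.ArgumentParser(description="Assign host categories from config files.")
    for level in ("latin", "family", "group"):
        parser.add_argument(
            f"--{level}-replacements",
            metavar="TSV",
            help=f"TSV mapping host {level} values to category labels",
        )
    return parser


def load_all(args):
    """Return a dict of replacement tables, empty where no file was given."""
    return {
        level: load_replacements(path) if path else {}
        for level, path in (
            ("latin", args.latin_replacements),
            ("family", args.family_replacements),
            ("group", args.group_replacements),
        )
    }
```

Keeping the tables in files means another pathogen repo could reuse the same script with its own category labels.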
For sure! I wasn't sure how to make the config file generalized, but it just occurred to me that we can borrow the format of the hierarchical geolocation rules.
Let's discuss the generic version in nextstrain/augur#1586
Description of proposed changes
Assigns host taxa to taxonomic groupings that are relevant to rabies for coloring in auspice through the following steps:
`datasets download taxonomy` function
NOTE: Testing this PR requires running the ingest workflow and then copy/pasting the output from ingest/results to phylogenetic/data. This is because the ingest output that is updated daily on S3 does not include the modifications that are made by the ingest workflow and used in the phylogenetic workflow in this PR.
An example tree generated with these changes can be viewed here
Related issue(s)
Checklist