-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ingest from NCBI Virus + NCBI Datasets #40
Conversation
Prepping for starting the ingest workflow for NCBI data. Copying over the ingest workflow from the pathgoen-repo-guide¹ as a "custom" build for the ingest workflow because the default ingest for avian-flu uses fauna. I will make changes to adopt the workflow to this repo in subsequent commits. ¹ <https://github.com/nextstrain/pathogen-repo-guide/tree/f33c43edd9ebad10aa0e8d2b0791755ddbe2f5c8/ingest>
1. Remove `ingest/build-configs/nbci/vendored` -- We will be using the shared `ingest/vendored` scripts that already exist in this repo. No need to vendored inception! 2. Remove `ingest/build-configs/ncbi/build-configs` -- the NCBI workflow itself is already a "custom" build. No need for build-configs inception! 3. Remove files and rules related to Nextclade. We can add them back later if we want to integrate Nextclade outputs with the metadata. 4. Remove extra geolocation_rules.tsv, just use the shared ingest/defaults/geolocation_rules.tsv since these fixes should be data source agnostic. 5. Remove boilerplate rules related to fetching from NCBI Entrez. 6. Remove boilerplate text from pathogen-repo-guide.
Use the `custom_rules` to include the NCBI ingest Snakefile with the main workflow. Include file paths updates to namespace the NCBI ingest outputs under `ncbi/`.
Using NCBI taxon id `11320` for Influenza A viruses to include all GenBank records that are then further labeled with serotypes (e.g. H5N1). Filtering down with the `--released-after` and `--geo-location` options for NCBI datasets to get a smaller subset of data that centered around the H5N1 outbreak in the U.S. Based on the filter URL for NCBI virus that was shared by @trvrb: <https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&EnvSample_s=include&VirusLineage_ss=Alphainfluenzavirus,%20taxid:197911&Serotype_s=H5N1%20H5N*&CollectionDate_dr=2023-12-31T00:00:00.00Z%20TO%202024-05-08T23:59:59.00Z&Region_s=North%20America>
subrepo: subdir: "ingest/vendored" merged: "0f55d15" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "0f55d15" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"
Copying two scripts that were in nextstrain/ingest before they were removed in <nextstrain/ingest#19> - ncbi-virus-url - fetch-from-ncbi-virus Done in preparation for editing the NCBI Virus URL to be used for ingest because the NCBI Datasets does not include strain, segment, and serotype fields.
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>
Since the API no longer returns the actual genomic sequence, we will need to join the metadata with the sequences FASTA. This is simpler to do with `augur curate passthru` where the metadata is a CSV/TSV file. Then we don't need to copy over the `csv-to-ndjson` script.
Fetch from NCBI Virus to replace the metadata that is downloaded with NCBI Datasets since it includes additional metadata that we want in our builds, i.e. strain, serotype, and segment.
Transform the numeric segments from NCBI Virus into the segment names that are used in fauna.
Following the pattern in fauna and andersen-lab ingest to split the metadata and sequences by segment, then remerging the metadata into single file with an `n_segments` column added.
Running into a couple of GenBank records where their strain name and metadata are exactly the same, but they have different accessions. For example, A/redhead duck/North Carolina/W24-83A/2024 has four different GenBank records: - https://www.ncbi.nlm.nih.gov/nuccore/PP761257 - https://www.ncbi.nlm.nih.gov/nuccore/PP761548 - https://www.ncbi.nlm.nih.gov/nuccore/PP761557 - https://www.ncbi.nlm.nih.gov/nuccore/PP761576 Deduping the the data after they've been split out into segments so that we aren't dropping segments with the same strain name. Sadly, not every record has a BioSample accession, so we can only dedup on strain name.
Add hard-coded values and change metadata column names to match the output of fauna. This will make it easier to run the downstream phylogenetic analysis without making too many changes.
Transforms the scientific name of hosts to the host groupings used in the fauna outputs. Currently just done with manual curation of a host-map.tsv, but there's probably smarter ways to transform the scientific names.
Including the log output because I wanted it to help debug why certain sequences were not getting included in the build.
Makes it easier to use the newly added NCBI ingest data by changing the config param `ingest_source`.
Switch hard-coded strain names to match those from the NCBI ingest. Adding an assert to make sure that we are using the NCBI data source. This currently only supports the local ingest data since I have not added the S3 uploads to the NCBI ingest workflow (yet!). Includes changes to the Auspice config to point to GenBank as the data source. When testing with the new NCBI ingest data, I got a bunch of warnings from `augur export` about filter fields not appearing in the tree: > WARNING: The filter "domestic_status" does not appear as a property on any tree nodes. > WARNING: The filter "h5_label_clade" does not appear as a property on any tree nodes. > WARNING: The filter "gisaid_clade" does not appear as a property on any tree nodes. > WARNING: The filter "authors" does not appear as a property on any tree nodes. > WARNING: The filter "originating_lab" does not appear as a property on any tree nodes. It makes sense that domestic_status, h5_label_clade, gisaid_clade, and originating_lab are not included because they all have "?" as values. Fixing the `author` filter that _should_ be included.
Significantly outgrouped relative to cattle outbreak
4b511fc
to
7a08f1e
Compare
…tput Allows us to include the `genbank_accession` field in the Auspice JSON so that Auspice can automatically generate the link the original GenBank record.
This comment was marked as resolved.
This comment was marked as resolved.
Use a separate description.md for the h5n1-cattle-outbreak to point to the public data sources not used by the usual avian-flu builds. I put this together with the general text from the mpox description.md¹ and the additional text on the H5N1 outbreak from the default avian-flu desciption.md² ¹ <https://github.com/nextstrain/mpox/blob/7f5adc3cab0c3034e455882581cc081f26f5ebbe/phylogenetic/defaults/description.md> ² <https://github.com/nextstrain/avian-flu/blob/master/config/description.md?plain=1#L3>
The strain name contains additional metadata that can supplement the metadata from NCBI. This expects strain names to be formatted as For non-human hosts: A / <host> / <location> / <sample id> / <year> For human hosts: A / <location> / <sample id> / <year> The host and year (as date) are added to the metadata if the original metadata field is empty. The location field is specially handled because the strain name might just contain a more detailed geolocation than the metadata location. In that case, the strain name location is just appended to the existing location field in the GenBank location format <record_location>:<strain_location>.
Including additional hosts were parsed from the strain names following the addition of `parse-metadata-from-strain` in the previous commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; contains several very nitpicky nits picked by a nitwit.
@@ -167,7 +173,7 @@ rule refine: | |||
coalescent = "const", | |||
date_inference = "marginal", | |||
clock_rate = clock_rate, | |||
root_strain = "A/skunk/NewMexico/24006483001/2024" | |||
root_strain = "A/skunk/NewMexico/24-006483-001/2024" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth a comment here about why this strain for the root? feels kinda magic-number-y
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the closest outgroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment in b477716
all_geolocation_rules="ncbi/data/all-geolocation-rules.tsv", | ||
shell: | ||
""" | ||
cat {input.general_geolocation_rules} {input.local_geolocation_rules} >> {output.all_geolocation_rules} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the >>
instead of >
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. This was copied from pathogen-repo-guide, which was based on mpox/ingest and leads back to my own commit from ~2 years ago.
What was I thinking 🙃 My best guess is this is left over from when I had a rule to just append the local_geolocation_rules
to the existing general_geolocation_rules
without creating the additional all_geolocation_rules
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 096ce90
…084466/2024 Just realized/remembered that BioSample can contain more detailed geolocation metadata than the GenBank record. This was the only sequence that I could find that needed the additional location data. Will revisit the joining of BioSample data in a future change if we see that it can fill in a lot more data.
Specify command to run to fetch data from NCBI GenBank
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your script to parse location from strain name was very clever!
This is basically good to go for me. There was just one thing I caught where annotations don't seem to be being applied.
Additional comments in the annotations file need to be separated by a space after the field value. Previously separating comments by a tab resulted in a warning from the `merge-user-metadata` script: > WARNING: Could not decode annotation line ... We can make the `merge-user-metadata` script smarter about parsing comments in the annotations file, but punting that for now.
Adding some comments to clarify the `root_strain`. Thanks @genehack for calling out the magic-number¹ We will probably want to refactor the workflow to pull out these types of values into a config YAML at some point, but I'm punting on that for now. ¹ <#40 (comment)>
Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹ Prompted me to clea up various other nitpicky nits. ¹ <#40 (review)>
Thanks @genehack for pointing it out in <#40 (comment)>
Applying annotation for divison of `A/domesticcat/UnitedStates/24-009311-006/2024` across the board to use '?' for divisions that are unknown.
Adding some comments to clarify the `root_strain`. Thanks @genehack for calling out the magic-number¹ We will probably want to refactor the workflow to pull out these types of values into a config YAML at some point, but I'm punting on that for now. ¹ <#40 (comment)>
Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹ Prompted me to clea up various other nitpicky nits. ¹ <#40 (review)>
Thanks @genehack for pointing it out in <#40 (comment)>
096ce90
to
66e1903
Compare
Description of proposed changes
NCBI Datasets does not include the
strain
,serotype
, andsegment
in the metadata, so this PR uses the NCBI Virusvvsearch2
API to pull down the metadata and joins it with the sequences downloaded via NCBI Datasets.On this branch, I can run the ingest workflow with
then successfully run the genome build (
using tmp changes in 4b511fc)Still needs some clean up but at least wanted to get this out for people to try out.Major change
This PR completely changes over the h5n1-cattle-outbreak build to the NCBI data.
See the latest build at https://nextstrain.org/staging/avian-flu/ncbi/h5n1-cattle-outbreak/genome
Related issue(s)
Related to #37
Checklist
nextclade run
as part of ingest -> tracking in ingest: Run Nextclade as part of ingest #44