Ingest NCBI GenBank data for H5N1 outbreak #37
Comments
Hi from Nextstrain's biggest fan/lurker. 🙂
That can work with some effort for sequences submitted to GenBank, but not for sequences submitted to ENA (and then imported to GenBank as part of the INSDC partnership). ENA forbids useful naming in the title. Unfortunately you have to go through BioSample (either efetch (ugh! takes forever) or NCBI datasets JSON, much nicer but then there's the occasional invalid zip error) in order to get isolate names for sequences that were submitted to ENA. Until recently, I was usually able to get decent metadata for all ~million influenza A sequences, including the /strain and /isolate keywords from GenBank, using this vvsearch2 query, and then I got FASTA sequences using this datasets command. But in the past week both the Virus query and datasets command started failing. [I intend to switch over to using only NCBI datasets at some point (not the vvsearch2 query), and ultimately to fetch only the newest sequences instead of all sequences, but dunno when I'll actually get around to that.]
Thanks for the tip @AngieHinrichs, that's super helpful!
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>
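The `strain`/`isolate` fallback described in that commit message could look roughly like the sketch below. It assumes each record is a plain dict keyed by the field names `strain` and `isolate`, and that `strain` should win when both are present. The helper name `pick_strain_name` is hypothetical and not taken from the ingest workflow.

```python
def pick_strain_name(record):
    """Return the strain name for a metadata record, or None if absent.

    Assumption: `record` is a dict with optional "strain" and "isolate"
    keys. "strain" is preferred; "isolate" is used only as a fallback.
    """
    for field in ("strain", "isolate"):
        # Treat missing, None, and whitespace-only values as empty.
        value = (record.get(field) or "").strip()
        if value:
            return value
    return None
```

Keeping both fields in the raw metadata and resolving them in one place like this means the fallback order can be changed later without re-fetching anything.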
Based on the vvsearch2 query with […]. However, I've been continuously running into an error when fetching:
@AngieHinrichs is this what you mean by the Virus query failing in the past week?
Yep. I used to get that error every once in a while, but in the past week it has been happening every time for influenza (similar queries still work for RSV, dengue, MPXV 🤞).
Also tried with wget and continuously see this error:
The vvsearch2 query worked for me just now!
Closing since we have the base ingest workflow set up in #40.
Jotting down concrete steps for ingesting NCBI GenBank data for H5N1 outbreak based on internal team discussion and GDoc notes.
Original plan that was discussed:
1. … `datasets` command and shows 2,561 records on NCBI Virus.
2. `datasets` to download dataset for accessions
3. `Bio.Entrez.efetch` to fetch the GenBank records to parse out additional fields that are not included in the dataset metadata: strain, serotype, segment.

Detours
I waffled a little on whether we needed (3), because I realized that the GenBank `Title` is included in the FASTA headers of the sequences downloaded! We could potentially parse out strain, serotype, and segment from the titles.

However, I'm not sure that all record titles will follow the same format. The GenBank docs for the `Definition`/`Title` do not make me confident about it either. I think I'll stick with the original plan to use `Entrez`, but it was a nice thought.

(1) is currently not possible because the download function from NCBI Virus is broken due to a bug for H5N1 (Slack thread).
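For reference, the title-parsing detour could be sketched as below. This assumes titles of the form `Influenza A virus (<strain>(<serotype>)) segment <n> ...`. That layout is an assumption, and the sample title in the usage note is made up. Titles that do not match return `None`, which is exactly the case that makes this approach unreliable. The names `TITLE_PATTERN` and `parse_title` are hypothetical.

```python
import re

# Assumed title layout (not guaranteed by GenBank):
#   Influenza A virus (<strain>(<serotype>)) segment <n> ...
TITLE_PATTERN = re.compile(
    r"Influenza A virus \((?P<strain>.+)\((?P<serotype>H\d+N\d+)\)\)"
    r" segment (?P<segment>\d+)\b"
)


def parse_title(title):
    """Extract strain, serotype, and segment from a record title.

    Returns a dict, or None when the title does not follow the
    assumed layout. Uses search() so a leading accession, as found
    in FASTA headers, is tolerated.
    """
    match = TITLE_PATTERN.search(title)
    if match is None:
        return None
    return {
        "strain": match.group("strain"),
        "serotype": match.group("serotype"),
        "segment": int(match.group("segment")),
    }
```

For example, `parse_title("Influenza A virus (A/example/Place/1/2024(H5N1)) segment 4 hemagglutinin (HA) gene, complete cds")` yields strain `A/example/Place/1/2024`, serotype `H5N1`, and segment `4`. Counting how many titles come back as `None` would be a quick way to judge whether step (3) can be skipped.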
I thought we could just download all of the influenza genomes and then filter locally, but I'm hitting the invalid zip archive error when running the full download.
I'm not hitting the error with some minimal filtering on release date and geo-location, so we can go with that.
This returns 32,262 records. Grepping for `H5N` in the `genomic.fna` brought this down to 3,275, which is closer to the number on NCBI Virus, though I cannot easily check whether they are the same sequences.
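The local `H5N` filter can also be done in Python, which keeps the matching sequences rather than just a count. This is a minimal sketch that assumes a standard FASTA layout where each record starts with a `>` header line. The function names `iter_fasta` and `filter_by_header` are hypothetical.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def filter_by_header(lines, needle="H5N"):
    """Return records whose header contains `needle` (substring match)."""
    return [(h, s) for h, s in iter_fasta(lines) if needle in h]
```

Usage would be along the lines of `with open("genomic.fna") as fh: records = filter_by_header(fh)`. Collecting the accessions from the matching headers would also make it possible to compare them against the accession list from NCBI Virus once that download works again.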