Investigate incomplete strain name A//USA//2024 from SRA data #125

Open
joverlee521 opened this issue Feb 7, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@joverlee521
Contributor

When I was looking into #122, I noticed an incomplete strain name A//USA//2024 in the SRA data.

The SRA strain names are constructed from the metadata, so the empty parts of this strain name imply that the metadata is missing the host and isolate ID. Looking at the raw metadata at https://github.com/andersen-lab/avian-influenza/blob/3eae93cc77cec4515686f47b098dfdb2837acb22/metadata/SraRunTable_automated.csv, I realized that all SRA records under the BioProject PRJNA1134696 are missing the host and isolate ID. As a result, all of these records are being "deduped" into the single strain A//USA//2024.
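To illustrate the failure mode, here is a minimal sketch of how empty metadata fields could collapse distinct records into one strain name. It assumes the name follows an `A/<host>/<country>/<isolate>/<year>` pattern, which is inferred from the shape of `A//USA//2024`. The function names, dictionary keys, and run IDs are hypothetical and are not taken from the actual ingest code.

```python
from collections import defaultdict


def build_strain(record: dict) -> str:
    """Join metadata fields into A/<host>/<country>/<isolate>/<year>.

    Assumed pattern; missing or empty fields become empty path segments.
    """
    parts = [
        "A",
        record.get("host", ""),
        record.get("country", ""),
        record.get("isolate", ""),
        record.get("year", ""),
    ]
    return "/".join((part or "").strip() for part in parts)


def group_by_strain(records: list[dict]) -> dict[str, list[str]]:
    """Group run IDs by constructed strain name, as a dedup step keyed on strain would."""
    groups = defaultdict(list)
    for record in records:
        groups[build_strain(record)].append(record["run"])
    return dict(groups)


# Hypothetical records: the first two lack host and isolate, the third is complete.
records = [
    {"run": "RUN_1", "host": "", "country": "USA", "isolate": "", "year": "2024"},
    {"run": "RUN_2", "host": "", "country": "USA", "isolate": "", "year": "2024"},
    {"run": "RUN_3", "host": "Cattle", "country": "USA", "isolate": "ABC-1", "year": "2024"},
]

groups = group_by_strain(records)
for strain, runs in groups.items():
    print(strain, runs)
```

In this sketch, both records without a host and isolate end up under the same key, so any deduplication keyed on strain name would keep only one of them.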

We should dig into the data for BioProject PRJNA1134696 to figure out what fields should be used to properly construct the strain name. However, this is not urgent because these SRA records are linked to 23 BioSample records that also have linked GenBank records. I verified that we are including these sequences from GenBank in the h5n1-cattle-outbreak build.
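One possible starting point for that investigation is to list which metadata columns are fully populated for the BioProject and how many distinct values each one has. The sketch below uses only the standard library. The column names (`Run`, `BioProject`, `BioSample`, `Host`, `isolate`) and the sample rows are assumptions made for illustration, and should be checked against the real header of `SraRunTable_automated.csv`.

```python
import csv
import io


def candidate_columns(rows, bioproject, bioproject_column="BioProject"):
    """Return {column: distinct value count} for columns that are non-empty
    in every row belonging to the given BioProject."""
    subset = [row for row in rows if row.get(bioproject_column) == bioproject]
    if not subset:
        return {}
    summary = {}
    for column in subset[0].keys():
        values = [(row.get(column) or "").strip() for row in subset]
        if all(values):  # skip columns with any empty value
            summary[column] = len(set(values))
    return summary


# Hypothetical stand-in for the real CSV; in practice, open the file instead.
sample_csv = """Run,BioProject,BioSample,Host,isolate
RUN_1,PRJNA1134696,SAMPLE_A,,
RUN_2,PRJNA1134696,SAMPLE_A,,
RUN_3,PRJNA1134696,SAMPLE_B,,
RUN_4,OTHER_PROJECT,SAMPLE_C,Cattle,ABC-1
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
summary = candidate_columns(rows, "PRJNA1134696")
for column, distinct in summary.items():
    print(f"{column}: {distinct} distinct value(s)")
```

Columns whose distinct count matches the expected number of samples would be the most promising inputs for a unique strain name, though that still needs to be confirmed against the linked BioSample and GenBank records.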

@joverlee521 added the bug label Feb 7, 2025