Investigate incomplete strain name `A//USA//2024` from SRA data
#125
Labels
bug
When I was looking into #122, I noticed an incomplete strain name `A//USA//2024` in the SRA data.

The SRA strain names are constructed from the metadata, so the empty parts of the strain name imply that the metadata is missing host and isolate ID. Looking at the raw metadata at https://github.com/andersen-lab/avian-influenza/blob/3eae93cc77cec4515686f47b098dfdb2837acb22/metadata/SraRunTable_automated.csv, I realized that all SRA records under BioProject PRJNA1134696 are missing host and isolate ID. All of these records are being "deduped" as the single strain `A//USA//2024`.

We should dig into the data for BioProject PRJNA1134696 to figure out which fields should be used to properly construct the strain name. However, this is not urgent because these SRA records are linked to 23 BioSample records that also have linked GenBank records. I verified that we are including these sequences from GenBank in the h5n1-cattle-outbreak build.
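
To illustrate the collapse, here is a minimal sketch of how records with empty host and isolate fields end up sharing one strain name. It assumes the name is joined as `A/<host>/<location>/<isolate>/<year>`, which matches the shape of `A//USA//2024` but is not taken from the actual ingest code. The column names (`host`, `country`, `isolate`, `year`), the run IDs, and the `PRJ_OTHER` row are hypothetical and made up for the example; the real `SraRunTable_automated.csv` header should be checked before adapting this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; check the real CSV header before relying on them.
HOST, LOCATION, ISOLATE, YEAR = "host", "country", "isolate", "year"


def build_strain_name(row):
    """Join the parts as A/<host>/<location>/<isolate>/<year>.

    Missing or blank fields become empty strings, which is what
    produces the doubled slashes in a name like A//USA//2024.
    """
    parts = [(row.get(col) or "").strip() for col in (HOST, LOCATION, ISOLATE, YEAR)]
    return "/".join(["A", *parts])


def collapsed_strains(rows):
    """Group rows by constructed strain name; keep names shared by more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[build_strain_name(row)].append(row)
    return {name: members for name, members in groups.items() if len(members) > 1}


# Made-up sample rows, not real SRA records.
sample = io.StringIO(
    "Run,BioProject,host,country,isolate,year\n"
    "RUN_A,PRJNA1134696,,USA,,2024\n"
    "RUN_B,PRJNA1134696,,USA,,2024\n"
    "RUN_C,PRJ_OTHER,Cattle,USA,ID-1,2024\n"
)
rows = list(csv.DictReader(sample))
collisions = collapsed_strains(rows)

for name, members in collisions.items():
    print(name, [m["Run"] for m in members])
# prints: A//USA//2024 ['RUN_A', 'RUN_B']
```

Running the same grouping over the real metadata, filtered to PRJNA1134696, should show which other columns (if any) differ between the collapsed records and could serve as the isolate ID.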