- Default Build Name: WNV Global
- State Based Build Name: WNV Washington Focused Build
- Pathogen/Strain: West Nile Virus
- Scope: Full genome
- Purpose: This repository analyzes West Nile virus (WNV) genomes using Nextstrain to understand the circulation and transmission of WNV globally (WNV Global build) and within Washington State (WNV Washington Focused Build). This repository was developed based on the WNV repository used for the Twenty years of West Nile Virus in the Americas Nextstrain Narrative.
- Nextstrain Build/s Location/s: [Insert the URL for the Nextstrain build on Nextstrain Groups] [Insert another URL for instances when more than one Nextstrain build exists]
- Getting Started
- Run the Build
- Repository File Structure Overview
- Expected Outputs
- Scientific Decisions
- Customization for Local Adaptation
- Contributing
- License
- Acknowledgements
Some high-level features and capabilities specific to this build include:
- Lineage Designation: We use Pathoplexus for clade calling, based on a Nextclade dataset in this PR.
- Subsampling: The WNV Washington Focused Build uses a tiered subsampling strategy that filters NCBI data by geographic location. The subsampling criteria in the WNV Washington Focused Build are set to select all sequences from Washington, neighboring states, and region, up to a maximum of 5,000 sequences. Additionally, up to 300 sequences are randomly selected from other states. These criteria can be modified as needed.
- Mapping Specific Locations: We have added the option to map specific locations using coordinates in the WNV Washington Focused Build. This feature is useful for a state that needs to map the locations of mosquito traps, for example.
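As an illustration of the tiered subsampling described in the list above, here is a minimal Python sketch. It is not the build's actual implementation, which is driven by the build config. The `state` key, the two-letter state codes, and the choice of Oregon and Idaho as the neighboring tier are assumptions made for this example only; the caps of 5,000 and 300 come from the description above.

```python
import random

# Hypothetical tier definitions; the real values live in the build config.
FOCAL_STATES = {"WA"}
NEIGHBOR_STATES = {"OR", "ID"}  # assumed neighbors/region, for illustration only
MAX_FOCAL = 5000                # cap for Washington + neighbors + region
MAX_CONTEXT = 300               # cap for randomly chosen other-state sequences


def tiered_subsample(records, seed=0):
    """Return a subsample of `records`, each a dict with a 'state' key."""
    rng = random.Random(seed)
    priority = FOCAL_STATES | NEIGHBOR_STATES
    focal = [r for r in records if r["state"] in priority]
    context = [r for r in records if r["state"] not in priority]
    # Keep every focal record unless the cap is exceeded.
    if len(focal) > MAX_FOCAL:
        focal = rng.sample(focal, MAX_FOCAL)
    # Draw the contextual tier at random, up to its cap.
    context = rng.sample(context, min(MAX_CONTEXT, len(context)))
    return focal + context
```

With 10 Washington, 5 Oregon, and 400 Texas records, this sketch keeps all 15 focal records and 300 randomly chosen Texas records.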
This build pulls WNV genomes that are publicly available from NCBI.
- Sequence and Metadata Data: NCBI
- Expected Inputs:
  - ingest/data/sequences.fasta (containing WNV genome sequences)
  - ingest/data/metadata.tsv (with relevant sample information)
- Private geolocation data, if applicable:
  - phylogenetic/defaults/wa/annotations.tsv (containing location name, latitude, and longitude information)
Follow the standard installation instructions for Nextstrain's suite of software tools.
git clone https://github.com/nextstrain/WNV.git
cd WNV
Try running Augur and Auspice to confirm the installation:
augur --help
auspice --help
This build can process and output global or Washington state focused WNV information.
To run the build one workflow at a time, first run the ingest workflow:
nextstrain build ingest
Inside the ingest folder there should be two output files: metadata.tsv and sequences.fasta
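A quick sanity check on the ingest outputs can catch a truncated download before the phylogenetic workflow starts. The sketch below assumes one metadata row per sequence, which this README does not state, and takes the directory holding the two files as an argument because the exact location may differ between setups. The function names are hypothetical.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count sequences by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_metadata_rows(path):
    """Count non-empty data rows, excluding the header line."""
    with open(path) as handle:
        rows = [line for line in handle if line.strip()]
    return max(len(rows) - 1, 0)


def ingest_outputs_consistent(ingest_dir):
    """True when both files exist and describe the same number of samples."""
    ingest_dir = Path(ingest_dir)
    fasta = ingest_dir / "sequences.fasta"
    metadata = ingest_dir / "metadata.tsv"
    if not (fasta.is_file() and metadata.is_file()):
        return False
    return count_fasta_records(fasta) == count_metadata_rows(metadata)
```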
Next, run the phylogenetic workflow. To execute the global build:
nextstrain build phylogenetic
Or, to execute the Washington focused build:
nextstrain build phylogenetic --configfile build-configs/washington-state/config.yaml
Inside the phylogenetic folder there should be at least one output file: WNV_{build name}.json
This Nextstrain build follows the structure detailed in the Pathogen Repo Guide. Mainly, this build contains two workflows for the analysis of WNV data:
- ingest/: Download data from NCBI; clean, format, and curate it; and assign clades.
- phylogenetic/: Subsample data and make phylogenetic trees for use in Nextstrain.
After successfully running the build, there will be two output folders containing the build results:
- phylogenetic/auspice/ folder contains: a file called WNV_{build name}.json
- results/ folder contains: multiple intermediate files, which include the aligned sequences, subsampled sequences, and phylogenetic trees in .nwk format
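To confirm that a run produced a usable dataset, one option is to check that each WNV_{build name}.json file parses and carries the top-level "meta" and "tree" keys used by Auspice v2 dataset files. The helper below is a hypothetical sketch, not part of this repository. Sidecar JSON files that match the same prefix but lack those keys are skipped.

```python
import json
from pathlib import Path


def find_auspice_datasets(auspice_dir):
    """Return names of WNV_*.json files that look like Auspice v2 datasets."""
    valid = []
    for path in sorted(Path(auspice_dir).glob("WNV_*.json")):
        with open(path) as handle:
            data = json.load(handle)
        # Auspice v2 dataset JSONs carry top-level "meta" and "tree" keys.
        if isinstance(data, dict) and "meta" in data and "tree" in data:
            valid.append(path.name)
    return valid
```

Calling `find_auspice_datasets("phylogenetic/auspice")` after a build should return at least one file name if the run succeeded.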
The following are critical decisions that were made during the development of the WNV build that should be kept in mind when analyzing the data.
This build can process and output global or Washington state focused WNV information. To accomplish this, a washington-state/config.yaml file was added to build-configs, which specifies Washington subsampling preferences. This file can be adapted and modified to accommodate sampling preferences appropriate to other regions or states.
The Global and the Washington focused WNV builds use different references.
The Global WNV build uses the reference sequence AF260968, which is the first WNV L1 (cluster 1) strain, recovered in Egypt in 1951. Mencattelli, G., Ndione, M.H.D., Silverj, A. et al. Spatial and temporal dynamics of West Nile virus between Africa and Europe. Nat Commun 14, 6440 (2023). https://doi.org/10.1038/s41467-023-42185-7
The Washington focused WNV build uses the reference sequence AF481864 as this is the sequence that is most closely related to the sequences isolated from New York in 1999. Hadfield J, Brito AF, Swetnam DM, Vogels CBF, Tokarz RE, Andersen KG, Smith RC, Bedford T, Grubaugh ND. Twenty years of West Nile virus spread and evolution in the Americas visualized by Nextstrain. PLoS Pathog. 2019 Oct 31;15(10):e1008042. doi: 10.1371/journal.ppat.1008042. PMID: 31671157; PMCID: PMC6822705.
For global lineage designations, we query Pathoplexus.
We further refined the information in the NCBI Host column by categorizing it into Host_Genus and Host_Type, creating broader groupings for more effective data analysis. For example, the Host Homo sapiens is classified under Host_Genus as Homo and Host_Type as Human. This broader categorization is particularly useful for visualizing host information on the phylogenetic tree. Instead of distinguishing between individual mosquito species, you can use the broader categories like Host_Genus Culex or the higher-level category Host_Type Mosquito to color the tips of the tree.
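The host grouping described above can be sketched as a simple lookup. Only the Homo to Human and Culex to Mosquito pairings come from this README. The other entries in the table, the "Other" and "Unknown" fallbacks, and the function name are assumptions for illustration and may not match the categories used in the actual ingest workflow.

```python
# Hypothetical lookup from genus to broad host type; extend as needed.
GENUS_TO_TYPE = {
    "Homo": "Human",
    "Culex": "Mosquito",
    "Aedes": "Mosquito",  # assumed entry, not taken from the README
    "Corvus": "Bird",     # assumed entry, not taken from the README
}


def categorize_host(host):
    """Split a host name such as 'Culex pipiens' into (Host_Genus, Host_Type)."""
    words = host.strip().split()
    if not words:
        return ("Unknown", "Unknown")
    # The genus is the first word of a binomial name.
    genus = words[0].capitalize()
    return (genus, GENUS_TO_TYPE.get(genus, "Other"))
```

For example, `categorize_host("Homo sapiens")` returns `("Homo", "Human")`, so tips can be colored by either the genus or the broader type.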
The average genome length of WNV is 10,948 bp. We evaluated minimum genome length thresholds of approximately 90% (9,800 bp), 80% (8,700 bp), 75% (8,200 bp), and 70% (7,700 bp). For each threshold, we ran the Washington-focused build and compared: (1) the number of sequences included, (2) data gap locations in the alignment files using an alignment viewer, and (3) the topology and lineage assignments from the phylogenetic tree outputs to determine the optimal threshold. We concluded that a minimum genome length of 75% (8,200 bp) retained more sequences while preserving alignment quality. Lastly, we validated this threshold using the global build.
- To change the minimum nucleotide sequence length in the WNV global build, set the desired threshold in the --min-length <MIN_LENGTH> parameter listed in the defaults/config.yaml file.
- To change the minimum nucleotide sequence length in the WNV Washington focused build, set the desired threshold in the --min-length <MIN_LENGTH> parameter listed in the washington-state/config.yaml file.
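When choosing a different threshold, it can help to see how a percentage maps to a base-pair cutoff and how a sequence is judged against it. The sketch below is illustrative and does not reproduce the build's filtering step. Counting only unambiguous A, C, G, and T bases is an assumption of this example, and both function names are hypothetical.

```python
AVERAGE_GENOME_LENGTH = 10_948  # bp, as stated in this README


def threshold_bp(fraction, genome_length=AVERAGE_GENOME_LENGTH):
    """Exact base-pair cutoff for a fractional coverage threshold."""
    return int(genome_length * fraction)


def passes_min_length(sequence, min_length=8_200):
    """Count unambiguous A/C/G/T bases and compare against the cutoff."""
    valid = sum(1 for base in sequence.upper() if base in "ACGT")
    return valid >= min_length
```

Here `threshold_bp(0.75)` gives 8211, which corresponds to the rounded 8,200 bp value used in this build.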
This build can be customized for use by other demes, such as states, cities, counties, or countries.
The Washington focused WNV build retrieves all available WNV sequences from NCBI and filters the data within the phylogenetic workflow based on criteria defined in the build-configs/washington-state/config.yaml file. For details on the current subsampling configuration and instructions on modifying the criteria, refer to the phylogenetic/build-configs/washington-state README.md.
We have added the option to integrate additional metadata, which can include either public or sensitive information. This feature is especially useful for state health departments that need to annotate the phylogenetic trees or map visualizations in Auspice. For example, in the Washington focused WNV build, we mapped the centroids of zip codes where mosquito traps are located. This information is located in the phylogenetic/data-private/metadata.tsv file. For more details on the current metadata configuration and instructions on modifying it, refer to the phylogenetic/data-private/README.md.
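For mapping custom locations, coordinates are typically supplied as tab-separated lines of geographic resolution, name, latitude, and longitude. The sketch below converts annotation rows into lines of that shape and rejects out-of-range coordinates. The column names `location`, `latitude`, and `longitude`, the default resolution label, and the function name are assumptions for this example; check the README in phylogenetic/data-private for the columns actually used.

```python
import csv
import io


def annotations_to_lat_longs(annotations_tsv, resolution="location"):
    """Convert annotation rows into 'resolution<TAB>name<TAB>lat<TAB>long' lines."""
    reader = csv.DictReader(io.StringIO(annotations_tsv), delimiter="\t")
    lines = []
    for row in reader:
        lat = float(row["latitude"])
        lon = float(row["longitude"])
        # Reject coordinates outside the valid geographic range.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"bad coordinates for {row['location']}")
        lines.append(f"{resolution}\t{row['location']}\t{lat}\t{lon}")
    return lines
```

Validating coordinates at this stage surfaces swapped latitude and longitude values before they appear as misplaced points on the map.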
For any questions, please post to our [Discussions](insert link here) page. Software issues and feature requests can be logged as a Git [Issue](insert link here).
[add acknowledgements to those who have contributed to this work]