Skip to content

Commit

Permalink
Add section on multi-pass subsampling
Browse files Browse the repository at this point in the history
  • Loading branch information
victorlin committed Feb 23, 2024
1 parent 5ce5902 commit 5dfc2f3
Showing 1 changed file with 140 additions and 0 deletions.
140 changes: 140 additions & 0 deletions docs/usage/cli/filter.rst
Original file line number Diff line number Diff line change
Expand Up @@ -71,3 +71,143 @@ The filter command allows you to select a specific number of sequences from spec
--output filtered.fasta
This subsampling and filtering will reduce the number of sequences in the tutorial data set from 34 to 24.

Subsampling beyond the capabilities of ``augur filter``
-------------------------------------------------------

There are some subsampling strategies in which a single call to ``augur filter``
does not suffice. One such strategy is "tiered subsampling". In this strategy,
mutually exclusive sets of filters, each representing a "tier", are sampled with
different subsampling rules. This is commonly used to create geographic tiers.
Consider this subsampling scheme:

Sample 20 sequences from North America and 50 sequences from Europe.

This cannot be done in a single call to ``augur filter``. Instead, it can be
decomposed into multiple schemes, each handled by a single call to ``augur
filter``. Additionally, there is an extra step to combine the intermediate
samples.

1. Sample 20 sequences from North America.
2. Sample 50 sequences from Europe.
3. Combine the samples.

Each intermediate sample can be created using ``augur filter``, then joined
together to create the final subsample.

.. code-block:: bash
# 1. Sample 20 sequences from North America
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "region == 'North America'" \
--subsample-max-sequences 20 \
--output-sequences north_america_sample.fasta \
--output-metadata north_america_sample.tsv
# 2. Sample 50 sequences from Europe
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--query "region == 'Europe'" \
--subsample-max-sequences 50 \
--output-sequences europe_sample.fasta \
--output-metadata europe_sample.tsv
# 3.1 Combine FASTA files
cat north_america_sample.fasta europe_sample.fasta > combined.fasta
# 3.2 Combine metadata files, using tail to skip the extra header row
cat north_america_sample.tsv > combined.tsv
tail -n+2 europe_sample.tsv >> combined.tsv
If no sequence filters are used, there is a better approach:

.. code-block:: bash
# 1. Sample 20 sequences from North America
augur filter \
--metadata metadata.tsv \
--query "region == 'North America'" \
--subsample-max-sequences 20 \
--output-strains north_america_sample_strains.txt
# 2. Sample 50 sequences from Europe
augur filter \
--metadata metadata.tsv \
--query "region == 'Europe'" \
--subsample-max-sequences 50 \
--output-metadata europe_sample_strains.txt
# 3. Combine using augur filter
augur filter \
--sequences sequences.fasta \
--metadata metadata.tsv \
--exclude-all \
--include north_america_sample_strains.txt \
europe_sample_strains.txt \
--output-sequences combined.fasta \
--output-metadata combined.tsv
Differences:

- In the first two steps, ``--input--sequences`` is removed.
- In the first two steps, ``--output-sequences`` and ``--output-metadata`` are
replaced by ``--output-strains``.
- The third step of combining the two intermediate samples is done using ``augur
filter`` itself. This uses ``augur filter`` to sample the data based on an
exact list of strains using a combination of `--exclude-all` and `--include`.

For large datasets, this has a computational benefit. Reading through a large
FASTA file can take a long time. Instead of reading through it twice in the
first two steps, read through it once in the final step.

The approach above can be cumbersome with multiple intermediate samples. Another
approach is to generalize it using Snakemake.

In a Snakemake configuration file:

.. code-block:: yaml
subsampling:
north_america: --query "region == 'North America'" --subsample-max-sequences 20
europe: --query "region == 'Europe'" --subsample-max-sequences 50
In a Snakefile:

.. code-block:: python
rule subsample:
input:
metadata = "data/metadata.tsv",
output:
strains = "results/{sample_name}_sample_strains.txt",
params:
augur_filter_args = lambda wildcards: config.get("subsampling", {}).get(wildcards.sample_name, "")
shell:
"""
augur filter \
--metadata {input.metadata} \
{params.augur_filter_args} \
--output-strains {output.strains}
"""
rule combine_intermediate_samples:
input:
sequences = "data/sequences.fasta",
metadata = "data/metadata.tsv",
intermediate_sample_strains = expand("results/{sample_name}_sample_strains.txt", sample_name=list(config.get("subsampling", {}).keys()))
output:
sequences = "results/combined.fasta",
metadata = "results/combined.tsv",
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--exclude-all \
--include {input.intermediate_sample_strains} \
--output-sequences {output.sequences} \
--output-metadata {output.metadata}
"""

0 comments on commit 5dfc2f3

Please sign in to comment.