Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reference outgroup for 2y/6m builds #89

Merged
merged 3 commits into from
Sep 22, 2022

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Sep 19, 2022

Description of proposed changes

Add reference outgroup for 2y and 6m builds to avoid incorrect rooting of trees. The reference is pruned from the tree after the refine step using James' code snippet¹.

The references added are the references used by Nextclade, but with the strain names changed to match the strain name within fauna.

¹ nextstrain/augur#340 (comment)

(There's a bunch of path changes, but the main change is in Snakefile_base)

Related issue(s)

Related to nextstrain/augur#340

Testing

  • Completed a test run of the live builds on AWS Batch (job id: f059576b-3f9f-413f-9140-e0874e7a67ee)
    • Deployed test run builds to nextstrain.org/staging/include-reference-outgroup/flu/seasonal/*.
    • The main ones to check are the H1N1pdm 2y and 6m builds.
  • Completed a second test run (job id: c2e374a7-5114-48be-a85c-6e1bbc9589e5) with just H1N1pdm 2y and 6m builds to make sure the first run wasn't just good luck ;)
    • Locally checked the builds and they looked reasonable and similar to the first run 🎉

@joverlee521 joverlee521 force-pushed the include-reference-outgroup branch from 1b78a87 to 1f5a23c Compare September 19, 2022 23:41
Add reference outgroup for 2y and 6m builds to avoid incorrect rooting
of trees. The reference is pruned from the tree after the refine step
based on James' code snippet¹.

The references added are the references used by Nextclade, but with the
strain names changed to match the strain name within fauna.

¹ nextstrain/augur#340 (comment)
@joverlee521 joverlee521 force-pushed the include-reference-outgroup branch from 1f5a23c to 03df388 Compare September 19, 2022 23:45
@joverlee521 joverlee521 marked this pull request as ready for review September 20, 2022 20:41
@joverlee521
Copy link
Contributor Author

Created this as a fix for the rooting problem we see in h1n1pdm builds via the master branch.
@huddlej, not sure if we should add a more robust version of this in #76?

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one non-blocking comment and some additional ideas below, but I think this is generally good to merge as a patch to the current workflow.

One minor technical point is that the prune_reference rule should use a shell block that calls a script with the logic in the current run block. Snakemake's run blocks tend to be difficult to debug and don't work with --use-conda.

At first I wondered why we needed to define a new reference strain per lineage when we already have these for alignment (config/reference_{lineage}_{segment}.gb). Then I realized that the reference for alignment doesn't get included in the alignment output from the codon aligner script, so we have to separately add a reference sequence for rooting. This behavior of the codon aligner is different from augur align where the reference is automatically included as the first sequence in the output and from nextalign which provides an --include-reference flag.

Thinking ahead to how we'd implement this in the refactored build, I wanted to see if we could just use the existing alignment reference to root the divergence and time trees. By default, IQ-TREE uses the first sequence in the alignment to root the tree (or you can specify the sequence name to root with using -o <sequence_name>). If we produce a properly rooted tree from IQ-TREE, we can tell augur refine to keep this root and skip any additional rooting logic (--keep-root). We might still need to prune the reference from the time tree, to keep that older strain out of the analysis.

To test this idea, I added a --include-reference argument to the codon alignment script so the reference sequence is the first sequence in the alignment output and I built the following tree from the resulting alignment. The sequence at the root is the reference used for the alignment.

image

Then, I ran augur refine with --keep-root instead of the explicit root. Interestingly, TreeTime kept the rooting from IQ-TREE, but it pruned the reference sequence with the clock filter (since it is so much older than all the other strains in the tree). The resulting tree looked like:

image

The earliest sequence in the tree is the other California/7 reference that was force-included in this branch. The pruned tree that removes California/7 looked like this:

image

Since we already use nextalign and IQ-TREE in the refactored workflow, we could make the following minimal modifications to that workflow to use this rooting approach:

  • run nextalign with --include-reference
  • run augur refine with --keep-root (relying on IQ-TREE's implicit rooting)

There are some issues with this approach. I'm not sure that TreeTime will always drop the root from the tree with the clock filter or we not always enable the clock filter in the future. This approach also relies on the default behavior of IQ-TREE; if we switch tree builders, the root could appear in the tree again. A better approach might be to implement @trvrb's recommended --remove-outgroup flag for augur refine, so we can explicitly specify the sequence to root with and then prune it no matter what other settings are used1. For this approach we would:

  • run nextalign with --include-reference
  • run augur refine with --root <reference_name> and --remove-outgroup

These changes don't require any additional rules or custom scripts and provide a general pattern to use across pathogens. We could probably implement this new flag in augur refine relatively quickly and just use it instead of implementing the less reliable short-term solution I outlined above.

Footnotes

  1. We may want to rename --remove-outgroup to --remove-root, to make the flag name consistent with --root and --keep-root.

Snakemake run blocks are hard to debug and they do not run in
rule-specific conda environments, so we try to avoid them in our
workflows!
Print a warning if reference is empty, but still copy the input tree
to the output tree. This is expected to happen for any resolution that
is not 2y or 6m in our live builds.
@joverlee521 joverlee521 merged commit f3887e6 into master Sep 22, 2022
@joverlee521 joverlee521 deleted the include-reference-outgroup branch September 22, 2022 23:59
@joverlee521
Copy link
Contributor Author

I had to make an additional commit post-merge to fix the WHO builds.

I ran into an AmbiguousRuleException:

AmbiguousRuleException:
Rules prune_reference and refine are ambiguous for the file results/tree_pruned_vidrl_h3n2_ha_6y_cell_fra.nwk.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
	prune_reference: assay=fra,center=vidrl,lineage=h3n2,passage=cell,resolution=6y,segment=ha
	refine: assay=fra,center=pruned_vidrl,lineage=h3n2,passage=cell,resolution=6y,segment=ha
Expected input files:
	prune_reference: results/tree_vidrl_h3n2_ha_6y_cell_fra.nwk
	refine: results/tree-raw_pruned_vidrl_h3n2_ha_6y_cell_fra.nwk results/aligned_pruned_vidrl_h3n2_ha_6y_cell_fra.fasta results/metadata_h3n2_ha.tsvExpected output files:
	prune_reference: results/tree_pruned_vidrl_h3n2_ha_6y_cell_fra.nwk
	refine: results/tree_pruned_vidrl_h3n2_ha_6y_cell_fra.nwk results/branch-lengths_pruned_vidrl_h3n2_ha_6y_cell_fra.json

I assume there's some sort of pattern matching (or maybe a lack thereof?) that is making this happen: center=pruned_vidrl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants