-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reference outgroup for 2y/6m builds #89
Conversation
1b78a87
to
1f5a23c
Compare
Add reference outgroup for 2y and 6m builds to avoid incorrect rooting of trees. The reference is pruned from the tree after the refine step based on James' code snippet¹. The references added are the references used by Nextclade, but with the strain names changed to match the strain name within fauna. ¹ nextstrain/augur#340 (comment)
1f5a23c
to
03df388
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one non-blocking comment and some additional ideas below, but I think this is generally good to merge as a patch to the current workflow.
One minor technical point is that the prune_reference
rule should use a shell
block that calls a script with the logic in the current run
block. Snakemake's run blocks tend to be difficult to debug and don't work with --use-conda
.
At first I wondered why we needed to define a new reference strain per lineage when we already have these for alignment (config/reference_{lineage}_{segment}.gb
). Then I realized that the reference for alignment doesn't get included in the alignment output from the codon aligner script, so we have to separately add a reference sequence for rooting. This behavior of the codon aligner is different from augur align where the reference is automatically included as the first sequence in the output and from nextalign which provides an --include-reference
flag.
Thinking ahead to how we'd implement this in the refactored build, I wanted to see if we could just use the existing alignment reference to root the divergence and time trees. By default, IQ-TREE uses the first sequence in the alignment to root the tree (or you can specify the sequence name to root with using -o <sequence_name>
). If we produce a properly rooted tree from IQ-TREE, we can tell augur refine to keep this root and skip any additional rooting logic (--keep-root
). We might still need to prune the reference from the time tree, to keep that older strain out of the analysis.
To test this idea, I added a --include-reference
argument to the codon alignment script so the reference sequence is the first sequence in the alignment output and I built the following tree from the resulting alignment. The sequence at the root is the reference used for the alignment.
Then, I ran augur refine with --keep-root
instead of the explicit root. Interestingly, TreeTime kept the rooting from IQ-TREE, but it pruned the reference sequence with the clock filter (since it is so much older than all the other strains in the tree). The resulting tree looked like:
The earliest sequence in the tree is the other California/7 reference that was force-included in this branch. The pruned tree that removes California/7 looked like this:
Since we already use nextalign and IQ-TREE in the refactored workflow, we could make the following minimal modifications to that workflow to use this rooting approach:
- run nextalign with
--include-reference
- run augur refine with
--keep-root
(relying on IQ-TREE's implicit rooting)
There are some issues with this approach. I'm not sure that TreeTime will always drop the root from the tree with the clock filter or we not always enable the clock filter in the future. This approach also relies on the default behavior of IQ-TREE; if we switch tree builders, the root could appear in the tree again. A better approach might be to implement @trvrb's recommended --remove-outgroup
flag for augur refine, so we can explicitly specify the sequence to root with and then prune it no matter what other settings are used1. For this approach we would:
- run nextalign with
--include-reference
- run augur refine with
--root <reference_name>
and--remove-outgroup
These changes don't require any additional rules or custom scripts and provide a general pattern to use across pathogens. We could probably implement this new flag in augur refine relatively quickly and just use it instead of implementing the less reliable short-term solution I outlined above.
Footnotes
-
We may want to rename
--remove-outgroup
to--remove-root
, to make the flag name consistent with--root
and--keep-root
. ↩
Snakemake run blocks are hard to debug and they do not run in rule-specific conda environments, so we try to avoid them in our workflows!
Print a warning if reference is empty, but still copy the input tree to the output tree. This is expected to happen for any resolution that is not 2y or 6m in our live builds.
I had to make an additional commit post-merge to fix the WHO builds. I ran into an
I assume there's some sort of pattern matching (or maybe a lack thereof?) that is making this happen: |
Description of proposed changes
Add reference outgroup for 2y and 6m builds to avoid incorrect rooting of trees. The reference is pruned from the tree after the refine step using James' code snippet¹.
The references added are the references used by Nextclade, but with the strain names changed to match the strain name within fauna.
¹ nextstrain/augur#340 (comment)
(There's a bunch of path changes, but the main change is in
Snakefile_base
)Related issue(s)
Related to nextstrain/augur#340
Testing
nextstrain.org/staging/include-reference-outgroup/flu/seasonal/*
.