Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keep-overhangs argument to ancestral #286

Merged
merged 1 commit into from
May 22, 2019
Merged

Add keep-overhangs argument to ancestral #286

merged 1 commit into from
May 22, 2019

Conversation

emmahodcroft
Copy link
Member

This PR adds --keep-overhangs as an argument to ancestral to pass to Treetime in order to not reconstruct bases for gaps on either side of the alignment.

We may want to discuss the argument name or how it should behave. Currently the default is that keep-overhangs is False, and the not of this is used to set fill_overhangs to pass to Treetime. (So default behaviour is the same as previously.) Supplying the argument --keep_overhangs turns this True, which sets fill_overhangs to False in Treetime.

======================

I caught this because of some unexpected behaviour, and wanted to document this - that's what follows.

I received partial VP1 Echovirus 30 sequences without a reference to align. I went on Genbank and found a sequence roughly the same length (~18bp longer) to use.

After aligning, here's how my sequences look:
image

Most sequences have a gap at 360. A few have N's. The alignments are in-frame, so position 360 is the 3rd position of a codon. Some sequences have 'GT-' as the codon (positions 358, 359, 360), some have 'AT-' (and of course 'NNN' and 'NN-').

With --keep-ambiguous on but the default Treetime --fill-overhangs being True, here's the output from ancestral:
image
(I've tried to make it easier to look at by just including the codon at 358-360 to the end of sequence)

The 2nd position is always T in sequences that have this information, so this is reconstructed in all sequences, even those that were originally 'NNN' or 'NN-' (first two).
In the next 4, the codons are either 'ATN' or 'GTN' (originally 'AT-' or 'GT-').
In the last four, these are internal nodes. As there is no information whatsoever at position 360 to the end of the sequence. With --fill-overhangs, all these sites are 'reconstructed' as A's.
So, on internal nodes we now have two codons: 'GTA' and 'ATA'. Note that we actually have no information about what the third position is in the actual sequences. Following this, the trailing gaps at the end of the sequence (where the reference was longer than my alignment) have been 'reconstructed' as A's, though again there's no information at these sites.

After translate, this is what we have:
image
AA 120 is the translation for nucleotides 358-360.

For the first two sequences (originally 'NN-' or 'NNN', then 'NTN' above), AA 120 is 'X'. For the next two, which were 'AT-' and then 'ATN', these are also translated to 'X'. The two after this, however, were 'GT-' and then 'GTN', which translates to V, as the third position doesn't matter.
Finally, for the internal nodes, we have V and I as these were reconstructed as 'GTA' and 'ATA'. Also note that for internal nodes, the trailing gaps were translated to A's, and so now are 6 Ks.

When we look at this in auspice:
At position 120:
image

It's a bit odd that some sequences are V at AA 120, given that we used --keep-ambiguous and we know there's no sequence information at the third nucleotide. However, this is because 'GTN' is always translated to V by Bio.Seq.translate() since the third position doesn't matter. But, by the same fact, the information isn't untrue.

What's very odd is that we have internal nodes with the AA I (yellow branches) even though close inspection shows us that no tips are I. This information is totally fabricated, as we have no information on what the third position is. ('ATN' is translated to X by Bio.Seq.translate() because it could be I or M depending on 3rd position.)

We also have a strange situation if we look at AA 121:
image

Here, though information is missing at every tip, we see that the ancestral state at this position is K. This is simply the translation of 'AAA', the default reconstruction of sites where only ambiguous bases are present.

If we use --keep-overhangs, here's how ancestral looks:
image

After translate, all sequences are translated as 'X------' from AA 120. (Though 'GTN' is translated as V, 'GT-' is translated to X.) And in auspice, things look as expected:
image

@rneher rneher merged commit 9aa4333 into master May 22, 2019
@rneher rneher deleted the keep-overhang branch May 22, 2019 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants