Add keep-overhangs argument to ancestral #286
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds
--keep-overhangs
as an argument toancestral
to pass to Treetime in order to not reconstruct bases for gaps on either side of the alignment.We may want to discuss the argument name or how it should behave. Currently the default is that
keep-overhangs
is False, and thenot
of this is used to setfill_overhangs
to pass to Treetime. (So default behaviour is the same as previously.) Supplying the argument--keep_overhangs
turns this True, which setsfill_overhangs
to False in Treetime.======================
I caught this because of some unexpected behaviour, and wanted to document this - that's what follows.
I received partial VP1 Echovirus 30 sequences without a reference to align. I went on Genbank and found a sequence roughly the same length (~18bp longer) to use.
After aligning, here's how my sequences look:

Most sequences have a gap at 360. A few have N's. The alignments are in-frame, so position 360 is the 3rd position of a codon. Some sequences have 'GT-' as the codon (positions 358, 359, 360), some have 'AT-' (and of course 'NNN' and 'NN-').
With

--keep-ambiguous
on but the default Treetime--fill-overhangs
being True, here's the output fromancestral
:(I've tried to make it easier to look at by just including the codon at 358-360 to the end of sequence)
The 2nd position is always T in sequences that have this information, so this is reconstructed in all sequences, even those that were originally 'NNN' or 'NN-' (first two).
In the next 4, the codons are either 'ATN' or 'GTN' (originally 'AT-' or 'GT-').
In the last four, these are internal nodes. As there is no information whatsoever at position 360 to the end of the sequence. With
--fill-overhangs
, all these sites are 'reconstructed' as A's.So, on internal nodes we now have two codons: 'GTA' and 'ATA'. Note that we actually have no information about what the third position is in the actual sequences. Following this, the trailing gaps at the end of the sequence (where the reference was longer than my alignment) have been 'reconstructed' as A's, though again there's no information at these sites.
After

translate
, this is what we have:AA 120 is the translation for nucleotides 358-360.
For the first two sequences (originally 'NN-' or 'NNN', then 'NTN' above), AA 120 is 'X'. For the next two, which were 'AT-' and then 'ATN', these are also translated to 'X'. The two after this, however, were 'GT-' and then 'GTN', which translates to V, as the third position doesn't matter.
Finally, for the internal nodes, we have V and I as these were reconstructed as 'GTA' and 'ATA'. Also note that for internal nodes, the trailing gaps were translated to A's, and so now are 6 Ks.
When we look at this in

auspice
:At position 120:
It's a bit odd that some sequences are V at AA 120, given that we used
--keep-ambiguous
and we know there's no sequence information at the third nucleotide. However, this is because 'GTN' is always translated to V byBio.Seq.translate()
since the third position doesn't matter. But, by the same fact, the information isn't untrue.What's very odd is that we have internal nodes with the AA I (yellow branches) even though close inspection shows us that no tips are I. This information is totally fabricated, as we have no information on what the third position is. ('ATN' is translated to X by
Bio.Seq.translate()
because it could be I or M depending on 3rd position.)We also have a strange situation if we look at AA 121:

Here, though information is missing at every tip, we see that the ancestral state at this position is K. This is simply the translation of 'AAA', the default reconstruction of sites where only ambiguous bases are present.
If we use

--keep-overhangs
, here's how ancestral looks:After

translate
, all sequences are translated as 'X------' from AA 120. (Though 'GTN' is translated as V, 'GT-' is translated to X.) And inauspice
, things look as expected: