When calculating distances between sequences with insertion or deletion (indel) events, augur distance currently counts each mismatched position within a single indel as a separate event. This does not make biological sense, because an indel is most likely the result of a single mutational event.
When calculating distances between nodes, augur should check for one or more consecutive gap characters (-) in each sequence and only count the first of these characters toward the overall distance.
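The grouping step could look something like the sketch below. The helper name `find_indel_events` is hypothetical and not part of augur. The sketch assumes that a position belongs to an indel when exactly one of the two sequences has a gap there, and that a run ends when the gap switches from one sequence to the other.

```python
def find_indel_events(seq_a, seq_b, gap="-"):
    """Group consecutive gap positions into indel events.

    Hypothetical helper (not part of augur). Returns a list of half-open
    (start, end) ranges. A position counts as an indel position when
    exactly one of the two sequences has a gap there.
    """
    events = []
    start = None   # start of the run currently being extended
    gapped = None  # which sequence ("a" or "b") carries the gap in that run
    length = min(len(seq_a), len(seq_b))

    for i in range(length):
        a_is_gap = seq_a[i] == gap
        b_is_gap = seq_b[i] == gap

        if a_is_gap != b_is_gap:
            side = "a" if a_is_gap else "b"
            if start is None:
                start, gapped = i, side
            elif side != gapped:
                # The gap moved to the other sequence: close the old run
                # and open a new one (an assumption of this sketch).
                events.append((start, i))
                start, gapped = i, side
        elif start is not None:
            # Neither or both sequences have a gap here: the run ends.
            events.append((start, i))
            start, gapped = None, None

    if start is not None:
        events.append((start, length))

    return events


events = find_indel_events("ACTG", "A--G")
```

Each returned range would then contribute one event to the distance instead of one per gap character.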
An additional complexity is that a site-specific distance map may specify different weights for the positions affected by an indel event. The distance function must report a single weight value for the entire indel event. Reasonable choices include the maximum weight of all affected positions or the mean weight of those positions.
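As a rough illustration of the two choices named above, the following sketch collapses the per-site weights of one indel into a single value. The function name `site_specific_indel_weight` and its parameters are hypothetical, and the input is assumed to be a plain mapping from site position to weight, like the per-gene part of a site-specific distance map.

```python
from statistics import mean


def site_specific_indel_weight(event, site_weights, default, aggregate=max):
    """Collapse per-site weights for one indel into a single weight.

    Hypothetical helper (not part of augur).
    event        -- half-open (start, end) range of affected positions
    site_weights -- assumed mapping of position -> weight for one gene
    default      -- weight for positions missing from site_weights
    aggregate    -- function such as max or statistics.mean
    """
    start, end = event
    weights = [site_weights.get(pos, default) for pos in range(start, end)]
    return aggregate(weights)


site_weights = {1: 1, 2: 2}
max_weight = site_specific_indel_weight((1, 3), site_weights, default=0)
mean_weight = site_specific_indel_weight((1, 3), site_weights, default=0, aggregate=mean)
```

With these inputs the maximum is 2 and the mean is 1.5, so the choice of aggregate changes the reported distance.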
A mutation-specific distance map may specify different weights for specific mutations that occur at sites affected by an indel. As in the solution above, one could report the maximum or mean weight of any mutation across all affected positions. If the distance map provides specific weights for gaps at sites, the weight for the indel event should be the maximum (or mean) of the gap weights at the affected sites (instead of the weights of all mutations).
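The preference for gap weights over other mutation weights might be sketched as follows. The function name `mutation_specific_indel_weight` is hypothetical. The sketch assumes mutation weights are keyed by `(character in sequence A, character in sequence B)` tuples per site, and that the default weight applies only when the map defines nothing at any affected site.

```python
def mutation_specific_indel_weight(seq_a, seq_b, event, site_maps, default, aggregate=max):
    """Pick one weight for an indel from a mutation-specific map.

    Hypothetical helper (not part of augur).
    site_maps -- assumed mapping of position -> {(from_char, to_char): weight}
    Gap-specific weights take priority; otherwise all mutation weights
    at the affected sites are aggregated.
    """
    start, end = event
    gap_weights = []
    all_weights = []

    for pos in range(start, end):
        mutations = site_maps.get(pos, {})
        observed = (seq_a[pos], seq_b[pos])  # e.g. ('C', '-') for a deletion
        if observed in mutations:
            gap_weights.append(mutations[observed])
        all_weights.extend(mutations.values())

    if gap_weights:
        return aggregate(gap_weights)
    if all_weights:
        return aggregate(all_weights)
    return default


no_gap_map = {1: {("C", "G"): 1, ("C", "A"): 2}, 2: {("T", "G"): 3, ("T", "A"): 2}}
gap_map = {1: {("C", "-"): 1, ("C", "A"): 2}, 2: {("T", "G"): 3, ("T", "-"): 2}}

weight_without_gaps = mutation_specific_indel_weight("ACTG", "A--G", (1, 3), no_gap_map, default=0)
weight_with_gaps = mutation_specific_indel_weight("ACTG", "A--G", (1, 3), gap_map, default=0)
```

Here the map without gap entries yields 3 (the largest weight of any mutation at the affected sites), while the map with gap entries yields 2 (the largest gap weight).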
Examples
Add the following doctests to the get_distance_between_nodes function and use these in test-driven development of the functionality proposed above.
"""Treat a single indel as one event.

>>> node_a_sequences = {"gene": "ACTG"}
>>> node_b_sequences = {"gene": "A--G"}
>>> distance_map = {"default": 1, "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1

Use the maximum weight of all sites affected by an indel with a site-specific distance map.

>>> distance_map = {"default": 0, "map": {"gene": {1: 1, 2: 2}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
2

Use the maximum weight of all mutations at all sites affected by an indel with a mutation-specific distance map.

>>> distance_map = {"default": 0, "map": {"gene": {1: {('C', 'G'): 1, ('C', 'A'): 2}, 2: {('T', 'G'): 3, ('T', 'A'): 2}}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3

Use the maximum weight of gaps at all sites affected by an indel with a mutation-specific distance map.

>>> distance_map = {"default": 0, "map": {"gene": {1: {('C', '-'): 1, ('C', 'A'): 2}, 2: {('T', 'G'): 3, ('T', '-'): 2}}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
2

"""
For related discussion, see the PR to add mutation counts to the ncov workflow.