
distance: Count indels as single events #692

Closed
huddlej opened this issue Mar 12, 2021 · 0 comments · Fixed by #698
Labels: enhancement (New feature or request), priority: moderate (To be resolved after high priority issues)

huddlej commented Mar 12, 2021

Context

When calculating distances between sequences that differ by insertion or deletion (indel) events, augur distance currently counts each mismatched position within a single indel as a separate event. This does not make biological sense, because an indel is most likely the result of a single mutational event.

For related discussion, see the PR to add mutation counts to the ncov workflow.

Description

When calculating distances between nodes, augur should check for one or more consecutive gap characters (-) in each sequence and only count the first of these characters toward the overall distance.
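A minimal sketch of the run-collapsing logic described above, assuming two aligned sequences of equal length. The function name `count_indel_aware_events` is hypothetical and not part of augur. The sketch also assumes that a run ends when the gap switches from one sequence to the other, or when both sequences have a gap at the same position.

```python
def count_indel_aware_events(seq_a, seq_b, gap="-"):
    """Count differences between two aligned sequences, treating each run of
    consecutive gap positions in one sequence as a single event.

    Hypothetical helper for illustration; not an augur function.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    events = 0
    previous_gap_side = None  # which sequence carried the gap at the prior site

    for a, b in zip(seq_a, seq_b):
        if a == b:
            # Identical characters (including gap vs. gap) end any open run.
            gap_side = None
        elif a == gap:
            gap_side = "a"
        elif b == gap:
            gap_side = "b"
        else:
            # Ordinary substitution: always its own event.
            gap_side = None
            events += 1

        # Only the first position of a gap run counts toward the total.
        if gap_side is not None and gap_side != previous_gap_side:
            events += 1

        previous_gap_side = gap_side

    return events


print(count_indel_aware_events("ACTG", "A--G"))  # 1
print(count_indel_aware_events("ACTG", "A-T-"))  # 2
```

Tracking which sequence carries the gap keeps an insertion that directly follows a deletion from being merged into one event. Whether augur should make the same choice is an open design question.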

An additional complexity is that a site-specific distance map may specify different weights for the positions affected by an indel event. The distance function must report a single weight value for the entire indel event. Reasonable choices include the maximum weight of all affected positions or the mean weight of those positions.
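One way to reduce the site-specific weights to a single value could look like the sketch below. It assumes the indel has already been resolved to a list of affected positions and that the per-gene weights are a plain `{position: weight}` dictionary. The name `indel_site_weight` and its parameters are hypothetical.

```python
from statistics import mean


def indel_site_weight(positions, site_weights, default=0, summarize=max):
    """Return one weight for an indel spanning the given positions.

    positions: sites affected by the indel (assumed already identified).
    site_weights: {position: weight} for one gene.
    summarize: function such as max or statistics.mean that collapses
        the per-site weights into a single value.

    Hypothetical helper for illustration; not an augur function.
    """
    # Sites missing from the map fall back to the default weight.
    weights = [site_weights.get(position, default) for position in positions]
    return summarize(weights)


site_weights = {1: 1, 2: 2}
print(indel_site_weight([1, 2], site_weights))                  # 2
print(indel_site_weight([1, 2], site_weights, summarize=mean))  # 1.5
```

Passing the summary function as an argument leaves the choice between maximum and mean to the caller.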

A mutation-specific distance map may specify different weights for specific mutations that occur at sites affected by an indel. As in the solution above, one could report the maximum or mean weight of any mutation across all affected positions. If the distance map provides specific weights for gaps at sites, the weight for the indel event should be the maximum (or mean) of the gap weights at the affected sites (instead of the weights of all mutations).
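The rule for mutation-specific maps can be sketched as follows, assuming the per-gene map has the shape `{position: {(from, to): weight}}` and that the ungapped character at each affected site is known. The name `indel_mutation_weight` is hypothetical. The sketch also assumes that when only some affected sites define a gap weight, only those gap weights are summarized. The text above does not settle that case.

```python
def indel_mutation_weight(positions, ungapped, mutation_weights,
                          default=0, gap="-", summarize=max):
    """Return one weight for an indel under a mutation-specific map.

    positions: sites affected by the indel (assumed already identified).
    ungapped: {position: character} from the sequence without the gap.
    mutation_weights: {position: {(from_char, to_char): weight}}.

    Hypothetical helper for illustration; not an augur function.
    """
    gap_weights = []
    all_weights = []

    for position in positions:
        site_map = mutation_weights.get(position, {})
        all_weights.extend(site_map.values())

        # Collect the weight of the gap mutation at this site, if one is defined.
        gap_key = (ungapped[position], gap)
        if gap_key in site_map:
            gap_weights.append(site_map[gap_key])

    if gap_weights:
        # Explicit gap weights take precedence over other mutations.
        return summarize(gap_weights)
    if all_weights:
        return summarize(all_weights)
    return default


ungapped = {1: "C", 2: "T"}
without_gaps = {1: {("C", "G"): 1, ("C", "A"): 2}, 2: {("T", "G"): 3, ("T", "A"): 2}}
with_gaps = {1: {("C", "-"): 1, ("C", "A"): 2}, 2: {("T", "G"): 3, ("T", "-"): 2}}
print(indel_mutation_weight([1, 2], ungapped, without_gaps))  # 3
print(indel_mutation_weight([1, 2], ungapped, with_gaps))     # 2
```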

Examples

Add the following doctests to the get_distance_between_nodes function and use these in test-driven development of the functionality proposed above.

"""
Treat a single indel as one event.

>>> node_a_sequences = {"gene": "ACTG"}
>>> node_b_sequences = {"gene": "A--G"}
>>> distance_map = {"default": 1, "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1

Use the maximum weight of all sites affected by an indel with a site-specific distance map.

>>> distance_map = {"default": 0, "map": {"gene": {1: 1, 2: 2}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
2

Use the maximum weight of all mutations at all sites affected by an indel with a mutation-specific distance map.

>>> distance_map = {"default": 0, "map": {"gene": {1: {('C', 'G'): 1, ('C', 'A'): 2}, 2: {('T', 'G'): 3, ('T', 'A'): 2}}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3

Use the maximum weight of gaps at all sites affected by an indel with a mutation-specific distance map.

>>> distance_map = {"default": 0, "map": {"gene": {1: {('C', '-'): 1, ('C', 'A'): 2}, 2: {('T', 'G'): 3, ('T', '-'): 2}}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
2
"""
huddlej added the enhancement and priority: moderate labels on Mar 12, 2021
huddlej self-assigned this on Mar 17, 2021
huddlej added this to the Next release 11.X.X milestone on Mar 17, 2021
huddlej removed this from the Next release 11.X.X milestone on Mar 19, 2021