ignore specific characters in all distance calculations #707

benjaminotter · 2021-03-29T16:35:42Z

Description of proposed changes

Adds support for ignoring specific characters during distance calculation. The ignored characters are defined in the distance map as a list of characters:

{
    "default": 1,
    "ignored_characters": ["N", "-"],
    "map": {}
}

Specifying these characters is optional, so the change stays compatible with existing distance maps that don't list any characters to ignore.

Related issue(s)

Fixes #693

Testing

Added doctests as suggested in #693:

Ignore specific characters defined in the distance map.

>>> node_a_sequences = {"gene": "ACTGG"}
>>> node_b_sequences = {"gene": "A--GN"}
>>> distance_map = {"default": 1, "ignored_characters":["-"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1
>>> distance_map = {"default": 1, "ignored_characters":["-", "N"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0
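
To make the doctest results above easier to follow, here is a minimal, self-contained sketch of the idea. It is not the code in augur/distance.py: `sketch_distance` is a hypothetical name, and the sketch assumes every mismatch costs the `"default"` weight and never consults the `"map"` entry.

```python
def sketch_distance(node_a_sequences, node_b_sequences, distance_map):
    """Sum the default weight over mismatching sites, skipping ignored characters.

    Hypothetical simplification for illustration: the "map" entry of the
    distance map is not used, so every counted mismatch costs "default".
    """
    # The key is optional; fall back to ignoring nothing.
    ignored_characters = distance_map.get("ignored_characters", [])
    distance = 0
    for gene, seq_a in node_a_sequences.items():
        seq_b = node_b_sequences[gene]
        for char_a, char_b in zip(seq_a, seq_b):
            # A site counts only if the characters differ and neither is ignored.
            if (
                char_a != char_b
                and char_a not in ignored_characters
                and char_b not in ignored_characters
            ):
                distance += distance_map["default"]
    return distance


node_a_sequences = {"gene": "ACTGG"}
node_b_sequences = {"gene": "A--GN"}

only_gaps = {"default": 1, "ignored_characters": ["-"], "map": {}}
gaps_and_n = {"default": 1, "ignored_characters": ["-", "N"], "map": {}}
nothing_ignored = {"default": 1, "map": {}}

print(sketch_distance(node_a_sequences, node_b_sequences, only_gaps))
print(sketch_distance(node_a_sequences, node_b_sequences, gaps_and_n))
print(sketch_distance(node_a_sequences, node_b_sequences, nothing_ignored))
```

With only "-" ignored, the G/N site is the single counted mismatch; adding "N" removes it; with no ignored characters all three differing sites count.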

codecov · 2021-03-29T16:37:29Z

Codecov Report

Merging #707 (328c1ff) into master (5c12fd8) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
+ Coverage   30.89%   30.90%   +0.01%     
==========================================
  Files          41       41              
  Lines        5648     5649       +1     
  Branches     1365     1365              
==========================================
+ Hits         1745     1746       +1     
  Misses       3828     3828              
  Partials       75       75

Impacted Files       Coverage Δ
augur/distance.py    37.79% <100.00%> (+0.49%)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

huddlej

Thank you, @benjaminotter! This looks good to me. I made a couple comments about alternate ways to think about this implementation, but this is good to merge.

huddlej · 2021-03-30T17:40:53Z

augur/distance.py

+    if "ignored_characters" in distance_map:
+        ignored_characters = distance_map["ignored_characters"]
+    else:
+        ignored_characters = []


This approach is great and very readable, although you can also use the following idiom for slightly less verbose code:

ignored_characters = distance_map.get("ignored_characters", [])
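
As a quick check that the two forms behave the same, the following sketch compares them on a distance map with and without the optional key. The helper names `ignored_verbose` and `ignored_concise` are hypothetical and exist only for this comparison.

```python
def ignored_verbose(distance_map):
    """Explicit membership test, as in the original diff."""
    if "ignored_characters" in distance_map:
        return distance_map["ignored_characters"]
    else:
        return []


def ignored_concise(distance_map):
    """Same lookup using dict.get with a default value."""
    return distance_map.get("ignored_characters", [])


with_key = {"default": 1, "ignored_characters": ["N", "-"], "map": {}}
without_key = {"default": 1, "map": {}}

for distance_map in (with_key, without_key):
    # Both helpers return the same list for either kind of distance map.
    print(ignored_verbose(distance_map), ignored_concise(distance_map))
```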

huddlej · 2021-03-30T17:48:29Z

augur/distance.py

+            if node_a_sequences[gene][site] not in ignored_characters and node_b_sequences[gene][site] not in ignored_characters:
+                if node_a_sequences[gene][site] != node_b_sequences[gene][site]:


This looks good and works for me. One minor consideration is that the tests in the first line will almost always evaluate to True, while the test in the second line will be True much less often. In practice, the order of these boolean checks won't make the distance calculations much slower, but you could consider an approach like this, which uses Python's short-circuit evaluation of logical operators to check ignored characters only when there is a mismatch:

if (
    node_a_sequences[gene][site] != node_b_sequences[gene][site]
    and node_a_sequences[gene][site] not in ignored_characters
    and node_b_sequences[gene][site] not in ignored_characters
):
    # Do some stuff
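
The effect of the reordering can be illustrated with a small sketch. Python's `and` stops at the first false operand (short-circuit evaluation), so putting the mismatch test first skips the membership tests at matching sites. `CountingSet` and `count_checks` are hypothetical helpers written for this illustration, not part of augur.

```python
class CountingSet:
    """Set wrapper that counts membership tests (illustration only)."""

    def __init__(self, items):
        self._items = set(items)
        self.checks = 0

    def __contains__(self, item):
        self.checks += 1
        return item in self._items


def count_checks(seq_a, seq_b, mismatch_first):
    """Return (distance, number of ignored-character lookups performed)."""
    ignored = CountingSet(["N", "-"])
    distance = 0
    for a, b in zip(seq_a, seq_b):
        if mismatch_first:
            # Membership tests run only when the characters differ.
            counted = a != b and a not in ignored and b not in ignored
        else:
            # Membership tests run at every site, matching or not.
            counted = a not in ignored and b not in ignored and a != b
        if counted:
            distance += 1
    return distance, ignored.checks


ignored_first = count_checks("ACTGG", "A--GN", mismatch_first=False)
mismatch_first = count_checks("ACTGG", "A--GN", mismatch_first=True)
print(ignored_first, mismatch_first)
```

Both orderings return the same distance; only the number of lookups differs, and the gap grows with the share of matching sites.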

benjaminotter · 2021-03-31T07:12:48Z

@huddlej Thank you for the comments. I implemented both improvements in the latest commit.

Adds support and doctests for ignoring specific characters during dis…

ad79cf5

…tance calculation. Ignored characters can be specified in the distance map as a list of characters.

benjaminotter requested a review from huddlej March 29, 2021 16:35

huddlej approved these changes Mar 30, 2021

huddlej marked this pull request as ready for review March 30, 2021 17:50

Refactoring of ignored_character list and boolean checks.

328c1ff

huddlej merged commit f0ff957 into master Mar 31, 2021

huddlej deleted the ignore-characters-in-distance branch March 31, 2021 16:17

huddlej added this to the Next release milestone Mar 31, 2021
