-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ignore specific characters in all distance calculations #707
Conversation
…tance calculation. Ignored characters can be specified in the distance map as a list of characters.
Codecov Report
@@ Coverage Diff @@
## master #707 +/- ##
==========================================
+ Coverage 30.89% 30.90% +0.01%
==========================================
Files 41 41
Lines 5648 5649 +1
Branches 1365 1365
==========================================
+ Hits 1745 1746 +1
Misses 3828 3828
Partials 75 75
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @benjaminotter! This looks good to me. I made a couple comments about alternate ways to think about this implementation, but this is good to merge.
augur/distance.py
Outdated
if "ignored_characters" in distance_map: | ||
ignored_characters = distance_map["ignored_characters"] | ||
else: | ||
ignored_characters = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach is great and very readable, although you can also use the following idiom for slightly less verbose code:
ignored_characters = distance_map.get("ignored_characters", [])
augur/distance.py
Outdated
if node_a_sequences[gene][site] not in ignored_characters and node_b_sequences[gene][site] not in ignored_characters: | ||
if node_a_sequences[gene][site] != node_b_sequences[gene][site]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good and works for me. One minor consideration is that the tests in the first line will almost always evaluate to True
, while the second line will be True
much less often. In practice the order of these boolean checks won't make the distance calculations that much slower, but you could consider an approach like this that uses Python's greedy evaluation of logical operators to only check ignored characters when there is a mismatch:
if (node_a_sequences[gene][site] != node_b_sequences[gene][site] and
node_a_sequences[gene][site] not in ignored_characters and
node_b_sequences[gene][site] not in ignored_characters):
# Do some stuff
@huddlej Thank you for the comments, i implemented both improvements in the latest commit. |
Description of proposed changes
Adds support for ignoring specific characters during distance calculation. The ignored characters are defined in the distance map as a list of characters:
The specification of those characters in the distance map is optional, making it compatible with distance maps that don't specify any characters to ignore.
Related issue(s)
Fixes #693
Testing
Added doctests as suggested in #693: