You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When calculating distances between sequences, augur distance currently counts all mismatches toward the total distance based on weights in the distance map. However, we often want to ignore mismatches between ambiguous characters such as Ns for nucleotide sequences and Xs for amino acid sequences. While users can technically accomplish this behavior by defining a sequence-specific distance map, they would need to define such a map from each ambiguous character to all other possible characters at all sites.
Description
When calculating distances between nodes, users should be able to ignore specific characters without defining a full distance map for those characters.
Possible solutions
One potential solution is to provide a command line argument so users can specify which characters to ignore in distance calculations. This interface could look like:
augur distance \
--ignore-characters X N -
--tree ...
We probably want to avoid a situation where we need to maintain lists of ambiguous characters to ignore for different sequence types (even though this list is probably just a couple of characters). Allowing users to specify which characters to ignore makes the interface flexible enough to support other use cases in the future, too.
Another solution would be to define these ignored characters in the distance map like so (for a simple Hamming distance between amino acid or nucleotide sequences that ignores X's and N's):
This implementation keeps important information about the distance calculations close to the map itself and allows different behaviors per map in the same distance command (through multiple inputs to the --map argument).
The text was updated successfully, but these errors were encountered:
After thinking about the possible solutions proposed above, I've decided that storing the ignored characters configuration in the distance maps is the most flexible solution. Here are some potential tests for this new functionality:
"""Ignore specific characters defined in the distance map.>>> node_a_sequences = {"gene": "ACTGG"}>>> node_b_sequences = {"gene": "A--GN"}>>> distance_map = {"default": 1, ignored_characters=["-"], "map": {}}>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)1>>> distance_map = {"default": 1, ignored_characters=["-", "N"], "map": {}}>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)0"""
@benjaminotter, I've assigned this issue to you, as the first of a couple small but important issues we'd like to resolve. Can you add these doctests to the get_distance_between_nodes in augur/distance.py, confirm that they fail by running ./run_tests.sh -k get_distance_between_nodes, then write the Python code necessary to make these tests pass? See the related PR that adds support for indels in the distance command for an example of how we've done this before.
In case you didn't have to do this when you added the index command, you'll need to setup a dev environment for Augur. You can do this with Conda like so:
# Create a Conda environment for Augur development, installing third-party tools.
conda create -n augur -c conda-forge -c bioconda augur
conda activate augur
# Clone the repo.
git clone https://github.com/nextstrain/augur.git
cd augur
# Create your branch to work on this issue.
git checkout -b ignore-characters-in-distance
# Install the development version of the repo.
python3 -m pip install -e .[dev]
# Run just the tests you're working on (test-driven development mode).# These should fail initially and pass after you add the correct code.
./run_tests.sh -k get_distance_between_nodes
# Run all tests, to make sure there haven't been any regressions elsewhere.
./run_tests.sh
Let me know if you have any questions about this issue, the suggested solution, or the dev environment. Thanks!
Thank you, @huddlej, for the detailed instructions on how to approach this issue.
I'll have a look at this and let you know if there are any questions regarding the process.
Context
When calculating distances between sequences, augur distance currently counts all mismatches toward the total distance based on weights in the distance map. However, we often want to ignore mismatches between ambiguous characters such as
N
s for nucleotide sequences andX
s for amino acid sequences. While users can technically accomplish this behavior by defining a sequence-specific distance map, they would need to define such a map from each ambiguous character to all other possible characters at all sites.Description
When calculating distances between nodes, users should be able to ignore specific characters without defining a full distance map for those characters.
Possible solutions
One potential solution is to provide a command line argument so users can specify which characters to ignore in distance calculations. This interface could look like:
We probably want to avoid a situation where we need to maintain lists of ambiguous characters to ignore for different sequence types (even though this list is probably just a couple of characters). Allowing users to specify which characters to ignore makes the interface flexible enough to support other use cases in the future, too.
Another solution would be to define these ignored characters in the distance map like so (for a simple Hamming distance between amino acid or nucleotide sequences that ignores X's and N's):
This implementation keeps important information about the distance calculations close to the map itself and allows different behaviors per map in the same distance command (through multiple inputs to the
--map
argument).The text was updated successfully, but these errors were encountered: