distance: Allow users to ignore specific characters in all distance calculations #693

huddlej · 2021-03-12T18:31:39Z

Context

When calculating distances between sequences, augur distance currently counts all mismatches toward the total distance based on weights in the distance map. However, we often want to ignore mismatches between ambiguous characters such as Ns for nucleotide sequences and Xs for amino acid sequences. While users can technically accomplish this behavior by defining a sequence-specific distance map, they would need to define such a map from each ambiguous character to all other possible characters at all sites.

Description

When calculating distances between nodes, users should be able to ignore specific characters without defining a full distance map for those characters.

Possible solutions

One potential solution is to provide a command line argument so users can specify which characters to ignore in distance calculations. This interface could look like:

augur distance \
    --ignore-characters X N -
    --tree ...

We probably want to avoid a situation where we need to maintain lists of ambiguous characters to ignore for different sequence types (even though this list is probably just a couple of characters). Allowing users to specify which characters to ignore makes the interface flexible enough to support other use cases in the future, too.

Another solution would be to define these ignored characters in the distance map like so (for a simple Hamming distance between amino acid or nucleotide sequences that ignores X's and N's):

{
    "name": "Hamming distance",
    "default": 1,
    "ignored_characters": [
        "N",
        "X"
    ],
    "map": {}
}

This implementation keeps important information about the distance calculations close to the map itself and allows different behaviors per map in the same distance command (through multiple inputs to the --map argument).

The text was updated successfully, but these errors were encountered:

huddlej · 2021-03-26T17:35:12Z

After thinking about the possible solutions proposed above, I've decided that storing the ignored characters configuration in the distance maps is the most flexible solution. Here are some potential tests for this new functionality:

"""
Ignore specific characters defined in the distance map.

>>> node_a_sequences = {"gene": "ACTGG"}
>>> node_b_sequences = {"gene": "A--GN"}
>>> distance_map = {"default": 1, ignored_characters=["-"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1
>>> distance_map = {"default": 1, ignored_characters=["-", "N"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0
"""

@benjaminotter, I've assigned this issue to you, as the first of a couple small but important issues we'd like to resolve. Can you add these doctests to the get_distance_between_nodes in augur/distance.py, confirm that they fail by running ./run_tests.sh -k get_distance_between_nodes, then write the Python code necessary to make these tests pass? See the related PR that adds support for indels in the distance command for an example of how we've done this before.

In case you didn't have to do this when you added the index command, you'll need to setup a dev environment for Augur. You can do this with Conda like so:

# Create a Conda environment for Augur development, installing third-party tools.
conda create -n augur -c conda-forge -c bioconda augur
conda activate augur

# Clone the repo.
git clone https://github.com/nextstrain/augur.git
cd augur

# Create your branch to work on this issue.
git checkout -b ignore-characters-in-distance

# Install the development version of the repo.
python3 -m pip install -e .[dev]

# Run just the tests you're working on (test-driven development mode).
# These should fail initially and pass after you add the correct code.
./run_tests.sh -k get_distance_between_nodes

# Run all tests, to make sure there haven't been any regressions elsewhere.
./run_tests.sh

Let me know if you have any questions about this issue, the suggested solution, or the dev environment. Thanks!

benjaminotter · 2021-03-26T20:03:59Z

Thank you, @huddlej, for the detailed instructions on how to approach this issue.
I'll have a look at this and let you know if there are any questions regarding the process.

huddlej added enhancement New feature or request priority: moderate To be resolved after high priority issues needs triage Needs triage by a Nextstrain team member labels Mar 12, 2021

huddlej mentioned this issue Mar 17, 2021

Count insertion/deletion events once in pairwise distances #698

Merged

huddlej removed the needs triage Needs triage by a Nextstrain team member label Mar 26, 2021

huddlej assigned benjaminotter Mar 26, 2021

benjaminotter mentioned this issue Mar 29, 2021

ignore specific characters in all distance calculations #707

Merged

huddlej closed this as completed in #707 Mar 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

distance: Allow users to ignore specific characters in all distance calculations #693

distance: Allow users to ignore specific characters in all distance calculations #693

huddlej commented Mar 12, 2021

huddlej commented Mar 26, 2021

benjaminotter commented Mar 26, 2021

distance: Allow users to ignore specific characters in all distance calculations #693

distance: Allow users to ignore specific characters in all distance calculations #693

Comments

huddlej commented Mar 12, 2021

huddlej commented Mar 26, 2021

benjaminotter commented Mar 26, 2021