Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/integer branch length #618

Merged
merged 10 commits into from
Oct 21, 2020
Merged

Fix/integer branch length #618

merged 10 commits into from
Oct 21, 2020

Conversation

rneher
Copy link
Member

@rneher rneher commented Oct 14, 2020

Description of proposed changes

make branch length a proper integer when units are mutations by counting mutations associated with branches. Our previous branch length calculation simply multiplied the expected number of mutations per site by the length of the genome. This resulted in slightly off numbers due to numerical inaccuracies or missing data.

Related issue(s)

#617

Testing

check branch length. these are up at

https://nextstrain.org/staging/ncovRefocus/europe
https://nextstrain.org/staging/ncovRefocus/north-america

Thank you for contributing to Nextstrain!

@codecov
Copy link

codecov bot commented Oct 14, 2020

Codecov Report

Merging #618 into master will decrease coverage by 0.05%.
The diff coverage is 4.54%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #618      +/-   ##
==========================================
- Coverage   28.37%   28.32%   -0.06%     
==========================================
  Files          39       39              
  Lines        5332     5345      +13     
  Branches     1306     1311       +5     
==========================================
+ Hits         1513     1514       +1     
- Misses       3766     3778      +12     
  Partials       53       53              
Impacted Files Coverage Δ
augur/refine.py 5.46% <4.54%> (+0.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efcc8be...8722b9c. Read the comment docs.

@rneher rneher requested a review from huddlej October 14, 2020 17:41
@jameshadfield
Copy link
Member

jameshadfield commented Oct 15, 2020

This gives me the following error -- is there a required treetime update or other change I need to make? (There is no error on current master). Note that I'm not inferring a timetree here!

This fails if a timetree is not being inferred. I think we need to tweak the initialisation of the TreeAnc object in this case at
https://github.com/nextstrain/augur/blob/fix/integer_branch_length/augur/refine.py#L238

$ augur refine --tree iqtree.nwk --alignment dataset.fasta --metadata metadata.tsv --root "USA/PR-B74P/2020" \
--output-tree tree.nwk --output-node-data tree.node_data.json --divergence-units mutations

augur refine is using TreeTime version 0.7.4
Traceback (most recent call last):
  File "/Users/naboo/miniconda3/envs/augur/bin/augur", line 11, in <module>
    load_entry_point('nextstrain-augur', 'console_scripts', 'augur')()
  File "/Users/naboo/github/nextstrain/augur/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/Users/naboo/github/nextstrain/augur/augur/__init__.py", line 74, in run
    return args.__command__.run(args)
  File "/Users/naboo/github/nextstrain/augur/augur/refine.py", line 254, in run
    n_muts = len([p for a,p,d in node.mutations if different(a,d)])
  File "/Users/naboo/miniconda3/envs/augur/lib/python3.6/site-packages/treetime/treeanc.py", line 37, in mutations
    return node.tt.data.differences(node.up.cseq, node.tt.data.aln[node.name], seq2_compressed=False)
  File "/Users/naboo/miniconda3/envs/augur/lib/python3.6/site-packages/treetime/treeanc.py", line 26, in compressed_sequence
    raise ValueError('Ancestral sequences are not yet inferred')
ValueError: Ancestral sequences are not yet inferred

@rneher
Copy link
Member Author

rneher commented Oct 16, 2020

Good call. Hadn't thought of the case without timetree. But hey, at least we got a useful error message! I'll push a fix

@rneher
Copy link
Member Author

rneher commented Oct 16, 2020

could you retest this? I stuck the ancestral reconstruction into the block where we do the mutation units to not slow down the step otherwise.

Adds functional tests based on the Zika build including commands that
run the usual time tree inference, use refine to label internal nodes
without running time tree, and use refine to specify the type of
divergence unit to report (mutations or mutations per site).

Note that the last of these tests currently fails because the command
asks for mutations as the divergence unit, but the input does not
include an alignment FASTA. In this case, augur refine should throw an
error alerting the user that they are missing a required input.
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code works for me under the normal time tree case and in James’s case where no time tree is requested but an alignment is still provided. This code will silently produce meaningless branch lengths if no time tree is requested and no alignment is provided, though. As refine is written now, we allow users to run the command this way to assign names to internal nodes.

I think we should add a check immediately after arguments have been parsed for this invalid request (“mutations” as a divergence unit without an alignment input), print a helpful error message to stderr, and exit.

I will push a commit in a minute that adds functional tests for refine and includes a failing test for this scenario. It also includes tests for the normal execution of refine with timetree and James’s example.

I also wonder if we can move the different function to TreeTime's seq_utils.py so augur does not need to know about the implementation of profile_maps. I added some comments inline about this. I get that this is an annoying bit of code shuffling that will require a new TreeTime release, but I think it will make this code easier to maintain and also provide a helpful new utility that we'll want to use elsewhere. I didn't know about the profile_maps before, but I would definitely re-use a function like the one in this PR for other analyses.

(A new TreeTime release would also have the nice side effect of getting rid of those VisibleDeprecationWarning messages from numpy and the amino acid models arrays.)

for node in node_data['nodes']:
if not args.timetree:
tt.infer_ancestral_sequences()
nuc_map = profile_maps['nuc']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use the "nuc_nogap" map here instead of the "nuc" map, won't that allow us to remove the checks for "-" and "N" characters in the different function below? It seems like this map will ignore those gap bases in the calculations below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my main reason for doing this check for N and - was speed (without profiling). I think we should use the same alphabet as the build though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Thank you for clarifying this!

huddlej and others added 3 commits October 20, 2020 10:45
Makes minor edits that should improve readability for future maintainers
who may not immediately recognize the meaning of the single-letter
abbreviations.
Rewrites the functional test for refine with mutational divergence units
to expect an error message and exit code. Also, updates the error
logging from augur refine to write to stderr instead of stdout both for
improved user experience and to simplify functional tests.
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I updated the functional tests to pass with the new error message.

for node in node_data['nodes']:
if not args.timetree:
tt.infer_ancestral_sequences()
nuc_map = profile_maps['nuc']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Thank you for clarifying this!

@huddlej huddlej merged commit 295ad4f into master Oct 21, 2020
@huddlej huddlej deleted the fix/integer_branch_length branch October 21, 2020 22:58
@huddlej huddlej added this to the Feature release 10.1.0 milestone Oct 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants