-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow inheritance in clades.tsv #846
Conversation
Tests are failing with Lines 53 to 61 in 780e3d3
|
Thanks Victor for pointing me to where I need to change. You can do it to if you like, it's a branch on ncov itself for everyone to edit. I'd be particularly interested in feedback on testing: is there a neater way than creating a new TSV for every test case? Would it make sense to construct the TSVs test files programmatically to be DRYer? Or is it just fine like this, as it's nice and easy to reason? |
I think this is just fine and preferable over generation to me. It's easier to reason when you can look at the static file, and there's no harm in a handful of additional test data files. |
Not following any convention, but what I did for augur/tests/test_filter_groupby.py Lines 5 to 15 in 780e3d3
Then "break" it in individual unit tests like so: augur/tests/test_filter_groupby.py Lines 67 to 68 in 780e3d3
Though this approach isn't directly usable by your code as-is since |
Alternatively, the TSVs are short enough that it would be reasonable IMO to inline them as strings ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool, @corneliusroemer. The implementation is clean and the tests are easy to follow. For this type of data, I prefer the standalone TSV files to the inline data in Python because I could easily load these data into pandas from the Python prompt and test out alternate approaches to data processing.
I made a couple of suggestions below that could make read_in_clade_definitions
a little easier to read and maintain.
Edit: We will need to include networkx
as a dependency in our Bioconda recipe, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've resolved all of your change requests @huddlej
Yes, this is important. If we don't yet do it, it'd be good to run proper tests on our Bioconda recipe too, so we would catch things like this automatically in case we happen to forget. |
Codecov Report
@@ Coverage Diff @@
## master #846 +/- ##
==========================================
+ Coverage 33.93% 34.22% +0.29%
==========================================
Files 41 41
Lines 5902 5922 +20
Branches 1507 1515 +8
==========================================
+ Hits 2003 2027 +24
+ Misses 3817 3813 -4
Partials 82 82
Continue to review full report at Codecov.
|
I've added a simple functional cram test using data from Zika. I removed all the sequences contained in the node data jsons to reduce impact on repo size. |
if not nx.is_directed_acyclic_graph(G): | ||
raise ValueError(f"Clade definitions contain cycles {list(nx.simple_cycles(G))}") | ||
|
||
for clade in islice(nx.topological_sort(G),1,None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few comments that explain what these functions do would be helpful. This is some sort of traversal, I guess, but can't tell from the code what kind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the comments I added help?
augur/clades.py
Outdated
raise ValueError(f"Clade definitions contain cycles {list(nx.simple_cycles(G))}") | ||
|
||
for clade in islice(nx.topological_sort(G),1,None): | ||
clades[clade] = clades[next(G.predecessors(clade))].copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. this is the parent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it adds definitions from parent clade to daughter clade
clades[clade][(row.gene, int(row.site)-1)] = row.alt | ||
|
||
# Convert items from dict[str, dict[(str,int),str]] to dict[str, list[(str,int,str)]] | ||
clades = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe point out that mutations defined for child overwrite inherited definitions.
@victorlin did you manually add networkx to the bioconda recipe or did the bioconda bump bot add the new dependency automatically? It's definitely there which surprised me, thought I'd had to do it manually: |
Yeah, I caught the auto-bump PR bioconda/bioconda-recipes#32842 and told them to close it in favor of the manual PR. |
Description of proposed changes
This PR extends
read_in_clade_definitions
inclades.py
to allow overwritable inheritance using the following syntax:Significant simplification of
clades.tsv
using this syntax can be observed in this ncov PR: https://github.com/nextstrain/ncov/pull/855/filesIt uses networkx to create a directed graph of clades, to traverse in topological order and check that the graph is a DAG.
Helpful errors are raised whenever the graph is not a proper DAG.
I also enable comments in clade tsv by adding the kwarg
comment='#'
in thepd.read_csv
The function is rigorously unit tested by
test_clades.py
with 100% coverage, including the raising of exceptions.Note: It may be necessary to add networkx as a dependency, if it isn't one already.
Related issue(s)
Resolves #823
Testing
Clades.py does not have any integration tests yet. I added unit tests just for the
read_in_clade_definitions
function. But given that the change is very local, I don't expect anything to break.