Allow inheritance in clades.tsv #846

corneliusroemer · 2022-02-04T09:57:08Z

Description of proposed changes

This PR extends read_in_clade_definitions in clades.py to allow overwritable inheritance using the following syntax:

    clade      gene    site     alt
    Clade_1    ctpE    81       D
    Clade_2    nuc     30642    T
    Clade_3    nuc     444296   A
    Clade_3    S       1        P
    # Clade_4 inherits from Clade_3
    Clade_4    clade   Clade_3
    Clade_4    pks8    634      T
    # Inherited allele can be overwritten
    Clade_4    S       1        L

Significant simplification of clades.tsv using this syntax can be observed in this ncov PR: https://github.com/nextstrain/ncov/pull/855/files

It uses networkx to build a directed graph of clades, checks that the graph is a DAG, and then traverses it in topological order so that parents are processed before their children.

Helpful errors are raised whenever the graph is not a proper DAG.
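For readers unfamiliar with networkx, here is a minimal sketch of the kind of DAG check described above. It is an illustration under assumptions, not the code in this PR: the helper name `check_clade_graph` and the `(parent, child)` edge list are made up for the example, while `nx.is_directed_acyclic_graph` and `nx.simple_cycles` are the real networkx calls.

```python
import networkx as nx


def check_clade_graph(edges):
    """Build a directed graph from (parent, child) pairs and reject cycles.

    Hypothetical helper for illustration only.
    """
    graph = nx.DiGraph()
    graph.add_edges_from(edges)
    if not nx.is_directed_acyclic_graph(graph):
        # simple_cycles lists the offending cycles so the error names them
        cycles = list(nx.simple_cycles(graph))
        raise ValueError(f"Clade definitions contain cycles {cycles}")
    return graph


# A valid definition: Clade_4 inherits from Clade_3
ok = check_clade_graph([("Clade_3", "Clade_4")])
print(list(nx.topological_sort(ok)))

# An invalid definition: the two clades inherit from each other
try:
    check_clade_graph([("Clade_3", "Clade_4"), ("Clade_4", "Clade_3")])
except ValueError as err:
    print(err)
```

Reporting the cycles in the message tells the user which clade rows to fix.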

I also enable comments in the clades TSV by passing the kwarg comment='#' to pd.read_csv.
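A small sketch of what `comment='#'` does in `pd.read_csv`, assuming a tab-separated table shaped like the example above. The inline text and variable names are illustrative only; the real function reads from a file.

```python
import io

import pandas as pd

# Illustrative table; the inheritance row has only three fields
tsv_text = (
    "clade\tgene\tsite\talt\n"
    "Clade_3\tS\t1\tP\n"
    "# Clade_4 inherits from Clade_3\n"
    "Clade_4\tclade\tClade_3\n"
    "Clade_4\tS\t1\tL\n"
)

# Lines starting with '#' are dropped; short rows are padded with NaN
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", comment="#")
print(df)
```

Because the inheritance row stores the parent name in the `site` column, that column is read as strings rather than integers, so sites need an explicit `int(...)` conversion later.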

The function is rigorously unit tested by test_clades.py with 100% coverage, including the raising of exceptions.

Note: It may be necessary to add networkx as a dependency, if it isn't one already.

Related issue(s)

Resolves #823

Testing

clades.py does not have any integration tests yet. I added unit tests only for the read_in_clade_definitions function, but given that the change is very local, I don't expect anything to break.

victorlin · 2022-02-04T17:50:02Z

Tests are failing with ModuleNotFoundError – you'll want to add networkx here:

augur/setup.py

Lines 53 to 61 in 780e3d3

    
    install_requires = [
        "bcbio-gff >=0.6.0, ==0.6.*",
        "biopython >=1.67, !=1.77, !=1.78",
        "jsonschema >=3.0.0, ==3.*",
        "packaging >=19.2",
        "pandas >=1.0.0, ==1.*",
        "phylo-treetime ==0.8.*",
        "xopen >=1.0.1, ==1.*"
    ],

corneliusroemer · 2022-02-04T18:40:02Z

Thanks, Victor, for pointing me to where I need to make the change. You can do it too if you like: it's a branch on ncov itself for everyone to edit.

I'd be particularly interested in feedback on testing: is there a neater way than creating a new TSV for every test case?

Would it make sense to construct the TSV test files programmatically to be DRYer? Or is it just fine like this, as it's nice and easy to reason about?

tsibley · 2022-02-04T18:53:13Z

Would it make sense to construct the TSV test files programmatically to be DRYer? Or is it just fine like this, as it's nice and easy to reason about?

I think this is just fine, and to me it's preferable to generation. It's easier to reason about when you can look at the static file, and there's no harm in a handful of additional test data files.

victorlin · 2022-02-04T18:54:08Z

I'd be particularly interested in feedback on testing: is there a neater way than creating a new TSV for every test case?

Not following any convention, but what I did for test_filter_groupby.py was create valid source data (pd.DataFrame) in a pytest fixture:

augur/tests/test_filter_groupby.py

Lines 5 to 15 in 780e3d3

    
    @pytest.fixture
    def valid_metadata() -> pd.DataFrame:
        columns = ['strain', 'date', 'country']
        data = [
            ("SEQ_1","2020-01-XX","A"),
            ("SEQ_2","2020-02-01","A"),
            ("SEQ_3","2020-03-01","B"),
            ("SEQ_4","2020-04-01","B"),
            ("SEQ_5","2020-05-01","B")
        ]
        return pd.DataFrame.from_records(data, columns=columns).set_index('strain')

Then "break" it in individual unit tests like so:

augur/tests/test_filter_groupby.py

Lines 67 to 68 in 780e3d3

    
    metadata = valid_metadata.copy()
    metadata.at["SEQ_2", "date"] = "XXXX-02-01"

Though this approach isn't directly usable by your code as-is since read_in_clade_definitions takes a file instead of DataFrame. You could refactor to have something like read_in_clade_definitions_from_df as an intermediate, which is then testable using my approach, but that might be too much for this PR.

tsibley · 2022-02-04T18:59:28Z

Alternatively, the TSVs are short enough that it would be reasonable IMO to inline them as strings ("""…""") directly into the test file and then write them out to temp files at runtime for passing to read_in_clade_definitions(). (pytest offers some features for passing a ready-to-use temp file to your test functions.) This maintains the nice property @victorlin mentions of being able to see the input data right above the expected output. I would advocate against constructing a DataFrame directly, as then we're not testing the full function which in practice does actually read a TSV file.
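A rough sketch of the inline-string approach, in case it helps. It assumes pytest's `tmp_path` fixture for the temp file. The function `load_clade_table` is a hypothetical stand-in for read_in_clade_definitions() so the snippet runs without augur installed, and the expected values are only for this toy input.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Input data lives right next to the assertions that check it
CLADES_TSV = """\
clade\tgene\tsite\talt
Clade_1\tctpE\t81\tD
# comment line
Clade_2\tnuc\t30642\tT
"""


def load_clade_table(path):
    """Hypothetical stand-in for the function under test."""
    return pd.read_csv(path, sep="\t", comment="#")


def test_inline_tsv(tmp_path):
    # tmp_path is the per-test temporary directory that pytest provides
    clade_file = tmp_path / "clades.tsv"
    clade_file.write_text(CLADES_TSV)
    table = load_clade_table(clade_file)
    assert list(table["clade"]) == ["Clade_1", "Clade_2"]


# Outside pytest, the same test can be exercised by hand
with tempfile.TemporaryDirectory() as tmp:
    test_inline_tsv(Path(tmp))
```

The file is still written to disk and read back, so the full read path is tested rather than a pre-built DataFrame.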

huddlej

This is really cool, @corneliusroemer. The implementation is clean and the tests are easy to follow. For this type of data, I prefer the standalone TSV files to the inline data in Python because I could easily load these data into pandas from the Python prompt and test out alternate approaches to data processing.

I made a couple of suggestions below that could make read_in_clade_definitions a little easier to read and maintain.

Edit: We will need to include networkx as a dependency in our Bioconda recipe, too.

corneliusroemer

I think I've resolved all of your change requests @huddlej

corneliusroemer · 2022-02-07T05:27:28Z

We will need to include networkx as a dependency in our Bioconda recipe, too.

Yes, this is important. If we don't yet do it, it'd be good to run proper tests on our Bioconda recipe too, so we would catch things like this automatically in case we happen to forget.

codecov · 2022-02-07T05:30:34Z

Codecov Report

Merging #846 (c4bc45a) into master (84f11b4) will increase coverage by 0.29%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
+ Coverage   33.93%   34.22%   +0.29%     
==========================================
  Files          41       41              
  Lines        5902     5922      +20     
  Branches     1507     1515       +8     
==========================================
+ Hits         2003     2027      +24     
+ Misses       3817     3813       -4     
  Partials       82       82

Impacted Files       Coverage Δ
augur/clades.py      33.33% <100.00%> (+20.96%) ⬆️
augur/__main__.py    0.00% <0.00%> (-75.00%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

corneliusroemer · 2022-02-07T08:01:47Z

I've added a simple functional cram test using data from Zika. I removed all the sequences contained in the node data JSON files to reduce the impact on repo size.

rneher · 2022-02-07T08:24:46Z

augur/clades.py

+    if not nx.is_directed_acyclic_graph(G):
+        raise ValueError(f"Clade definitions contain cycles {list(nx.simple_cycles(G))}")
+
+    for clade in islice(nx.topological_sort(G),1,None):


A few comments that explain what these functions do would be helpful. This is some sort of traversal, I guess, but I can't tell from the code what kind.

corneliusroemer · 2022-02-07

Do the comments I added help?

rneher · 2022-02-07T08:25:10Z

augur/clades.py

+        raise ValueError(f"Clade definitions contain cycles {list(nx.simple_cycles(G))}")
+
+    for clade in islice(nx.topological_sort(G),1,None):
+        clades[clade] = clades[next(G.predecessors(clade))].copy()


Same here. Is this the parent?

And it adds definitions from the parent clade to the daughter clade.
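To make the traversal concrete, here is a small sketch of inheritance via topological order. It is not the PR's code: it assumes a hypothetical sentinel node `ROOT` that is the parent of every clade with no explicit parent, which is one way to explain why the first sorted node is skipped with `islice`. The `own` dictionary of per-clade alleles is made up for the example.

```python
from itertools import islice

import networkx as nx

ROOT = 0  # hypothetical sentinel parent for clades that inherit from nothing

graph = nx.DiGraph()
graph.add_edge(ROOT, "Clade_3")
graph.add_edge("Clade_3", "Clade_4")
graph.add_edge("Clade_4", "Clade_5")

# Alleles each clade defines itself, keyed by (gene, zero-based position)
own = {
    "Clade_3": {("S", 0): "P"},
    "Clade_4": {("pks8", 633): "T"},
    "Clade_5": {},
}

clades = {ROOT: {}}
# topological_sort yields every parent before its children; ROOT is the only
# node without a parent, so it comes first and islice(..., 1, None) skips it
for clade in islice(nx.topological_sort(graph), 1, None):
    # predecessors() returns an iterator; each clade has exactly one parent here
    parent = next(graph.predecessors(clade))
    clades[clade] = clades[parent].copy()  # start from the parent's alleles
    clades[clade].update(own[clade])       # then add the clade's own alleles

print(clades["Clade_5"])
```

Because parents are always finished before their children are visited, grandparent definitions flow down the chain.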

rneher · 2022-02-07T08:26:24Z

augur/clades.py

+            clades[clade][(row.gene, int(row.site)-1)] = row.alt
+
+    # Convert items from dict[str, dict[(str,int),str]] to dict[str, list[(str,int,str)]]
+    clades = {


Maybe point out that mutations defined for child overwrite inherited definitions.
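A short sketch of why child definitions overwrite inherited ones, assuming alleles are kept in a dict keyed by `(gene, zero-based position)` as the comment in the diff suggests. The variable names and example rows are illustrative.

```python
# Alleles inherited from the parent, keyed by (gene, zero-based position)
parent_alleles = {("nuc", 444295): "A", ("S", 0): "P"}

# The child starts from a copy, so the parent's dict is left untouched
child_alleles = parent_alleles.copy()

# Rows defined for the child: (gene, one-based site, alt)
child_rows = [("pks8", 634, "T"), ("S", 1, "L")]
for gene, site, alt in child_rows:
    # Same key as the inherited ("S", 0) entry, so assignment replaces "P"
    child_alleles[(gene, int(site) - 1)] = alt

# Flatten dict[(gene, pos)] -> alt into a list of (gene, pos, alt) tuples
as_list = [(gene, pos, alt) for (gene, pos), alt in child_alleles.items()]
print(as_list)
```

Keying on `(gene, position)` means a clade can hold only one allele per site, so overwriting needs no special handling.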

corneliusroemer · 2022-02-10T09:10:08Z

@victorlin did you manually add networkx to the bioconda recipe or did the bioconda bump bot add the new dependency automatically?

It's definitely there, which surprised me; I thought I'd have to do it manually:

https://anaconda.org/bioconda/augur/files

tsibley · 2022-02-10T19:00:24Z

@corneliusroemer Manually: bioconda/bioconda-recipes@1432155

victorlin · 2022-02-10T19:51:21Z

Yeah, I caught the auto-bump PR bioconda/bioconda-recipes#32842 and told them to close it in favor of the manual PR.

feat: allow inheritance in clades.tsv

7a8675c

corneliusroemer requested review from jameshadfield and rneher February 4, 2022 09:57

corneliusroemer mentioned this pull request Feb 4, 2022

feat: add clade inheritance, requires new augur clades feature nextstrain/ncov#855

Merged

4 tasks

huddlej requested changes Feb 4, 2022

refactor: make tests of multiple inheritance and missing parents explicit

e44b0ae

corneliusroemer commented Feb 7, 2022

corneliusroemer requested a review from huddlej February 7, 2022 05:17

dep: add networkx > 2.5 as dependency

00d8bc2

corneliusroemer commented Feb 7, 2022

test: add basic functional clades.t test and data

8d5f1ce

rneher requested changes Feb 7, 2022

refactor: add comments to clade inheritance

c4bc45a

corneliusroemer requested a review from rneher February 7, 2022 09:05

rneher approved these changes Feb 7, 2022

huddlej approved these changes Feb 7, 2022

huddlej added this to the Major release 14.0.0 milestone Feb 7, 2022

huddlej merged commit ac48e7b into master Feb 7, 2022

huddlej deleted the feat/clades-inheritance branch February 7, 2022 22:56
