Allow inheritance in clades.tsv #846

Merged: 5 commits into master from feat/clades-inheritance, Feb 7, 2022

@corneliusroemer
Member

corneliusroemer commented Feb 4, 2022

Description of proposed changes

This PR extends read_in_clade_definitions in clades.py to allow overwritable inheritance using the following syntax:

    clade      gene    site     alt
    Clade_1    ctpE    81       D
    Clade_2    nuc     30642    T
    Clade_3    nuc     444296   A
    Clade_3    S       1        P
    # Clade_4 inherits from Clade_3
    Clade_4    clade   Clade_3
    Clade_4    pks8    634      T
    # Inherited allele can be overwritten
    Clade_4    S       1        L

Significant simplification of clades.tsv using this syntax can be observed in this ncov PR: https://github.com/nextstrain/ncov/pull/855/files

It uses networkx to build a directed graph of clades, check that the graph is a DAG, and traverse it in topological order so that each parent clade is resolved before its children.

Helpful errors are raised whenever the graph is not a proper DAG.
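
As a rough illustration of the inheritance rules described above, here is a minimal standalone sketch. It is not the code from this PR: it swaps networkx for the standard-library `graphlib` module (Python 3.9+) so it runs without extra dependencies, and the names `resolve_clades` and `EXAMPLE_TSV`, the whitespace-based parsing, and the exact error messages are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Example input mirroring the syntax shown in the PR description.
EXAMPLE_TSV = """\
clade      gene    site     alt
Clade_3    nuc     444296   A
Clade_3    S       1        P
# Clade_4 inherits from Clade_3
Clade_4    clade   Clade_3
Clade_4    pks8    634      T
# Inherited allele can be overwritten
Clade_4    S       1        L
"""


def resolve_clades(tsv_text):
    """Hypothetical helper: return {clade: {(gene, zero_based_site): alt}}."""
    own = {}     # alleles defined directly on each clade
    parent = {}  # clade -> the single clade it inherits from
    rows = [line.split() for line in tsv_text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    for fields in rows[1:]:  # rows[0] is the header row
        clade, gene, site = fields[:3]
        own.setdefault(clade, {})
        if gene == "clade":
            # Inheritance row: the "site" column holds the parent's name.
            if clade in parent:
                raise ValueError(f"{clade} inherits from more than one clade")
            parent[clade] = site
        else:
            # Sites are stored zero-based, as in the PR's int(row.site)-1.
            own[clade][(gene, int(site) - 1)] = fields[3]

    missing = set(parent.values()) - set(own)
    if missing:
        raise ValueError(f"Parent clades are not defined: {sorted(missing)}")

    # graphlib expects a mapping of node -> set of predecessors.
    graph = {c: ({parent[c]} if c in parent else set()) for c in own}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"Clade definitions contain cycles: {err.args[1]}") from err

    resolved = {}
    for clade in order:  # parents always come before their children
        alleles = dict(resolved[parent[clade]]) if clade in parent else {}
        alleles.update(own[clade])  # the child's own alleles win
        resolved[clade] = alleles
    return resolved


resolved = resolve_clades(EXAMPLE_TSV)
```

With this input, `resolved["Clade_4"]` carries the inherited `nuc` allele, its own `pks8` allele, and `L` rather than `P` at the `S` site.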

I also enable comments in the clades TSV by passing the kwarg comment='#' to pd.read_csv.
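
For readers unfamiliar with that option, the following small sketch shows how `comment='#'` behaves in `pd.read_csv`. The inline data and variable names are made up for the example, and the other arguments used in clades.py may differ.

```python
import io

import pandas as pd

# Illustrative tab-separated input with full-line comments.
tsv_text = (
    "clade\tgene\tsite\talt\n"
    "# Clade_1 is defined by a single amino acid change\n"
    "Clade_1\tctpE\t81\tD\n"
    "# Clade_2 is defined by a nucleotide change\n"
    "Clade_2\tnuc\t30642\tT\n"
)

# Lines starting with '#' are skipped entirely; anything after a '#'
# elsewhere on a line would be dropped as well.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", comment="#")
```

Only the two data rows end up in `df`; the comment lines never reach the parser's output.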

The function is rigorously unit tested by test_clades.py with 100% coverage, including the raising of exceptions.

Note: It may be necessary to add networkx as a dependency, if it isn't one already.

Related issue(s)

Resolves #823

Testing

clades.py does not have any integration tests yet, so I added unit tests for the read_in_clade_definitions function only. Given that the change is very local, I don't expect anything to break.

@victorlin
Member

Tests are failing with ModuleNotFoundError – you'll want to add networkx here:

augur/setup.py

Lines 53 to 61 in 780e3d3

install_requires = [
    "bcbio-gff >=0.6.0, ==0.6.*",
    "biopython >=1.67, !=1.77, !=1.78",
    "jsonschema >=3.0.0, ==3.*",
    "packaging >=19.2",
    "pandas >=1.0.0, ==1.*",
    "phylo-treetime ==0.8.*",
    "xopen >=1.0.1, ==1.*"
],

@corneliusroemer
Member Author

Thanks, Victor, for pointing me to what I need to change. You can do it too if you like; it's a branch on ncov itself for everyone to edit.

I'd be particularly interested in feedback on testing: is there a neater way than creating a new TSV for every test case?

Would it make sense to construct the TSV test files programmatically to be DRYer? Or is it just fine like this, as it's nice and easy to reason about?

@tsibley
Member

tsibley commented Feb 4, 2022

> Would it make sense to construct the TSV test files programmatically to be DRYer? Or is it just fine like this, as it's nice and easy to reason about?

I think this is just fine and preferable over generation to me. It's easier to reason when you can look at the static file, and there's no harm in a handful of additional test data files.

@victorlin
Member

> I'd be particularly interested in feedback on testing: is there a neater way than creating a new TSV for every test case?

Not following any convention, but what I did for test_filter_groupby.py was create valid source data (pd.DataFrame) in a pytest fixture:

@pytest.fixture
def valid_metadata() -> pd.DataFrame:
    columns = ['strain', 'date', 'country']
    data = [
        ("SEQ_1","2020-01-XX","A"),
        ("SEQ_2","2020-02-01","A"),
        ("SEQ_3","2020-03-01","B"),
        ("SEQ_4","2020-04-01","B"),
        ("SEQ_5","2020-05-01","B")
    ]
    return pd.DataFrame.from_records(data, columns=columns).set_index('strain')

Then "break" it in individual unit tests like so:

metadata = valid_metadata.copy()
metadata.at["SEQ_2", "date"] = "XXXX-02-01"

This approach isn't directly usable by your code as-is, though, since read_in_clade_definitions takes a file instead of a DataFrame. You could refactor to have something like read_in_clade_definitions_from_df as an intermediate, which would then be testable using my approach, but that might be too much for this PR.

@tsibley
Member

tsibley commented Feb 4, 2022

Alternatively, the TSVs are short enough that it would be reasonable IMO to inline them as strings ("""…""") directly into the test file and then write them out to temp files at runtime for passing to read_in_clade_definitions(). (pytest offers some features for passing a ready-to-use temp file to your test functions.) This maintains the nice property @victorlin mentions of being able to see the input data right above the expected output. I would advocate against constructing a DataFrame directly, as then we're not testing the full function which in practice does actually read a TSV file.
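
The inline-string approach could look roughly like the sketch below. It relies on pytest's built-in `tmp_path` fixture, which hands each test a fresh temporary directory as a `pathlib.Path`. The `read_rows` function is a hypothetical stand-in for `read_in_clade_definitions()` so that the example runs on its own; it is not the real function.

```python
import csv
from pathlib import Path

# The input data sits right next to the assertions that depend on it.
CLADES_TSV = """\
clade\tgene\tsite\talt
Clade_1\tctpE\t81\tD
Clade_2\tnuc\t30642\tT
"""


def read_rows(path):
    """Hypothetical stand-in for read_in_clade_definitions()."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def test_reads_inline_tsv(tmp_path: Path):
    # pytest injects tmp_path; the file is cleaned up automatically.
    clades_file = tmp_path / "clades.tsv"
    clades_file.write_text(CLADES_TSV)

    rows = read_rows(clades_file)

    assert [row["clade"] for row in rows] == ["Clade_1", "Clade_2"]
    assert rows[1]["site"] == "30642"
```

Because the test still writes and reads a real file, it exercises the same file-reading path as the standalone TSV fixtures do.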

Contributor

@huddlej huddlej left a comment

This is really cool, @corneliusroemer. The implementation is clean and the tests are easy to follow. For this type of data, I prefer the standalone TSV files to the inline data in Python because I could easily load these data into pandas from the Python prompt and test out alternate approaches to data processing.

I made a couple of suggestions below that could make read_in_clade_definitions a little easier to read and maintain.

Edit: We will need to include networkx as a dependency in our Bioconda recipe, too.

Member Author

@corneliusroemer corneliusroemer left a comment

I think I've resolved all of your change requests @huddlej

@corneliusroemer
Member Author

> We will need to include networkx as a dependency in our Bioconda recipe, too.

Yes, this is important. If we don't yet do it, it'd be good to run proper tests on our Bioconda recipe too, so we would catch things like this automatically in case we happen to forget.

@codecov

codecov bot commented Feb 7, 2022

Codecov Report

Merging #846 (c4bc45a) into master (84f11b4) will increase coverage by 0.29%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
+ Coverage   33.93%   34.22%   +0.29%     
==========================================
  Files          41       41              
  Lines        5902     5922      +20     
  Branches     1507     1515       +8     
==========================================
+ Hits         2003     2027      +24     
+ Misses       3817     3813       -4     
  Partials       82       82              

| Impacted Files | Coverage Δ |
| --- | --- |
| augur/clades.py | 33.33% <100.00%> (+20.96%) ⬆️ |
| augur/__main__.py | 0.00% <0.00%> (-75.00%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@corneliusroemer
Member Author

I've added a simple functional cram test using data from Zika. I removed all the sequences contained in the node data jsons to reduce impact on repo size.

if not nx.is_directed_acyclic_graph(G):
    raise ValueError(f"Clade definitions contain cycles {list(nx.simple_cycles(G))}")

for clade in islice(nx.topological_sort(G),1,None):
Member

a few comments that explain what these functions do would be helpful. This is some sort of traversal, I guess, but can't tell from the code what kind.

Member Author

Do the comments I added help?
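
For context on the networkx calls discussed in this thread, here is a small standalone sketch of what each one returns on a toy graph. The node names, including the `root` node, are assumptions for illustration; the snippet under review skips the first node of the topological order with `islice(..., 1, None)`, which suggests a similar starting node, but the sketch does not reproduce the PR's code.

```python
import networkx as nx

# Edges point from parent clade to child clade.
G = nx.DiGraph()
G.add_edge("root", "Clade_3")
G.add_edge("Clade_3", "Clade_4")

# True while no clade is (directly or indirectly) its own ancestor.
is_dag = nx.is_directed_acyclic_graph(G)

# A topological sort lists every node after all of its predecessors,
# so a parent is always visited before its children.
order = list(nx.topological_sort(G))

# predecessors() iterates over the nodes with an edge into the given node;
# with a single parent per clade, next() returns that parent.
parent_of_clade_4 = next(G.predecessors("Clade_4"))

# A back edge creates a cycle, which simple_cycles() can report.
G.add_edge("Clade_4", "Clade_3")
is_dag_after = nx.is_directed_acyclic_graph(G)
cycles = list(nx.simple_cycles(G))
```

Listing the offending cycle in the error message, as the snippet under review does, tells the user which clade definitions to fix.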

augur/clades.py Outdated
    raise ValueError(f"Clade definitions contain cycles {list(nx.simple_cycles(G))}")

for clade in islice(nx.topological_sort(G),1,None):
    clades[clade] = clades[next(G.predecessors(clade))].copy()
Member

same here. this is the parent?

Member

and it adds definitions from parent clade to daughter clade

    clades[clade][(row.gene, int(row.site)-1)] = row.alt

# Convert items from dict[str, dict[(str,int),str]] to dict[str, list[(str,int,str)]]
clades = {
Member

Maybe point out that mutations defined for child overwrite inherited definitions.

@huddlej huddlej added this to the Major release 14.0.0 milestone Feb 7, 2022
@huddlej huddlej merged commit ac48e7b into master Feb 7, 2022
@huddlej huddlej deleted the feat/clades-inheritance branch February 7, 2022 22:56
@corneliusroemer
Member Author

@victorlin did you manually add networkx to the bioconda recipe or did the bioconda bump bot add the new dependency automatically?

It's definitely there, which surprised me; I thought I'd have to do it manually:

https://anaconda.org/bioconda/augur/files

@tsibley
Member

tsibley commented Feb 10, 2022

@corneliusroemer Manually: bioconda/bioconda-recipes@1432155

@victorlin
Member

Yeah, I caught the auto-bump PR bioconda/bioconda-recipes#32842 and told them to close it in favor of the manual PR.

Successfully merging this pull request may close these issues.

Inherited clade definitions
5 participants