Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude future dates #401

Merged
merged 1 commit into from
Jun 17, 2020
Merged

Exclude future dates #401

merged 1 commit into from
Jun 17, 2020

Conversation

kairstenfay
Copy link
Contributor

@kairstenfay kairstenfay commented May 12, 2020

Exclude future dates with augur filter --max-date.

This PR previously concatenated an automatically generated exclude.txt file created by ncov-ingest with the manually curated one in this repo, but now those concatenation steps are no longer necessary. Down the road, we may want to incorporate automatic exclusions, so don't drop the commits here.

Depends on nextstrain/ncov-ingest#43

Related issue(s)

nextstrain/ncov-ingest#33

@kairstenfay kairstenfay requested a review from tsibley May 12, 2020 03:56
@kairstenfay
Copy link
Contributor Author

@tsibley
Copy link
Member

tsibley commented May 12, 2020

@trvrb This arose from a suggestion of mine in nextstrain/ncov-ingest#33. The goal is to reduce manual annotation curation while making what's excluded very transparent in the same way that config/exclude.txt is. I agree that the build will need to work for everyone, of course. The fetch from s3 can either be optional or we could even make that object public. I also see ncov-ingest as ideally ending up incorporated into this repo as part of the build (esp. now that it's public).

@kairstenfay kairstenfay force-pushed the automatic-exclusions branch from 85148ce to 5a3b0f6 Compare May 12, 2020 17:02
@kairstenfay
Copy link
Contributor Author

@trvrb does having an optional fetch or a publicly available S3 bucket where the automatically generated exclude.txt is stored seem like workable solutions here?

@kairstenfay
Copy link
Contributor Author

I also see ncov-ingest as ideally ending up incorporated into this repo as part of the build (esp. now that it's public).

That would be fantastic 😄

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kairstenfay Would you rebase this onto the latest master while you're making the tweaks below?

rules/builds.smk Outdated
shell:
"""
aws s3 cp s3://nextstrain-ncov-private/exclude.txt - >> {output.exclude:q}
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of modifying the hardcoded config/exclude.txt in place, this rule would be better if it took config["files"]["exclude"] as its input and produced a separate file as its output, something like results/exclude.txt maybe. Then the filter rule immediately following would be adjusted to take rules.combine_exclude_files.output.exclude as its input.

This avoids modifying a git-tracked file, which will affect the git repo's dirty/clean status, and avoids hardcoding the input base excludes file.

The command should probably also make sure when it appends the remote excludes file that there's a trailing newline in the base file. I've noticed that not everyone's editor ensures this and some files are missing the trailing newline (which breaks a lot of assumptions that Unix tools make).

Copy link
Contributor Author

@kairstenfay kairstenfay May 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in the latest commit

edit: Oops, the snakemake input/output changes only. Still need to ensure trailing newline.

@tsibley
Copy link
Member

tsibley commented May 27, 2020

Ah, and as you noted in the related ncov-ingest PR, we'll also need to make the combining rule optional.

@kairstenfay kairstenfay force-pushed the automatic-exclusions branch from 147c323 to 890f500 Compare May 27, 2020 23:21
@kairstenfay
Copy link
Contributor Author

kairstenfay commented May 28, 2020

Ah, and as you noted in the related ncov-ingest PR, we'll also need to make the combining rule optional.

@tsibley Is the download rule no longer optional? I recall it was at one point but download fails, breaking the Snakemake build, if I don't provide AWS credentials.

shell:
"""
nextstrain deploy {params.s3_staging_url:q} {input:q}
if config["connect_to_s3"]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps if not config["connect_to_s3"], we don't want to do any of the deployment and error handling setup above this line.

@kairstenfay
Copy link
Contributor Author

@tsibley in my latest commit, I propose a solution to making S3 downloads and deploys optional. It required creating a new config file, profiles/nextstrain/s3.yaml, in order to set the boolean connect_to_s3 variable to a different value in the default and nextstrain profiles.

@kairstenfay kairstenfay requested a review from tsibley May 28, 2020 23:38
@kairstenfay kairstenfay requested a review from joverlee521 June 5, 2020 20:40
@kairstenfay kairstenfay force-pushed the automatic-exclusions branch from 0077059 to a55d2a5 Compare June 6, 2020 17:28
@kairstenfay kairstenfay changed the title Concatenate automatic exclusions with exclude.txt Exclude future dates Jun 6, 2020
@kairstenfay
Copy link
Contributor Author

kairstenfay commented Jun 6, 2020

I just rebased onto master and pushed a new commit that excludes future dates using augur filter --max-date as suggested by @joverlee521 in nextstrain/ncov-ingest#43.

I updated the PR title and initial message to reflect these changes. I did not drop the exclude.txt concatenation commit from this PR, because down the road, we may have a reason to produce an automatically generated exclude.txt file (e.g. see nextstrain/ncov-ingest#45).

@tsibley
Copy link
Member

tsibley commented Jun 16, 2020

@kairstenfay Would you mind pulling out the --max-date commit and merging just that for now? I do think we should keep the other commits around in case of future work but would prefer to leave them in an open PR until we need them. Generally, I find that minimizing unused code in the main branch makes a project easier to work with and understand.

@kairstenfay
Copy link
Contributor Author

@kairstenfay Would you mind pulling out the --max-date commit and merging just that for now? I do think we should keep the other commits around in case of future work but would prefer to leave them in an open PR until we need them. Generally, I find that minimizing unused code in the main branch makes a project easier to work with and understand.

That makes perfect sense. I created #437 that contains the commits now dropped from this PR.

rules/builds.smk Outdated
@@ -35,6 +37,7 @@ rule filter:
--sequences {input.sequences} \
--metadata {input.metadata} \
--include {input.include} \
--max-date {input.date} \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something went wrong with the rebase, as the definition of input.date disappeared for this rule. In any case, I think better to make date a param instead of an input.

Copy link
Contributor Author

@kairstenfay kairstenfay Jun 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I made an error in rebasing. params.date is defined now.

I'm curious, what is the distinction between input and params in Snakemake? To me it feels arbitrary because we use entries from config in both. I tried looking through the Snakemake docs but couldn't find anything on this.

After talking with @joverlee521 , my understanding is it that input generally is for files and params is for other values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I concur with @joverlee521. Snakemake describes the motivation for params as well in its docs: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules

@kairstenfay
Copy link
Contributor Author

Hmm... augur filter with --max-date is no longer working for me locally when I run snakemake. I wonder if you all are experiencing the same?

@kairstenfay
Copy link
Contributor Author

kairstenfay commented Jun 17, 2020

I get this error, but only when I include --max-date:

Error in rule filter:
    jobid: 137
    output: results/filtered.fasta
    log: logs/filtered.txt (check log file(s) for error message)
    shell:
        
        augur filter             --sequences data/sequences.fasta             --metadata data/metadata.tsv             --include config/include.txt             --max-date 2020-06-16             --exclude config/exclude.txt             --exclude-where division='USA' date='2020' date='2020-01-XX' date='2020-02-XX' date='2020-03-XX' date='2020-04-XX' date='2020-05-XX' date='2020-06-XX' date='2020-01' date='2020-02' date='2020-03' date='2020-04' date='2020-05' date='2020-06'            --min-length 27000             --output results/filtered.fasta 2>&1 | tee logs/filtered.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Digging through the augur filter code it looks like --max-date should be a float?

@joverlee521
Copy link
Contributor

Digging through the augur filter code it looks like --max-date should be a float?

I get the same error. It works when I hardcode the max-date to 2020.06. Why is this expected to be a float???

@emmahodcroft
Copy link
Member

I believe this may have only ever have been intended to be used for years.... Might be worth peeking at the code to see if this is the case. Sorry if this is the case and I didn't remember this earlier...! Otherwise, there are some functions in augur that will convert to decimal date, I think (or Treetime), which could maybe be used to convert today into a decimal

@tsibley
Copy link
Member

tsibley commented Jun 17, 2020

Oof. Dates as floats are pretty common in Augur-land (and have caused consternation in the past). I see two options, which aren't mututally exclusive:

  1. Convert date.today() to Augur's style of floating point date in the Snakefile before passing to --max-date.

  2. Extend Augur's --min-date and --max-date to also accept a YYYY-MM-DD string. Internally, it would still likely convert to a floating point date for the comparison, but as Emma points out, functions exist to do that already. This should be backwards compatible and could be added relatively easily.

@tsibley
Copy link
Member

tsibley commented Jun 17, 2020

from treetime.utils import numeric_datedate: numeric_date(date.today())

should do the trick for the first option.

@kairstenfay
Copy link
Contributor Author

I believe this may have only ever have been intended to be used for years.... Might be worth peeking at the code to see if this is the case. Sorry if this is the case and I didn't remember this earlier...! Otherwise, there are some functions in augur that will convert to decimal date, I think (or Treetime), which could maybe be used to convert today into a decimal

Thank you, and no worries at all... I swear --max-date used to work for me locally, but I could be misremembering.

@kairstenfay
Copy link
Contributor Author

Oof. Dates as floats are pretty common in Augur-land (and have caused consternation in the past). I see two options, which aren't mututally exclusive:

1. Convert `date.today()` to Augur's style of floating point date in the Snakefile before passing to `--max-date`.

2. Extend Augur's `--min-date` and `--max-date` to also accept a `YYYY-MM-DD` string. Internally, it would still likely convert to a floating point date for the comparison, but as Emma points out, functions exist to do that already. This should be backwards compatible and could be added relatively easily.

Thanks for laying out these options, Tom. I'm leaning towards extending augur in a backwards compatible way, because it would be nice for augur filter to accept ISO 8601 dates which many users (like myself) might expect.

Filter out sequences with dates set in the future using augur filter's
`--max-date` option. Note that `--max-date` should be a float date.
@kairstenfay kairstenfay force-pushed the automatic-exclusions branch from 0a38a36 to 875a9cf Compare June 17, 2020 17:40
@kairstenfay
Copy link
Contributor Author

Thanks for laying out these options, Tom. I'm leaning towards extending augur in a backwards compatible way, because it would be nice for augur filter to accept ISO 8601 dates which many users (like myself) might expect.

Ok, I'll mark option 2 as "aspirational" for now. I filed an issue at nextstrain/augur/issues/567.

As @tsibley mentioned, the two solutions he proposed are not mutually exclusive. So, the easiest thing to do for now is convert today's date to a numeric date with treetime.utils. I implemented the first solution Tom proposed and fixed it up into the single commit here.

@kairstenfay kairstenfay requested a review from tsibley June 17, 2020 17:42
@kairstenfay kairstenfay merged commit 196cdef into master Jun 17, 2020
@kairstenfay kairstenfay deleted the automatic-exclusions branch June 17, 2020 18:50
@emmahodcroft
Copy link
Member

Thank you guys, and thanks @kairstenfay for making an issue in augur! I think this would be great to expand augur filter to - but yes, probably easiest not to hold up this PR in the meantime :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants