Exclude future dates #401
@trvrb This arose from a suggestion of mine in nextstrain/ncov-ingest#33. The goal is to reduce manual annotation curation while making what's excluded very transparent in the same way that config/exclude.txt is. I agree that the build will need to work for everyone, of course. The fetch from S3 can either be optional or we could even make that object public. I also see ncov-ingest as ideally ending up incorporated into this repo as part of the build (esp. now that it's public).
85148ce
to
5a3b0f6
Compare
@trvrb does having an optional fetch or a publicly available S3 bucket where the automatically generated |
That would be fantastic 😄
@kairstenfay Would you rebase this onto the latest master while you're making the tweaks below?
rules/builds.smk
Outdated
shell:
    """
    aws s3 cp s3://nextstrain-ncov-private/exclude.txt - >> {output.exclude:q}
    """
Instead of modifying the hardcoded config/exclude.txt in place, this rule would be better if it took config["files"]["exclude"] as its input and produced a separate file as its output, something like results/exclude.txt maybe. Then the filter rule immediately following would be adjusted to take rules.combine_exclude_files.output.exclude as its input.
This avoids modifying a git-tracked file, which will affect the git repo's dirty/clean status, and avoids hardcoding the input base excludes file.
The command should probably also make sure when it appends the remote excludes file that there's a trailing newline in the base file. I've noticed that not everyone's editor ensures this and some files are missing the trailing newline (which breaks a lot of assumptions that Unix tools make).
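The two suggestions above (write to a separate output file, and guard against a missing trailing newline in the base file) can be sketched in Python. This is only an illustration under assumptions: the function name `combine_exclude_files` and the in-memory `remote_text` argument are hypothetical stand-ins for whatever the rule actually does with the S3 download.

```python
import tempfile
from pathlib import Path


def combine_exclude_files(base: Path, remote_text: str, output: Path) -> None:
    """Write the base excludes followed by the remote excludes to `output`.

    The base file is only read, never modified, so a git-tracked
    config/exclude.txt stays clean.
    """
    base_text = base.read_text()
    # Some editors omit the final newline; add it so the first remote
    # entry does not get glued onto the last base entry.
    if base_text and not base_text.endswith("\n"):
        base_text += "\n"
    if remote_text and not remote_text.endswith("\n"):
        remote_text += "\n"
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(base_text + remote_text)


# Usage with throwaway files (paths mirror the ones discussed above).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "config" / "exclude.txt"
    base.parent.mkdir()
    base.write_text("strain_a\nstrain_b")  # note: no trailing newline
    out = Path(tmp) / "results" / "exclude.txt"
    combine_exclude_files(base, "strain_c\n", out)
    combined = out.read_text()
    base_after = base.read_text()
```

Without the newline guard, the combined file would contain a single bogus entry `strain_bstrain_c`, which is the failure mode described above.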
Updated in the latest commit
edit: Oops, the snakemake input/output changes only. Still need to ensure trailing newline.
5a3b0f6
to
147c323
Compare
Ah, and as you noted in the related ncov-ingest PR, we'll also need to make the combining rule optional.
147c323
to
890f500
Compare
@tsibley Is the download rule no longer optional? I recall it was at one point but |
rules/nextstrain_exports.smk
Outdated
shell:
"""
nextstrain deploy {params.s3_staging_url:q} {input:q}
if config["connect_to_s3"]:
Perhaps if not config["connect_to_s3"], we don't want to do any of the deployment and error handling setup above this line.
@tsibley in my latest commit, I propose a solution to making S3 downloads and deploys optional. It required creating a new config file, |
0077059
to
a55d2a5
Compare
I just rebased onto master and pushed a new commit that excludes future dates using I updated the PR title and initial message to reflect these changes. I did not drop the |
@kairstenfay Would you mind pulling out the |
a55d2a5
to
83ace8e
Compare
That makes perfect sense. I created #437 that contains the commits now dropped from this PR.
rules/builds.smk
Outdated
@@ -35,6 +37,7 @@ rule filter:
--sequences {input.sequences} \
--metadata {input.metadata} \
--include {input.include} \
--max-date {input.date} \
Something went wrong with the rebase, as the definition of input.date disappeared for this rule. In any case, I think it's better to make date a param instead of an input.
Ah yes I made an error in rebasing. params.date is defined now.
I'm curious, what is the distinction between input and params in Snakemake? To me it feels arbitrary because we use entries from config in both. I tried looking through the Snakemake docs but couldn't find anything on this.
After talking with @joverlee521, my understanding is that input is generally for files and params is for other values.
Yeah, I concur with @joverlee521. Snakemake describes the motivation for params as well in its docs: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules
83ace8e
to
0a38a36
Compare
Hmm... augur filter with |
I get this error, but only when I include
Digging through the augur filter code it looks like |
I get the same error. It works when I hardcode the max-date to |
I believe this may only ever have been intended to be used for years... It might be worth peeking at the code to see if this is the case. Sorry if so and I didn't remember this earlier! Otherwise, I think there are some functions in augur (or TreeTime) that will convert to a decimal date, which could maybe be used to convert today into a decimal
Oof. Dates as floats are pretty common in Augur-land (and have caused consternation in the past). I see two options, which aren't mutually exclusive:
from treetime.utils import numeric_date
…
date: numeric_date(date.today())
should do the trick for the first option.
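For readers without TreeTime at hand, the first option (turning today's date into a decimal year that can be passed as `--max-date`) can be sketched with the standard library alone. This is an assumption-laden illustration, not TreeTime's implementation: `decimal_date` and `is_future` are hypothetical names, and placing each date at the midpoint of its day is one reasonable convention that may differ slightly from what `numeric_date` returns.

```python
from calendar import isleap
from datetime import date


def decimal_date(d: date) -> float:
    """Convert a calendar date to a decimal year, e.g. 2020.25.

    Assumption: the date is placed at the middle of its day, and the
    year length accounts for leap years.
    """
    days_in_year = 366 if isleap(d.year) else 365
    day_of_year = d.timetuple().tm_yday  # 1-based day number
    return d.year + (day_of_year - 0.5) / days_in_year


def is_future(sample: date, today: date) -> bool:
    """True if `sample` falls strictly after `today` on the decimal scale."""
    return decimal_date(sample) > decimal_date(today)


# A value like this is what would be handed to the filter step as a param.
max_date = decimal_date(date.today())
```

Because the conversion is monotonic in the calendar date, comparing decimal values gives the same ordering as comparing the dates themselves, which is all a maximum-date cutoff needs.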
Thank you, and no worries at all... I swear |
Thanks for laying out these options, Tom. I'm leaning towards extending augur in a backwards compatible way, because it would be nice for augur filter to accept ISO 8601 dates which many users (like myself) might expect. |
Filter out sequences with dates set in the future using augur filter's `--max-date` option. Note that `--max-date` should be a float date.
0a38a36
to
875a9cf
Compare
Ok, I'll mark option 2 as "aspirational" for now. I filed an issue at nextstrain/augur/issues/567. As @tsibley mentioned, the two solutions he proposed are not mutually exclusive. So, the easiest thing to do for now is convert today's date to a numeric date with |
Thank you guys, and thanks @kairstenfay for making an issue in |
Exclude future dates with augur filter --max-date.
This PR previously concatenated an automatically generated exclude.txt file created by ncov-ingest with the manually curated one in this repo, but now those concatenation steps are no longer necessary. Down the road, we may want to incorporate automatic exclusions, so don't drop the commits here.
Depends on nextstrain/ncov-ingest#43
Related issue(s)
nextstrain/ncov-ingest#33