Add phylogenetic #8
Merged
31 commits
53246bf
Move phylogenetic workflow to phylogenetic directory
j23414 bbb7e77
Add copy example data custom rules
j23414 4b3c822
Since lassa has S and L segments
j23414 cf59a92
Update the CI
j23414 ecb6aa3
Move rules for preparing sequences to its own smk file
j23414 1fd7d55
Move rules for constructing phylogeny to its own smk file
j23414 c3fa8f6
Move rules for annotating phylogeny to its own smk file
j23414 ee0135a
Move rule for exporting auspice json to its own smk file
j23414 c078718
Move config values to config file
j23414 05dcd7d
Update augur export v1 to v2
j23414 5bfd527
Move config to defaults to match pathogen-repo-guide
j23414 003ecfc
Add description statement
j23414 4d5aeec
Copy phylogenetic instructions from pathogen-repo-guide
j23414 d81791c
Download sequences and metadata from data.nextstrain.org
j23414 d7b5931
Pass curated GenBank data through the rest of pipeline
j23414 ee21b9f
Bypass duplicate reference strain detected
j23414 543de0b
Fixup: Add description statement
j23414 de8645d
Fixup example sequences to ID on accession
j23414 fa12fbd
Fixup AmbiguousRuleException
j23414 c5f87ae
Add rule to autogenerate colors
j23414 8ba2317
Display strain name on tree
j23414 2553ebc
Attribution
j23414 689800e
Add phylogenetic automation and deploy
j23414 f818c4b
Separate files into segment directories
j23414 e4d25fb
Update description to match https://nextstrain.org/lassa/s
j23414 ecd6ac9
Fixup: Update description to match https://nextstrain.org/lassa/s
j23414 3eb4a8d
Update .github/workflows/ingest-to-phylogenetic.yaml
j23414 7e177ea
ingest: Switch to lowercase segment names
j23414 072da67
phylogenetic: Switch to lowercase segment names
j23414 81d1cd1
Stage the phylogenetic build to get feedback from SME before making i…
j23414 7cde259
Since number of S and L segment sequences are both below 5k, include …
j23414
@@ -0,0 +1,102 @@
name: Ingest to phylogenetic

defaults:
  run:
    # This is the same as GitHub Action's `bash` keyword as of 20 June 2023:
    # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
    #
    # Completely spelling it out here so that GitHub can't change it out from under us
    # and we don't have to refer to the docs to know the expected behavior.
    shell: bash --noprofile --norc -eo pipefail {0}

on:
  schedule:
    # Note times are in UTC, which is 1 or 2 hours behind CET depending on daylight savings.
    #
    # Note the actual runs might be late.
    # Numerous people were confused about that, including me:
    # - https://github.community/t/scheduled-action-running-consistently-late/138025/11
    # - https://github.com/github/docs/issues/3059
    #
    # Note, '*' is a special character in YAML, so you have to quote this string.
    #
    # Docs:
    # - https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#schedule
    #
    # Tool that deciphers this particular format of crontab string:
    # - https://crontab.guru/
    #
    # Runs at 5pm UTC (1pm EDT/10am PDT) since curation by NCBI happens on the East Coast.
    # We were running into invalid zip archive errors at 9am PDT, so hoping an hour
    # delay will lower the error frequency
    - cron: '0 17 * * *'

  workflow_dispatch:
    inputs:
      ingest_image:
        description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")'
        required: false
      phylogenetic_image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false

jobs:
  ingest:
    permissions:
      id-token: write
    uses: ./.github/workflows/ingest.yaml
    secrets: inherit
    with:
      image: ${{ inputs.ingest_image }}

  # Check if ingest results include new data by checking for the cache
  # of the file with the results' Metadata.sha256sum (which should have been added within upload-to-s3)
  # GitHub will remove any cache entries that have not been accessed in over 7 days,
  # so if the workflow has not been run for over 7 days then it will trigger phylogenetic.
  check-new-data:
    needs: [ingest]
    runs-on: ubuntu-latest
    outputs:
      cache-hit: ${{ steps.check-cache.outputs.cache-hit }}
    steps:
      - name: Get sha256sum
        id: get-sha256sum
        env:
          AWS_DEFAULT_REGION: ${{ vars.AWS_DEFAULT_REGION }}
        run: |
          s3_urls=(
            "s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
            "s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
          )

          # Code below is modified from ingest/upload-to-s3
          # https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29

          no_hash=0000000000000000000000000000000000000000000000000000000000000000

          for s3_url in "${s3_urls[@]}"; do
            s3path="${s3_url#s3://}"
            bucket="${s3path%%/*}"
            key="${s3path#*/}"

            s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"
            echo "${s3_hash}" | tee -a ingest-output-sha256sum
          done

      - name: Check cache
        id: check-cache
        uses: actions/cache@v4
        with:
          path: ingest-output-sha256sum
          key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
          lookup-only: true

  phylogenetic:
    needs: [check-new-data]
    if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }}
    permissions:
      id-token: write
    uses: ./.github/workflows/phylogenetic.yaml
    secrets: inherit
    with:
      image: ${{ inputs.phylogenetic_image }}
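
The `Get sha256sum` step above relies on three Bash parameter expansions to split an S3 URL into bucket and key, and on a `||` fallback so that a failed lookup still yields a placeholder hash. Here is a minimal sketch of just those two pieces, with no AWS calls. `split_s3_url` and `hash_or_placeholder` are hypothetical helper names introduced for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the workflow.

# Placeholder of 64 zeros, standing in for the workflow's `no_hash`.
no_hash="$(printf '%064d' 0)"

# Print "<bucket> <key>" for an s3:// URL, using the same expansions as the workflow.
split_s3_url() {
  local s3path="${1#s3://}"     # strip the "s3://" scheme
  local bucket="${s3path%%/*}"  # keep everything before the first "/"
  local key="${s3path#*/}"      # keep everything after the first "/"
  printf '%s %s\n' "$bucket" "$key"
}

# Run the given command; if it fails, print the placeholder hash instead.
hash_or_placeholder() {
  "$@" 2>/dev/null || echo "$no_hash"
}

split_s3_url "s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
# prints: nextstrain-data files/workflows/lassa/metadata_all.tsv.zst

# `false` stands in for a lookup that fails, so the placeholder is printed.
hash_or_placeholder false
```

Because of the fallback, a missing object or failed request contributes a line of zeros rather than aborting the step, so the file that feeds `hashFiles` is always written.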
@@ -0,0 +1,107 @@
name: Phylogenetic

defaults:
  run:
    # This is the same as GitHub Action's `bash` keyword as of 20 June 2023:
    # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
    #
    # Completely spelling it out here so that GitHub can't change it out from under us
    # and we don't have to refer to the docs to know the expected behavior.
    shell: bash --noprofile --norc -eo pipefail {0}

on:
  workflow_call:
    inputs:
      image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false
        type: string

  workflow_dispatch:
    inputs:
      image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false
        type: string
      trial_name:
        description: |
          Trial name for deploying builds.
          If not set, builds will overwrite existing builds at s3://nextstrain-data/lassa*
          If set, builds will be deployed to s3://nextstrain-staging/lassa_trials_<trial_name>_*
        required: false
        type: string
      sequences_url:
        description: |
          URL for the sequences.fasta.zst file
          If not provided, will use default sequences_url from phylogenetic/defaults/config.yaml
        required: false
        type: string
      metadata_url:
        description: |
          URL for the metadata.tsv.zst file
          If not provided, will use default metadata_url from phylogenetic/defaults/config.yaml
        required: false
        type: string

jobs:
  set_config_overrides:
    runs-on: ubuntu-latest
    steps:
      - id: config
        name: Set config overrides
        env:
          TRIAL_NAME: ${{ inputs.trial_name }}
          SEQUENCES_URL: ${{ inputs.sequences_url }}
          METADATA_URL: ${{ inputs.metadata_url }}
        run: |
          config=""

          if [[ "$TRIAL_NAME" ]]; then
            config+=" deploy_url='s3://nextstrain-staging/lassa_trials_"$TRIAL_NAME"_'"
          fi

          if [[ "$SEQUENCES_URL" ]]; then
            config+=" sequences_url='"$SEQUENCES_URL"'"
          fi

          if [[ "$METADATA_URL" ]]; then
            config+=" metadata_url='"$METADATA_URL"'"
          fi

          if [[ $config ]]; then
            config="--config $config"
          fi

          echo "config=$config" >> "$GITHUB_OUTPUT"
    outputs:
      config_overrides: ${{ steps.config.outputs.config }}

  phylogenetic:
    needs: [set_config_overrides]
    permissions:
      id-token: write
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    secrets: inherit
    with:
      # Starting with the default docker runtime
      # We can migrate to AWS Batch when/if we need to for more resources or if
      # the job runs longer than the GH Action limit of 6 hours.
      runtime: docker
      env: |
        NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.image }}
        CONFIG_OVERRIDES: ${{ needs.set_config_overrides.outputs.config_overrides }}
      run: |
        nextstrain build \
          phylogenetic \
          deploy_all \
          --configfile build-configs/nextstrain-automation/config.yaml \
          $CONFIG_OVERRIDES
      # Specifying artifact name to differentiate ingest build outputs from
      # the phylogenetic build outputs
      artifact-name: phylogenetic-build-output
      artifact-paths: |
        phylogenetic/auspice/
        phylogenetic/results/
        phylogenetic/benchmarks/
        phylogenetic/logs/
        phylogenetic/.snakemake/log/
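
The `set_config_overrides` job above turns optional workflow inputs into a single `--config` string, emitting nothing when no input is set. Here is a minimal sketch of that assembly logic as a standalone function, assuming the same three inputs. `build_config_overrides` is a hypothetical name introduced for illustration, and the example URLs in the comments are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: `build_config_overrides` is a hypothetical name, not from the workflow.

# Print a `--config key='value' ...` string for whichever arguments are
# non-empty, or an empty line when none are set.
build_config_overrides() {
  local trial_name="$1" sequences_url="$2" metadata_url="$3"
  local config=""

  if [ -n "$trial_name" ]; then
    config="$config deploy_url='s3://nextstrain-staging/lassa_trials_${trial_name}_'"
  fi
  if [ -n "$sequences_url" ]; then
    config="$config sequences_url='${sequences_url}'"
  fi
  if [ -n "$metadata_url" ]; then
    config="$config metadata_url='${metadata_url}'"
  fi

  # Each fragment starts with a space, so prefixing "--config" needs no separator.
  if [ -n "$config" ]; then
    config="--config$config"
  fi

  printf '%s\n' "$config"
}

build_config_overrides "my-trial" "" ""
# prints: --config deploy_url='s3://nextstrain-staging/lassa_trials_my-trial_'
```

Leaving the result empty when nothing is set means the `nextstrain build` command falls back to the values in the config files without any extra flag on the command line.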
@@ -0,0 +1,4 @@
custom_rules:
  - build-configs/nextstrain-automation/deploy.smk

deploy_url: "s3://nextstrain-data"
15 changes: 15 additions & 0 deletions
phylogenetic/build-configs/nextstrain-automation/deploy.smk
@@ -0,0 +1,15 @@
"""
This part of the workflow handles automatic deployments of the lassa build.
Uploads the build defined as the default output of the workflow through
the `all` rule from Snakefile
"""

rule deploy_all:
    input: *rules.all.input
    output: touch("results/deploy_all.done")
    params:
        deploy_url = config["deploy_url"]
    shell:
        """
        nextstrain remote upload {params.deploy_url} {input}
        """
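
The `deploy_all` rule declares its only output with `touch(...)`, so the marker file `results/deploy_all.done` is created once the upload command has succeeded. Here is a rough shell sketch of that "run, then mark as done" pattern. `run_with_marker` is a hypothetical helper that imitates the effect and is not how Snakemake implements it, and `echo` stands in for the real `nextstrain remote upload` call.

```shell
#!/usr/bin/env bash
# Sketch only: `run_with_marker` is a hypothetical helper, not part of the PR.

# Run a command and create the marker file only if the command succeeds.
run_with_marker() {
  local marker="$1"; shift
  "$@" || return 1
  mkdir -p "$(dirname "$marker")"
  touch "$marker"
}

workdir="$(mktemp -d)"

# `echo` stands in for `nextstrain remote upload <deploy_url> <files...>`.
run_with_marker "$workdir/results/deploy_all.done" echo "upload ok"
ls "$workdir/results"
```

If the command fails, no marker is written, so a later run would attempt the deployment again.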
These URLs need to be updated based on the current upload config
Side question, should these check the L/S files since they are the files used by the phylogenetic workflow?
Good question! Considering that this same workflow in dengue only checks for the 'all' serotype, I believe this approach should be sufficient? Since the 'all', 'l', and 's' files are updated concurrently, they should equally trigger the phylogenetic workflow.
However, since there is no such thing as an 'all' tree for lassa (unless we concatenated segments), and if we later decide that the `all` dataset is not necessary for debugging, I could see using either 'l' or 's' instead, just in case.
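
If the check were later switched from the 'all' files to the per-segment files, the URL list in the `Get sha256sum` step could be generated rather than written out by hand. A rough sketch of that idea follows. `dataset_urls` is a hypothetical helper, and the `<prefix>/<dataset>/<file>` layout is an assumption made purely for illustration, since the real upload paths are what the first comment says still need to be confirmed.

```shell
#!/usr/bin/env bash
# Sketch only: `dataset_urls` is hypothetical, and the path layout
# "<prefix>/<dataset>/<file>" is an assumed placeholder, not the real upload config.

# Print one URL per dataset and file, given a prefix and a list of dataset names.
dataset_urls() {
  local prefix="$1"; shift
  local dataset file
  for dataset in "$@"; do
    for file in metadata.tsv.zst sequences.fasta.zst; do
      printf '%s/%s/%s\n' "$prefix" "$dataset" "$file"
    done
  done
}

# Lists four URLs: metadata and sequences for the 'l' and 's' segments.
dataset_urls "s3://nextstrain-data/files/workflows/lassa" l s
```

Every hash is appended to the same `ingest-output-sha256sum` file, so a change to any one of the listed files changes the cache key, whichever set of datasets is checked.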