Subsample has uncaught assertion in ncov #1598

Closed
corneliusroemer opened this issue Aug 23, 2024 · 3 comments · Fixed by #1599

@corneliusroemer
Member

The ncov workflow errored last night. The failure seems to be related to filter/subsample.

Here's the log: https://github.com/nextstrain/ncov/actions/runs/10521511438/job/29152346864#step:5:1

[batch] [2024-08-23T07:36:29+00:00]         Subsample all sequences by 'context_early' scheme for build 'south-america_1m' with the following parameters:
[batch] [2024-08-23T07:36:29+00:00]          - group by: --group-by country year month
[batch] [2024-08-23T07:36:29+00:00]          - sequences per group: 
[batch] [2024-08-23T07:36:29+00:00]          - subsample max sequences: --subsample-max-sequences 160
[batch] [2024-08-23T07:36:29+00:00]          - min-date: 
[batch] [2024-08-23T07:36:29+00:00]          - max-date: --max-date 1M
[batch] [2024-08-23T07:36:29+00:00]          - 
[batch] [2024-08-23T07:36:29+00:00]          - exclude: --exclude-where 'region=South America'
[batch] [2024-08-23T07:36:29+00:00]          - include: 
[batch] [2024-08-23T07:36:29+00:00]          - query: 
[batch] [2024-08-23T07:36:29+00:00]          - priority: 
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:29+00:00] Reason: Missing output files: results/south-america_1m/sample-context_early.txt; Input files updated by another job: results/gisaid_21L_metadata.tsv.zst
[batch] [2024-08-23T07:36:29+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt                          --max-date 1M             --exclude-where 'region=South America'                                                                 --group-by country year month                                       --subsample-max-sequences 160                          --output-strains results/south-america_1m/sample-context_early.txt 2>&1 | tee logs/subsample_south-america_1m_context_early.txt
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:48+00:00] Sampling at 1 per group.
[batch] [2024-08-23T07:36:55+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:55+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:55+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:55+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:55+00:00]     return _run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:55+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:55+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:55+00:00] AssertionError
[batch] [2024-08-23T07:36:55+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:55+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:55+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00] [Fri Aug 23 07:36:55 2024]
[batch] [2024-08-23T07:36:56+00:00] Error in rule subsample:
[batch] [2024-08-23T07:36:56+00:00]     jobid: 108
[batch] [2024-08-23T07:36:56+00:00]     input: results/gisaid_21L_metadata.tsv.zst, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, defaults/exclude.txt
[batch] [2024-08-23T07:36:56+00:00]     output: results/global_2m/sample-north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]     log: logs/subsample_global_2m_north_america_recent.txt (check log file(s) for error details)
[batch] [2024-08-23T07:36:56+00:00]     conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-08-23T07:36:56+00:00]     shell:
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt             --min-date 2M                          --exclude-where 'region!=North America'                                                                 --group-by division week                                       --subsample-max-sequences 400                          --output-strains results/global_2m/sample-north_america_recent.txt 2>&1 | tee logs/subsample_global_2m_north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-08-23T07:36:56+00:00] Logfile logs/subsample_global_2m_north_america_recent.txt:
[batch] [2024-08-23T07:36:56+00:00] ================================================================================
[batch] [2024-08-23T07:36:56+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:56+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:56+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:56+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:56+00:00]     return _run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:56+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:56+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:56+00:00] AssertionError
[batch] [2024-08-23T07:36:56+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:56+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:56+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00] 
@corneliusroemer corneliusroemer added the bug Something isn't working label Aug 23, 2024
@victorlin victorlin self-assigned this Aug 23, 2024
@victorlin
Member

Given the timing, release of #1454 yesterday was my initial suspicion. That PR moved this code around and I thought maybe the conditions got messed up in the change to augur/filter/_run.py.

A deeper inspection shows that it's unrelated, and I believe the timing is just a coincidence. The error is an approximation issue that can be fixed by addressing #1588:

from augur.filter.subsample import _calculate_fractional_sequences_per_group

400 / 406
# 0.9852 <- "exact" target_group_size is less than 1, which passes the assertion

_calculate_fractional_sequences_per_group(400, [1] * 406)
# 1.0254 <- "approximated" target_group_size is greater than 1, which fails the assertion
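
The comparison above can be reproduced without Augur installed. This is only a minimal sketch, not Augur's implementation: `exact_target_group_size` and `check_target_group_size` are hypothetical helpers, and the guard simply mirrors the `assert target_group_size < 1.0` condition from the traceback with a readable message.

```python
def exact_target_group_size(max_sequences: int, n_groups: int) -> float:
    """Exact fractional sequences per group (hypothetical helper).

    Assumes every group holds at least one sequence and that there are
    more groups than requested sequences, so the result is below 1.
    """
    if n_groups <= max_sequences:
        raise ValueError("expected more groups than requested sequences")
    return max_sequences / n_groups


def check_target_group_size(target_group_size: float) -> None:
    """Hypothetical guard: same condition as the failing assertion."""
    if not target_group_size < 1.0:
        raise ValueError(
            f"target_group_size={target_group_size} is not below 1.0; "
            "probabilistic sampling expects a fractional value"
        )


exact = exact_target_group_size(400, 406)
check_target_group_size(exact)  # passes: the exact ratio is below 1

try:
    check_target_group_size(1.0254)  # the approximated value from the log
except ValueError as error:
    print(error)
```

Raising a descriptive error here is a design choice for the sketch only; it does not remove the underlying approximation problem tracked in #1588.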

@corneliusroemer
Member Author

Interesting, thanks for the quick investigation!

@victorlin
Member

Yesterday's scheduled run was successful. I downloaded the relevant log file for the failing run and for the successful runs on either side of it

nextstrain build --aws-batch --attach <batch job id> --download 'logs/subsample_global_2m_north_america_recent.txt' ~/tmp

and confirmed that this is due to the approximation issue:

  • 2024-08-20:

    WARNING: Asked to provide at most 400 sequences, but there are 412 groups.
    Sampling probabilistically at 0.9522 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
    
  • 2024-08-23:

    WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
    Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
    
  • 2024-08-25:

    WARNING: Asked to provide at most 400 sequences, but there are 432 groups.
    Sampling probabilistically at 0.9033 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
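
The three log excerpts above can be compared against the exact ratio with a short script. This is a rough sketch under stated assumptions: the regular expressions are written against the warning text quoted in this thread, `compare` is a hypothetical helper, and "exact" means `max_sequences / n_groups` as in the earlier comment.

```python
import re

# Warning text copied from the three downloaded log files, keyed by run date.
LOG_EXCERPTS = {
    "2024-08-20": (
        "WARNING: Asked to provide at most 400 sequences, but there are 412 groups.\n"
        "Sampling probabilistically at 0.9522 sequences per group"
    ),
    "2024-08-23": (
        "WARNING: Asked to provide at most 400 sequences, but there are 406 groups.\n"
        "Sampling probabilistically at 1.0254 sequences per group"
    ),
    "2024-08-25": (
        "WARNING: Asked to provide at most 400 sequences, but there are 432 groups.\n"
        "Sampling probabilistically at 0.9033 sequences per group"
    ),
}

COUNTS = re.compile(r"at most (\d+) sequences, but there are (\d+) groups")
RATE = re.compile(r"Sampling probabilistically at ([\d.]+) sequences per group")


def compare(log_text: str) -> tuple[float, float]:
    """Return (logged approximate rate, exact rate) for one log excerpt."""
    max_sequences, n_groups = map(int, COUNTS.search(log_text).groups())
    logged = float(RATE.search(log_text).group(1))
    return logged, max_sequences / n_groups


for run_date, text in LOG_EXCERPTS.items():
    logged, exact = compare(text)
    print(
        f"{run_date}: logged={logged:.4f} exact={exact:.4f} "
        f"fails_assertion={logged >= 1.0}"
    )
```

In all three runs the exact ratio stays below 1; only the 2024-08-23 run has a logged approximation at or above 1, which matches the single failure.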
    
