Subsample has uncaught assertion in ncov #1598

corneliusroemer · 2024-08-23T14:54:49Z

The ncov workflow errored last night. The failure seems to be related to filter/subsample.

Here's the log: https://github.com/nextstrain/ncov/actions/runs/10521511438/job/29152346864#step:5:1

[batch] [2024-08-23T07:36:29+00:00]         Subsample all sequences by 'context_early' scheme for build 'south-america_1m' with the following parameters:
[batch] [2024-08-23T07:36:29+00:00]          - group by: --group-by country year month
[batch] [2024-08-23T07:36:29+00:00]          - sequences per group: 
[batch] [2024-08-23T07:36:29+00:00]          - subsample max sequences: --subsample-max-sequences 160
[batch] [2024-08-23T07:36:29+00:00]          - min-date: 
[batch] [2024-08-23T07:36:29+00:00]          - max-date: --max-date 1M
[batch] [2024-08-23T07:36:29+00:00]          - 
[batch] [2024-08-23T07:36:29+00:00]          - exclude: --exclude-where 'region=South America'
[batch] [2024-08-23T07:36:29+00:00]          - include: 
[batch] [2024-08-23T07:36:29+00:00]          - query: 
[batch] [2024-08-23T07:36:29+00:00]          - priority: 
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:29+00:00] Reason: Missing output files: results/south-america_1m/sample-context_early.txt; Input files updated by another job: results/gisaid_21L_metadata.tsv.zst
[batch] [2024-08-23T07:36:29+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt                          --max-date 1M             --exclude-where 'region=South America'                                                                 --group-by country year month                                       --subsample-max-sequences 160                          --output-strains results/south-america_1m/sample-context_early.txt 2>&1 | tee logs/subsample_south-america_1m_context_early.txt
[batch] [2024-08-23T07:36:29+00:00]         
[batch] [2024-08-23T07:36:48+00:00] Sampling at 1 per group.
[batch] [2024-08-23T07:36:55+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:55+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:55+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:55+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:55+00:00]     return _run(args)
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:55+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:55+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:55+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:55+00:00] AssertionError
[batch] [2024-08-23T07:36:55+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:55+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:55+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00] [Fri Aug 23 07:36:55 2024]
[batch] [2024-08-23T07:36:56+00:00] Error in rule subsample:
[batch] [2024-08-23T07:36:56+00:00]     jobid: 108
[batch] [2024-08-23T07:36:56+00:00]     input: results/gisaid_21L_metadata.tsv.zst, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, nextstrain_profiles/nextstrain-gisaid-21L/include.txt, defaults/exclude.txt
[batch] [2024-08-23T07:36:56+00:00]     output: results/global_2m/sample-north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]     log: logs/subsample_global_2m_north_america_recent.txt (check log file(s) for error details)
[batch] [2024-08-23T07:36:56+00:00]     conda-env: /nextstrain/build/.snakemake/conda/ef7f392b0ecf86741cd7c0bee42f4f0e_
[batch] [2024-08-23T07:36:56+00:00]     shell:
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         augur filter             --metadata results/gisaid_21L_metadata.tsv.zst             --include nextstrain_profiles/nextstrain-gisaid-21L/include.txt             --exclude defaults/exclude.txt             --min-date 2M                          --exclude-where 'region!=North America'                                                                 --group-by division week                                       --subsample-max-sequences 400                          --output-strains results/global_2m/sample-north_america_recent.txt 2>&1 | tee logs/subsample_global_2m_north_america_recent.txt
[batch] [2024-08-23T07:36:56+00:00]         
[batch] [2024-08-23T07:36:56+00:00]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] [2024-08-23T07:36:56+00:00] Logfile logs/subsample_global_2m_north_america_recent.txt:
[batch] [2024-08-23T07:36:56+00:00] ================================================================================
[batch] [2024-08-23T07:36:56+00:00] WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
[batch] [2024-08-23T07:36:56+00:00] Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
[batch] [2024-08-23T07:36:56+00:00] Traceback (most recent call last):
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/__init__.py", line 70, in run
[batch] [2024-08-23T07:36:56+00:00]     return args.__command__.run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/__init__.py", line 135, in run
[batch] [2024-08-23T07:36:56+00:00]     return _run(args)
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/_run.py", line 295, in run
[batch] [2024-08-23T07:36:56+00:00]     group_sizes = get_probabilistic_group_sizes(
[batch] [2024-08-23T07:36:56+00:00]   File "/nextstrain/augur/augur/filter/subsample.py", line 285, in get_probabilistic_group_sizes
[batch] [2024-08-23T07:36:56+00:00]     assert target_group_size < 1.0
[batch] [2024-08-23T07:36:56+00:00] AssertionError
[batch] [2024-08-23T07:36:56+00:00] An error occurred (see above) that has not been properly handled by Augur.
[batch] [2024-08-23T07:36:56+00:00] To report this, please open a new issue including the original command and the error above:
[batch] [2024-08-23T07:36:56+00:00]     <https://github.com/nextstrain/augur/issues/new/choose>
[batch] [2024-08-23T07:36:56+00:00]


victorlin · 2024-08-23T16:13:50Z

Given the timing, my initial suspicion was yesterday's release of #1454. That PR moved this code around, and I thought the conditions might have been broken in the change to augur/filter/_run.py.

On closer inspection, it's unrelated, and I believe the timing is just a coincidence. The error is an approximation issue that can be fixed by addressing #1588:

400 / 406
# 0.9852 <- "exact" target_group_size is less than 1 which will pass the assertion

augur.filter.subsample._calculate_fractional_sequences_per_group(400, [1,]*406)
# 1.0254 <- "approximated" target_group_size is greater than 1 which fails the assertion
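
To make the mismatch concrete, here is a minimal sketch of how a tolerance-based search can overshoot the exact value. Both functions are hypothetical illustrations, not Augur's implementation: `bisected_sequences_per_group` assumes a bisection that stops once its bounds are within 10% of each other, and `exact_sequences_per_group` assumes there are more groups than requested sequences and no group is empty.

```python
def exact_sequences_per_group(max_sequences: int, n_groups: int) -> float:
    """Exact fractional target, assuming more groups than requested
    sequences and at least one sequence in every group."""
    return max_sequences / n_groups


def bisected_sequences_per_group(max_sequences, group_sizes, rel_tol=0.1):
    """Hypothetical bisection that stops once hi and lo are within rel_tol."""
    lo, hi = 1e-5, float(max_sequences)
    while hi / lo > 1 + rel_tol:
        mid = (lo + hi) / 2
        # Expected total if each group contributes at most `mid` sequences.
        total = sum(min(mid, size) for size in group_sizes)
        if total <= max_sequences:
            lo = mid
        else:
            hi = mid
    # The midpoint of the final bracket can sit above the true answer.
    return (lo + hi) / 2


exact = exact_sequences_per_group(400, 406)
approx = bisected_sequences_per_group(400, [1] * 406)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Under these assumptions the exact value stays below 1 while the approximated one ends up above it, which is the condition that trips `assert target_group_size < 1.0`.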

corneliusroemer · 2024-08-23T17:18:39Z

Interesting, thanks for the quick investigation!

victorlin · 2024-08-26T17:21:44Z

Yesterday's scheduled run was successful. I downloaded the relevant log file for the failing run and for the successful runs on either side of it:

nextstrain build --aws-batch --attach <batch job id> --download 'logs/subsample_global_2m_north_america_recent.txt' ~/tmp

and confirmed that this is due to the approximation issue:

✅ 2024-08-20:

WARNING: Asked to provide at most 400 sequences, but there are 412 groups.
Sampling probabilistically at 0.9522 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.

❌ 2024-08-23:

WARNING: Asked to provide at most 400 sequences, but there are 406 groups.
Sampling probabilistically at 1.0254 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.

✅ 2024-08-25:

WARNING: Asked to provide at most 400 sequences, but there are 432 groups.
Sampling probabilistically at 0.9033 sequences per group, meaning it is possible to have more than the requested maximum of 400 sequences after filtering.
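
As a quick cross-check, the sketch below compares the logged rate from each of the three runs with the exact ratio of requested sequences to groups. The `runs` table is copied from the warnings above. The `check_run` helper is hypothetical and only mirrors the `target_group_size < 1.0` condition from the traceback.

```python
MAX_SEQUENCES = 400

# (run date, number of groups, logged approximate rate) from the logs above.
runs = [
    ("2024-08-20", 412, 0.9522),
    ("2024-08-23", 406, 1.0254),
    ("2024-08-25", 432, 0.9033),
]


def check_run(n_groups: int, logged_rate: float):
    """Return the exact rate and whether each rate satisfies `< 1.0`."""
    exact_rate = MAX_SEQUENCES / n_groups
    return exact_rate, logged_rate < 1.0, exact_rate < 1.0


results = {}
for date, n_groups, logged_rate in runs:
    exact_rate, logged_ok, exact_ok = check_run(n_groups, logged_rate)
    results[date] = (exact_rate, logged_ok, exact_ok)
    print(f"{date}: logged={logged_rate:.4f} ok={logged_ok}, "
          f"exact={exact_rate:.4f} ok={exact_ok}")
```

Only the 2024-08-23 run has a logged rate that fails the check, and the exact ratio would have passed in all three runs.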

corneliusroemer added the "bug" label (Something isn't working) Aug 23, 2024

victorlin self-assigned this Aug 23, 2024

corneliusroemer mentioned this issue Aug 23, 2024

Use exact fractional sequences per group #1599

Merged


victorlin mentioned this issue Aug 23, 2024

Simplify probabilistic sampling calculation #1588

Closed

victorlin closed this as completed in #1599 Sep 3, 2024
