Retry probabilistic subsampling after failure #676

Merged
merged 1 commit into master from retry-probabilistic-sampling on Feb 14, 2021

Conversation

huddlej (Contributor) commented on Feb 12, 2021

Description of proposed changes

Probabilistic subsampling can randomly fail to select any samples when the requested maximum number of samples is very small. To minimize the chances of hitting this edge case (which also happens to affect our Travis CI tests), this PR places the probabilistic subsampling inside a loop that checks for a non-empty list of selected samples and retries subsampling a fixed number of times before giving up (see the sketch at the end of this section).

With a maximum of 10 retries, subsampling never failed under the conditions we use in our CI test (10 groups, 5 maximum sequences) across 10,000,000 simulated runs. The resulting probability of failure is low enough that this change should resolve our flaky CI test and also prevent most users from ever encountering the same error in their own analyses.

Note that this change should not affect most people under most conditions; it will mostly make Augur development saner.
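For context, the retry logic amounts to something like the following sketch. This is illustrative only; the function and variable names are placeholders rather than the actual augur/filter.py code.

import numpy as np

def sample_group_counts_with_retries(sequences_per_group, n_groups,
                                     max_attempts=10, rng=None):
    """Draw per-group Poisson counts, retrying until at least one sample
    is selected or the attempts are exhausted.

    Illustrative sketch of the retry idea in this PR, not the actual
    augur/filter.py implementation.
    """
    rng = rng if rng is not None else np.random.default_rng()
    for _ in range(max_attempts):
        counts = rng.poisson(sequences_per_group, size=n_groups)
        if counts.sum() > 0:
            return counts
    # Give up after the fixed number of retries.
    raise ValueError(
        f"Probabilistic subsampling selected no samples after {max_attempts} attempts."
    )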

Related issue(s)

Fixes #674

Testing

I confirmed the selected maximum number of retry attempts with the following code, which recreates the effect of running augur filter 10,000,000 times to see whether subsampling ever fails under the conditions used by our CI test.

import numpy as np
r = np.random.default_rng()

# Calculated sequences per group for 10 groups and
# 5 maximum sequences.
spg = 0.4882902734375
n_groups = 10
max_retry_attempts = 10

# Repeat the subsampling logic many times, to replicate
# many Travis CI runs.
for j in range(10_000_000):
    # Try to sample a non-empty list for the given number of
    # groups and sequences per group, retrying at most a fixed
    # number of times.
    if not any((r.poisson(spg, size=n_groups) > 0).sum() > 0 for _ in range(max_retry_attempts)):
        print(f"Failed at {j}")
        break
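As a sanity check on the empirical result, the failure probability can also be computed analytically, assuming each group's count is an independent Poisson draw at the rate above:

import numpy as np

# Same parameters as in the simulation above.
spg = 0.4882902734375
n_groups = 10
max_retry_attempts = 10

# A single attempt fails only if every group draws zero sequences, which
# for independent Poisson draws has probability exp(-spg * n_groups).
p_attempt_fails = np.exp(-spg * n_groups)                    # ~7.6e-3
p_all_retries_fail = p_attempt_fails ** max_retry_attempts   # ~6e-22
print(p_attempt_fails, p_all_retries_fail)

At roughly 6 × 10⁻²², ten consecutive failures are effectively impossible, which matches the simulation never failing in 10,000,000 runs.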

huddlej added this to the Feature release 11.1.0 milestone on Feb 12, 2021
codecov bot commented on Feb 12, 2021

Codecov Report

Merging #676 (68cd65a) into master (3574708) will decrease coverage by 0.01%.
The diff coverage is 7.69%.


@@            Coverage Diff             @@
##           master     #676      +/-   ##
==========================================
- Coverage   30.35%   30.34%   -0.01%     
==========================================
  Files          40       40              
  Lines        5545     5549       +4     
  Branches     1348     1349       +1     
==========================================
+ Hits         1683     1684       +1     
- Misses       3802     3805       +3     
  Partials       60       60              
Impacted Files     Coverage Δ
augur/filter.py    47.04% <7.69%> (-0.26%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

huddlej requested a review from rneher on February 12, 2021 at 19:17
huddlej self-assigned this on Feb 12, 2021
rneher (Member) commented on Feb 13, 2021

This looks good to me. One could probably prevent this with a more elegant sampling scheme, but this will do the trick for most cases.
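For illustration only, one retry-free scheme along those lines could condition the draw on selecting at least one sample: the sum of independent Poisson counts is itself Poisson, so one can draw a zero-truncated total and then split it across groups multinomially. This is a sketch of one possibility, not necessarily what the reviewer had in mind and not code from augur:

import numpy as np

def sample_counts_with_at_least_one(spg, n_groups, rng=None):
    """Draw per-group Poisson counts conditioned on selecting at least
    one sample, with no retry loop.

    Relies on two standard facts: the sum of independent Poisson(spg)
    draws is Poisson(n_groups * spg), and given that total, the
    per-group counts are multinomial with equal probabilities.
    Illustrative sketch only.
    """
    rng = rng if rng is not None else np.random.default_rng()
    rate = spg * n_groups

    # Inverse-CDF draw from a zero-truncated Poisson for the total count:
    # draw u above P(total == 0), then walk the CDF until it reaches u.
    u = rng.uniform(np.exp(-rate), 1.0)
    total = 0
    pmf = np.exp(-rate)   # P(total == 0)
    cdf = pmf
    while cdf < u or total == 0:
        total += 1
        pmf *= rate / total
        cdf += pmf

    # Split the total uniformly at random across the groups.
    return rng.multinomial(total, np.full(n_groups, 1.0 / n_groups))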

huddlej merged commit f256227 into master on Feb 14, 2021
huddlej deleted the retry-probabilistic-sampling branch on February 14, 2021 at 01:43