filter: Reduce over-sampling in partial months with --group-by month
#960
Comments
I don't think
This seems right to me. It is fairly straightforward to enable
I definitely take your point on
for current Europe-focused ncov builds. With a 6-month focus we have 6 months x 46 countries = 276 categories. If this were days, we'd have 180 days x 46 countries = 8280 categories. I believe (but could be confused) that by randomly picking among the 8280 we'd be biasing towards temporal diversity and away from geographic diversity relative to the 276-category scenario. I.e., with ~3000 tips in the 276-category scenario you'd have ~11 per country and ~2 per month pretty systematically. But in the 8280-category scenario, I'd think that stochastically you might have different counts per country, as each category would be picked ~1/3 of the time. (I might be thinking about this wrong; I feel like I'd want to test to confirm.)
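One rough way to test the intuition above is a small simulation. This is only a sketch under stated assumptions, not how augur filter works internally: every (country, period) category is assumed to have unlimited sequences available, each category gets the same base quota, and any remainder goes to categories picked at random without replacement. The function name `per_country_counts` is made up for illustration.

```python
import random
from collections import Counter


def per_country_counts(n_countries, n_periods, n_tips, seed=0):
    """Spread n_tips over (country, period) categories; return totals per country.

    Assumption (not augur's actual algorithm): each category receives
    n_tips // n_categories tips, and the remaining tips go one each to
    categories drawn at random without replacement.
    """
    rng = random.Random(seed)
    categories = [(c, p) for c in range(n_countries) for p in range(n_periods)]
    base, extra = divmod(n_tips, len(categories))

    counts = Counter()
    for country, _period in categories:
        counts[country] += base
    for country, _period in rng.sample(categories, extra):
        counts[country] += 1
    return counts


if __name__ == "__main__":
    scenarios = {"6 months": 6, "180 days": 180}
    for label, n_periods in scenarios.items():
        counts = per_country_counts(46, n_periods, 3000)
        print(label, "min per country:", min(counts.values()),
              "max per country:", max(counts.values()))
```

Under these assumptions the 276-category case can only vary by a few tips per country (a base of 10 per category plus at most one extra each), while the 8280-category case draws 3000 of 8280 categories, so per-country totals depend entirely on the random draw.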
Grouping by day is not good, because daily sequencing volume varies a lot whereas weekly volume does not. There's not much collection on Saturdays, Sundays, etc. Weekly is the right way to go for now - definitely better than just monthly. Sorry, I only saw this now.
Context
@trvrb from nextstrain/ncov#957:
Example
When requesting --subsample-max-sequences, this will evenly sample from the 3 groups 2022-03, 2022-04, 2022-05. However, note that --min-date and --max-date make the sampling window half of 2022-03, all of 2022-04, and half of 2022-05. An ideal sampling strategy would sample proportional to the sampling window (e.g. a 2/4/2 split).
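To make the proportional idea concrete, here is a minimal sketch of one possible approach: weight each month by the number of its days that fall inside the date window, then split the requested total with largest-remainder rounding. This is not augur's implementation. The helper names `month_weights` and `allocate`, the window 2022-03-16 to 2022-05-15, and the total of 8 sequences are all assumptions chosen to match the half/full/half example above.

```python
import calendar
from datetime import date, timedelta


def month_weights(min_date, max_date):
    """Map 'YYYY-MM' to the number of days of that month inside the window."""
    weights = {}
    current = min_date
    while current <= max_date:
        last_day = calendar.monthrange(current.year, current.month)[1]
        month_end = date(current.year, current.month, last_day)
        window_end = min(month_end, max_date)
        key = f"{current.year:04d}-{current.month:02d}"
        weights[key] = (window_end - current).days + 1
        current = month_end + timedelta(days=1)
    return weights


def allocate(total, weights):
    """Split total across groups in proportion to weights (largest remainder)."""
    weight_sum = sum(weights.values())
    exact = {k: total * w / weight_sum for k, w in weights.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = total - sum(alloc.values())
    # Hand the leftover units to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical window and total, chosen for illustration only.
    weights = month_weights(date(2022, 3, 16), date(2022, 5, 15))
    print(weights)
    print(allocate(8, weights))
```

With the hypothetical window above, the weights are 16, 30, and 15 days, and a total of 8 is split 2/4/2 across 2022-03, 2022-04, and 2022-05.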