-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter: Errors out when there are more groups than requested sequences #654
Comments
I definitely agree we should probably change the error message to advise users to run again, using For having 208 groups and wanting 200 samples, this is pretty harmless, but if there were 500 groups and you wanted 200 samples, you could end up with pretty different sampling than you intended. I'm not sure that we want that to happen silently? But we could definitely make it much easier for users to figure out whether they wither want to re-think their groupings & continue with what they ask for, or whether they want to keep everything else 'as-is' and add |
The thinking back then was that if the user requests fewer sequences with categories, that is an unfulfillable request should result in an error: |
I think we should get the probablistic subsampling released and flesh out the error message. We could also default to probabilistic sampling if there are too many groups. Seems less surprising then taking the first |
I see what you mean @rneher and @emmahodcroft. Immediately I certainly agree with updating the error message to recommend probabilistic subsampling. However, what happens now if you have a situation that looks like:
Ie if we allow Actually, I'm now confused on why the
isn't a lot to go on. It's also confusing that |
I don't think that is what happens. I the former case, we get 150 sequences, one from each group. In the latter case, we get an error.
mainly so that we don't change default behavior as this is an addition to what was there before. But I agree that in most cases prob sampling is probably the desired behavior. |
Got it... this makes much more sense to me now. A couple thoughts then. I would have been less confused if instead of calling this Additionally, I don't like a flag like this for something we probably want as default behavior. We'll be stuck with everyone writing Here, my thinking is either:
|
One concern was that it's easy for users to unknowingly request more groups than there are total samples requested. The only way users would find out that this is happening is for Augur to throw an error. If the "probabilistic sampling" behavior should be the default, though, is there really a use case for enforcing the integer group sizes? This opt-in flag would require users to know enough about the subsampling logic to override the fractional group sizes. I personally have to re-read this subsampling code to remember how the logic works. I wonder how many users would care to have this option... As another way forward could we just do:
Then, if people find a need to enforce the old behavior, we could add a flag to toggle it in a later minor release. Separately, it would be nice to have a user interface to explore different subsampling schemes interactively for a given metadata file. Then users could be more confident about the composition of their builds. This probably relates to @jameshadfield's RFC for an augur subsample command. |
I am not a sure |
Ok, I just released augur v10.2.0 with probabilistic sampling. |
That would be brilliant John, and really help out our shepherds! Thank you! |
So in Augur v11 when |
We can keep the flag in v11, label it as deprecated, and remove it in the next major version (v12). I think this is what we did with It sounds like we want to push for this new default behavior sooner than later, though. If so, we could plan to release v11 with this change next week. |
Makes sense. Copying @rneher's Slack comment here as well:
|
So, I accidentally pushed commits that should have been in a branch to Besides addressing any errors that might arise when CI tests finish, do folks have any comments about these changes? |
No worries at all. Thanks for these additions John. My feeling is that it's legitimate to have both |
I just reverted those two commits on |
Current Behavior
I encountered this when running the latest
ncov
global build. It includes the command:This is requesting 200 sequences from the "early" timepoint for Europe. However, this errors out with:
If I understand #629 correctly, I can fix this with
--probabilistic-sampling
. Okay, and this indeed fixes the issue. However... I believe this shouldn't error out without--probabilistic-sampling
. It's a surprise to the user and tricky to debug.Expected behavior
I would definitely suggest to just take the first 200 groups deterministically when
--probabilistic-sampling
is not set. This will be nicer for users.I realize this is somewhere between a bug and a feature request.
The text was updated successfully, but these errors were encountered: