Fix probabilistic subsampling for small values #759
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description of proposed changes
Fixes a regression from the rewrite of augur filter where probabilistic subsampling for small values of
--subsample-max-sequences
could randomly select zero strains and randomly cause our CI tests to fail. Prior to the rewrite of augur filter and introduction of priority queues, we fixed this issue by repeatedly attempting to calculate sequences per group that summed to an integer value greater than zero. However, the way I implemented random queue sizes inside thePriorityQueue
class in the rewrite prevented me from using a similar "multiple attempts" approach.This commit redesigns the way we create priority queues. In the case where we already know the number of sequences per group in the first pass, we create an appropriately-sized priority queue for each group as we encounter it. There is no possibility that the sum of these queue sizes could be zero.
In the case where we need to calculate the number of sequences per group from the first pass, we already know all possible groups and can create their priority queues in bulk. The new
create_queues_by_group
function allows us to create fixed-sized or randomly-sized queues and also make multiple attempts when queue sizes sum to zero. As a result, thePriorityQueue
class is much simpler (it requires no logic about random max sizes) and we can actually test the fixed and random behaviors more carefully with doctests forcreate_queues_by_group
.Testing
create_queues_by_group
function