Fix probabilistic subsampling for small values #759

huddlej · 2021-08-13T22:12:35Z

Description of proposed changes

Fixes a regression from the rewrite of augur filter where probabilistic subsampling for small values of --subsample-max-sequences could randomly select zero strains and randomly cause our CI tests to fail. Prior to the rewrite of augur filter and introduction of priority queues, we fixed this issue by repeatedly attempting to calculate sequences per group that summed to an integer value greater than zero. However, the way I implemented random queue sizes inside the PriorityQueue class in the rewrite prevented me from using a similar "multiple attempts" approach.

This commit redesigns the way we create priority queues. In the case where we already know the number of sequences per group in the first pass, we create an appropriately-sized priority queue for each group as we encounter it. There is no possibility that the sum of these queue sizes could be zero.

In the case where we need to calculate the number of sequences per group from the first pass, we already know all possible groups and can create their priority queues in bulk. The new create_queues_by_group function allows us to create fixed-sized or randomly-sized queues and also make multiple attempts when queue sizes sum to zero. As a result, the PriorityQueue class is much simpler (it requires no logic about random max sizes) and we can actually test the fixed and random behaviors more carefully with doctests for create_queues_by_group.

Testing

Add new doctests for the new create_queues_by_group function
Pass CI suite
Test with ncov workflow

Fixes a regression from the rewrite of augur filter where probabilistic subsampling for small values of `--subsample-max-sequences` could randomly select zero strains and randomly cause our CI tests to fail. Prior to the rewrite of augur filter and introduction of priority queues, we fixed this issue by repeatedly attempting to calculate sequences per group that summed to an integer value greater than zero. However, the way I implemented random queue sizes inside the `PriorityQueue` class in the rewrite prevented me from using a similar "multiple attempts" approach. This commit redesigns the way we create priority queues. In the case where we already know the number of sequences per group in the first pass, we create an appropriately-sized priority queue for each group as we encounter it. There is no possibility that the sum of these queue sizes could be zero. In the case where we need to calculate the number of sequences per group from the first pass, we already know all possible groups and can create their priority queues in bulk. The new `create_queues_by_group` function allows us to create fixed-sized or randomly-sized queues and also make multiple attempts when queue sizes sum to zero. As a result, the `PriorityQueue` class is much simpler (it requires no logic about random max sizes) and we can actually test the fixed and random behaviors more carefully with doctests for `create_queues_by_group`.

codecov · 2021-08-13T22:15:14Z

Codecov Report

Merging #759 (5db04fc) into master (ae80716) will increase coverage by 0.10%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master     #759      +/-   ##
==========================================
+ Coverage   33.55%   33.66%   +0.10%     
==========================================
  Files          41       41              
  Lines        5871     5882      +11     
  Branches     1419     1423       +4     
==========================================
+ Hits         1970     1980      +10     
- Misses       3816     3817       +1     
  Partials       85       85

Impacted Files	Coverage Δ
augur/filter.py	`68.02% <80.00%> (+0.49%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae80716...5db04fc. Read the comment docs.

huddlej merged commit 52a41c3 into master Aug 13, 2021

huddlej deleted the fix-probabilistic-sampling branch August 13, 2021 23:12

huddlej added this to the Patch release 12.1.1 milestone Aug 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix probabilistic subsampling for small values #759

Fix probabilistic subsampling for small values #759

huddlej commented Aug 13, 2021 •

edited

Loading

codecov bot commented Aug 13, 2021 •

edited

Loading

Fix probabilistic subsampling for small values #759

Fix probabilistic subsampling for small values #759

Conversation

huddlej commented Aug 13, 2021 • edited Loading

Description of proposed changes

Testing

codecov bot commented Aug 13, 2021 • edited Loading

Codecov Report

huddlej commented Aug 13, 2021 •

edited

Loading

codecov bot commented Aug 13, 2021 •

edited

Loading