Fix issues with reproducible subsampling in filter #772
Addresses three cases of non-deterministic data processing in the filter module that caused the output of the same command, run with the same random seed, to change between runs. The three cases are:

1. When adding strains to priority queues, the non-deterministic order of strains stored in sets caused the deterministic "random" priorities for a given random seed to be assigned to different strains.
2. When creating randomly sized priority queues by sampling from a Poisson distribution, the sampling did not use the user-provided random seed, so queue sizes differed between runs.
3. Groups passed to the `create_queues_by_group` function (used when the user provides `--subsample-max-sequences`) were randomly ordered (due to the random order of strains stored in sets), causing different groups to get different-sized queues between runs, even when the queue sizes themselves were stable across runs.

This commit adds unit and functional tests to catch each of these cases and adds the code changes required for these tests to pass.
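
For reviewers, the general pattern behind the three cases above can be sketched as follows. This is a minimal illustration, not the code in this PR. The function names `assign_priorities`, `queue_sizes_by_group`, and `poisson_sample` are hypothetical, and the standard-library `random.Random` with a hand-written Knuth-style Poisson sampler stands in for whatever random number generator the filter module actually uses. The idea is to sort anything that came from a set before consuming random numbers, and to draw every random value from a generator built from the user's seed.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson-distributed integer with Knuth's multiplication method.

    Hypothetical stand-in sampler: every draw comes from the seeded `rng`,
    never from an unseeded global generator.
    """
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def assign_priorities(strains, seed):
    """Assign a seeded random priority to each strain in a stable order."""
    rng = random.Random(seed)
    # Sorting removes the dependence on set iteration order, which can
    # differ between Python processes for sets of strings.
    return {strain: rng.random() for strain in sorted(strains)}


def queue_sizes_by_group(groups, mean_size, seed):
    """Draw a seeded, Poisson-distributed queue size for each group."""
    rng = random.Random(seed)
    # Sorted groups ensure each group always receives the same draw.
    return {group: poisson_sample(rng, mean_size) for group in sorted(groups)}


# The same seed gives the same result regardless of input order.
strains = {"strain_c", "strain_a", "strain_b"}
first = assign_priorities(strains, seed=314159)
second = assign_priorities(reversed(sorted(strains)), seed=314159)
print(first == second)
```

Sorting costs an extra pass over the strains and groups, but it makes the mapping from random draws to strains independent of how the inputs happened to be stored.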
Thank you to Philip Shirk for catching this issue and describing it on
the Nextstrain discussion site [1].
Fixes #770
[1] https://discussion.nextstrain.org/t/augur-filter-subsample-seed-reproducible-example/723