Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues with reproducible subsampling in filter #772

Merged
merged 1 commit into from
Sep 23, 2021

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Sep 23, 2021

Addresses three cases of non-deterministic data processing in the filter module that caused output for the same command with a random seed to change between runs. The three main cases are:

  1. When adding strains to priority queues, the non-deterministic order of strains stored in sets causes deterministic "random" priorities for a given random seed to be assigned to different strains.

  2. When creating randomly-sized priority queues by sampling from a Poisson distribution, that sampling did not use the user-provided random seed and would produce randomly-sized queues.

  3. Groups passed to the create_queues_by_group function (used when the user provides --subsample-max-sequences) were randomly ordered (due to random order of strains stored in sets) causing different groups to get different sized queues between runs (even when queue size was stable across runs).

This commit adds unit and functional tests to catch each of these cases and adds the code changes required for these tests to pass.

Thank you to Philip Shirk for catching this issue and describing it on
the Nextstrain discussion site [1].

Fixes #770

[1] https://discussion.nextstrain.org/t/augur-filter-subsample-seed-reproducible-example/723

@codecov
Copy link

codecov bot commented Sep 23, 2021

Codecov Report

Merging #772 (f648429) into master (e772855) will decrease coverage by 0.01%.
The diff coverage is 42.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #772      +/-   ##
==========================================
- Coverage   33.76%   33.75%   -0.02%     
==========================================
  Files          41       41              
  Lines        5896     5898       +2     
  Branches     1427     1427              
==========================================
  Hits         1991     1991              
- Misses       3819     3821       +2     
  Partials       86       86              
Impacted Files Coverage Δ
augur/filter.py 68.04% <42.85%> (-0.26%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e772855...f648429. Read the comment docs.

tests/functional/filter.t Outdated Show resolved Hide resolved
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me! Thanks for addressing it.

Addresses three cases of non-deterministic data processing in the filter
module that caused output for the same command with a random seed to
change between runs. The three main cases are:

1. When adding strains to priority queues, the non-deterministic order
of strains stored in sets causes deterministic "random" priorities for a
given random seed to be assigned to different strains.

2. When creating randomly-sized priority queues by sampling from a
Poisson distribution, that sampling did not use the user-provided random
seed and would produce randomly-sized queues.

3. Groups passed to the `create_queues_by_group` function (used when the
user provides `--subsample-max-sequences`) were randomly ordered (due to
random order of strains stored in sets) causing different groups to get
different sized queues between runs (even when queue size was stable
across runs).

This commit adds unit and functional tests to catch each of these cases
and adds the code changes required for these tests to pass.

Thank you to Philip Shirk for catching this issue and describing it on
the Nextstrain discussion site [1].

[1] https://discussion.nextstrain.org/t/augur-filter-subsample-seed-reproducible-example/723
@huddlej huddlej force-pushed the reproducible-subsampling branch from f54ab6a to f648429 Compare September 23, 2021 23:09
@huddlej huddlej merged commit 17f3623 into master Sep 23, 2021
@huddlej huddlej deleted the reproducible-subsampling branch September 23, 2021 23:53
@huddlej huddlej added this to the Patch release 13.0.1 milestone Sep 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[filter]: Probalistic sampling doesn't use the given RNG seed
2 participants