Allow `--subsample-max-sequences` without `--group-by` #710
Conversation
…ies are defined in group-by
Codecov Report
```
@@            Coverage Diff             @@
##           master     #710      +/-   ##
==========================================
- Coverage   31.31%   31.01%   -0.30%
==========================================
  Files          41       41
  Lines        5649     5716      +67
  Branches     1365     1400      +35
==========================================
+ Hits         1769     1773       +4
- Misses       3806     3864      +58
- Partials       74       79       +5
```
Continue to review full report at Codecov.
@huddlej, this is a first take on resolving issue #680. While testing, I noticed that the number of strains dropped during filtering varied randomly from run to run. Is this behaviour intended, or do I need to look further into how filter and subsample work when using this dummy category?
This looks good, @benjaminotter, and works for me locally. I made a couple of minor requests inline. In addition to addressing those, could you also add a new functional test for this new feature to `tests/functional/filter.t`? These tests are written for the Cram testing tool and follow a simple syntax, kind of like Python doctests for the command line. You can copy one of the existing filter tests that uses `--subsample-max-sequences`, drop the `--group-by` part of the test, and update the expected output accordingly. You'll want to keep the `--no-probabilistic-sampling` flag for reasons I describe below.

Could you also update the help text for the `--subsample-max-sequences` argument to mention that this argument can be used without the `--group-by` argument?
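A hypothetical Cram test for this feature might look like the sketch below. The file paths, sequence count, and expected output are illustrative assumptions, not copied from the actual `tests/functional/filter.t` suite: commands are indented and prefixed with `$` (continuations with `>`), and the expected output follows at the same indentation.

```
Subsample without any grouping.

  $ ${AUGUR} filter \
  >  --metadata filter/metadata.tsv \
  >  --sequences filter/sequences.fasta \
  >  --subsample-max-sequences 5 \
  >  --no-probabilistic-sampling \
  >  --output "$TMP/filtered.fasta"
  $ grep -c ">" "$TMP/filtered.fasta"
  5
```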
> the number of strains dropped during filtering varied randomly from run to run. Is this behaviour intended or do I need to further look into how filter and subsample works when using this dummy category?
This behavior reflects the way "probabilistic sampling" works. Probabilistic sampling determines the number of sequences per group by drawing from a Poisson distribution whose mean is the requested number of sequences per group. This means that even when exactly as many sequences as you request (or more) are available, the Poisson draw will randomly request fewer or more. It's not an unexpected outcome, but I personally think it reflects a bug. Resolving that issue is outside the scope of this PR, though.
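The run-to-run variation can be sketched as below. This is a minimal illustration of the probabilistic-sampling idea described above, not augur's actual implementation; the function name and the seeded generator are assumptions for the example.

```python
import numpy as np

# Seeded only so the example is reproducible; augur would use an unseeded RNG,
# which is why the number of kept strains varies between runs.
rng = np.random.default_rng(0)

def probabilistic_group_sizes(max_sequences, n_groups):
    """Draw a per-group quota from a Poisson distribution whose mean is the
    requested number of sequences per group (hypothetical helper)."""
    mean_per_group = max_sequences / n_groups
    return rng.poisson(mean_per_group, size=n_groups)

# With a single (dummy) group, the drawn quota -- and therefore the number of
# strains kept -- fluctuates around max_sequences from run to run.
totals = [int(probabilistic_group_sizes(100, 1).sum()) for _ in range(5)]
print(totals)
```

Averaged over many draws the quota centers on the requested total, but any single run can keep fewer or more sequences, which matches the behaviour observed in testing.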
…ategory. Removes obsolete checks for _dummy category
… new functionality
@huddlej, thank you for the feedback! I updated the code according to your requested changes.
LGTM! Thanks again, @benjaminotter.
Description of proposed changes
Allows subsampling in the case where `--subsample-max-sequences` is set but `--group-by` is not. As suggested in issue #680, all samples are grouped into a dummy category, from which they are then subsampled.
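The dummy-category approach can be sketched as follows. This is a simplified illustration of the idea from issue #680, not the PR's actual code; the function name, the `("_dummy",)` key, and the record layout are assumptions for the example.

```python
from collections import defaultdict

def group_records(records, group_by=None):
    """Group metadata records by the requested columns. When no --group-by
    columns are given, place every record under a single dummy key so the
    same per-group subsampling code path applies (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        if group_by:
            key = tuple(record[col] for col in group_by)
        else:
            key = ("_dummy",)  # one group holding all samples
        groups[key].append(record)
    return dict(groups)

records = [
    {"strain": "A", "region": "europe"},
    {"strain": "B", "region": "asia"},
]
print(group_records(records))              # single "_dummy" group
print(group_records(records, ["region"]))  # one group per region
```

Routing the no-`--group-by` case through the same grouping structure keeps the subsampling logic in one place instead of adding a separate ungrouped code path.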
Related issue(s)
Fixes #680
Testing