Allow `--subsample-max-sequences` without `--group-by` #710

benjaminotter · 2021-04-07T13:09:24Z

Description of proposed changes

Allows subsampling in the case where --subsample-max-sequences is set but --group-by is not.
As suggested in issue #680 all samples are grouped into a dummy category where they are subsampled from.

Related issue(s)

Fixes #680

Testing

…ies are defined in group-by

codecov · 2021-04-07T13:10:57Z

Codecov Report

Merging #710 (bb0e824) into master (24baeaf) will decrease coverage by 0.29%.
The diff coverage is 8.33%.

@@            Coverage Diff             @@
##           master     #710      +/-   ##
==========================================
- Coverage   31.31%   31.01%   -0.30%     
==========================================
  Files          41       41              
  Lines        5649     5716      +67     
  Branches     1365     1400      +35     
==========================================
+ Hits         1769     1773       +4     
- Misses       3806     3864      +58     
- Partials       74       79       +5

Impacted Files	Coverage Δ
augur/filter.py	`48.63% <8.33%> (-0.62%)`	⬇️
augur/tree.py	`8.55% <0.00%> (-0.63%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24baeaf...bb0e824. Read the comment docs.

benjaminotter · 2021-04-07T13:18:38Z

@huddlej, this is a first take on resolving issue #680
While testing the new functionality with the code provided in the zika-tutorial, modified to use --subsample-max-sequences without --group-by:

augur filter \
  --sequences data/sequences.fasta \
  --sequence-index results/sequence_index.tsv \
  --metadata data/metadata.tsv \
  --exclude config/dropped_strains.txt \
  --output results/filtered.fasta \
  --subsample-max-sequences 20 \
  --min-date 2012

the number of strains dropped during filtering varied randomly from run to run. Is this behaviour intended or do I need to further look into how filter and subsample works when using this dummy category?

huddlej

This looks good, @benjaminotter, and works for me locally. I made a couple of minor requests inline. In addition to addressing those, could you also add a new functional test for this new feature to tests/functional/filter.t? These tests are written for the Cram testing tool and follow a simple syntax kind of like Python doctests for the command line. You can copy one of the existing filter tests that uses --subsample-max-sequences, drop the --group-by part of the test, and update the expected output accordingly. You'll want to keep the --no-probabilistic-sampling flag for reasons I describe below.

Could you also update the help text for the --subsample-max-sequences argument to mention that this argument can be used without the --group-by argument?

the number of strains dropped during filtering varied randomly from run to run. Is this behaviour intended or do I need to further look into how filter and subsample works when using this dummy category?

This behavior reflects the way "probabilistic sampling" works. Probabilistic sampling determines the number of sequences per group by sampling from a Poisson with a mean of the requested number of sequences per group. This means there can be exactly as many sequences as you request (or more) available, but the Poisson sampling will randomly request fewer or more sequences. It's not an unexpected outcome, but I personally think it reflects a bug. Resolving that issue is outside the scope of this PR, though.

augur/filter.py

…ategory. Removes obsolete checks for _dummy category

… new functionality

benjaminotter · 2021-04-08T09:19:25Z

@huddlej, thank you for the feedback! I updated the code according to your requested changes.
For the cram test, I used an existing test as a template and updated the expected output as you suggested, but I'm not sure if the simple test I added fully covers the new feature.

huddlej

LGTM! Thanks again, @benjaminotter.

Allows subsampling when subsample-max-sequences is set but no categor…

70ec3da

…ies are defined in group-by

benjaminotter requested a review from huddlej April 7, 2021 13:09

huddlej requested changes Apr 7, 2021

View reviewed changes

augur/filter.py Outdated Show resolved Hide resolved

augur/filter.py Outdated Show resolved Hide resolved

augur/filter.py Outdated Show resolved Hide resolved

augur/filter.py Outdated Show resolved Hide resolved

benjaminotter added 3 commits April 8, 2021 09:57

Removes commented code and renames dummy category to _dummy

243e588

Refactores the conditional for missing_cats to never include _dummy c…

1c670e4

…ategory. Removes obsolete checks for _dummy category

Updates help text for subsample-max-sequences. Adds cram test for the…

bb0e824

… new functionality

huddlej marked this pull request as ready for review April 9, 2021 23:06

huddlej approved these changes Apr 9, 2021

View reviewed changes

huddlej added this to the Next release milestone Apr 9, 2021

huddlej merged commit 6e5c9f7 into master Apr 9, 2021

huddlej deleted the subsample-without-group-by branch April 9, 2021 23:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow `--subsample-max-sequences` without `--group-by` #710

Allow `--subsample-max-sequences` without `--group-by` #710

benjaminotter commented Apr 7, 2021

codecov bot commented Apr 7, 2021 •

edited

Loading

benjaminotter commented Apr 7, 2021

huddlej left a comment

benjaminotter commented Apr 8, 2021

huddlej left a comment

Allow --subsample-max-sequences without --group-by #710

Allow --subsample-max-sequences without --group-by #710

Conversation

benjaminotter commented Apr 7, 2021

Description of proposed changes

Related issue(s)

Testing

codecov bot commented Apr 7, 2021 • edited Loading

Codecov Report

benjaminotter commented Apr 7, 2021

huddlej left a comment

Choose a reason for hiding this comment

benjaminotter commented Apr 8, 2021

huddlej left a comment

Choose a reason for hiding this comment

Allow `--subsample-max-sequences` without `--group-by` #710

Allow `--subsample-max-sequences` without `--group-by` #710

codecov bot commented Apr 7, 2021 •

edited

Loading