Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow --subsample-max-sequences without --group-by #710

Merged
merged 4 commits into from
Apr 9, 2021

Conversation

benjaminotter
Copy link
Contributor

Description of proposed changes

Allows subsampling in the case where --subsample-max-sequences is set but --group-by is not.
As suggested in issue #680 all samples are grouped into a dummy category where they are subsampled from.

Related issue(s)

Fixes #680

Testing

@benjaminotter benjaminotter requested a review from huddlej April 7, 2021 13:09
@codecov
Copy link

codecov bot commented Apr 7, 2021

Codecov Report

Merging #710 (bb0e824) into master (24baeaf) will decrease coverage by 0.29%.
The diff coverage is 8.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #710      +/-   ##
==========================================
- Coverage   31.31%   31.01%   -0.30%     
==========================================
  Files          41       41              
  Lines        5649     5716      +67     
  Branches     1365     1400      +35     
==========================================
+ Hits         1769     1773       +4     
- Misses       3806     3864      +58     
- Partials       74       79       +5     
Impacted Files Coverage Δ
augur/filter.py 48.63% <8.33%> (-0.62%) ⬇️
augur/tree.py 8.55% <0.00%> (-0.63%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24baeaf...bb0e824. Read the comment docs.

@benjaminotter
Copy link
Contributor Author

@huddlej, this is a first take on resolving issue #680
While testing the new functionality with the code provided in the zika-tutorial, modified to use --subsample-max-sequences without --group-by:

augur filter \
  --sequences data/sequences.fasta \
  --sequence-index results/sequence_index.tsv \
  --metadata data/metadata.tsv \
  --exclude config/dropped_strains.txt \
  --output results/filtered.fasta \
  --subsample-max-sequences 20 \
  --min-date 2012

the number of strains dropped during filtering varied randomly from run to run. Is this behaviour intended or do I need to further look into how filter and subsample works when using this dummy category?

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, @benjaminotter, and works for me locally. I made a couple of minor requests inline. In addition to addressing those, could you also add a new functional test for this new feature to tests/functional/filter.t? These tests are written for the Cram testing tool and follow a simple syntax kind of like Python doctests for the command line. You can copy one of the existing filter tests that uses --subsample-max-sequences, drop the --group-by part of the test, and update the expected output accordingly. You'll want to keep the --no-probabilistic-sampling flag for reasons I describe below.

Could you also update the help text for the --subsample-max-sequences argument to mention that this argument can be used without the --group-by argument?

the number of strains dropped during filtering varied randomly from run to run. Is this behaviour intended or do I need to further look into how filter and subsample works when using this dummy category?

This behavior reflects the way "probabilistic sampling" works. Probabilistic sampling determines the number of sequences per group by sampling from a Poisson with a mean of the requested number of sequences per group. This means there can be exactly as many sequences as you request (or more) available, but the Poisson sampling will randomly request fewer or more sequences. It's not an unexpected outcome, but I personally think it reflects a bug. Resolving that issue is outside the scope of this PR, though.

augur/filter.py Outdated Show resolved Hide resolved
augur/filter.py Outdated Show resolved Hide resolved
augur/filter.py Outdated Show resolved Hide resolved
augur/filter.py Outdated Show resolved Hide resolved
@benjaminotter
Copy link
Contributor Author

@huddlej, thank you for the feedback! I updated the code according to your requested changes.
For the cram test, I used an existing test as a template and updated the expected output as you suggested, but I'm not sure if the simple test I added fully covers the new feature.

@huddlej huddlej marked this pull request as ready for review April 9, 2021 23:06
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks again, @benjaminotter.

@huddlej huddlej added this to the Next release milestone Apr 9, 2021
@huddlej huddlej merged commit 6e5c9f7 into master Apr 9, 2021
@huddlej huddlej deleted the subsample-without-group-by branch April 9, 2021 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[augur filter] allow --subsample-max-sequences without --group-by
2 participants