Skip to content

Commit

Permalink
Fix missing groups for included strains
Browse files Browse the repository at this point in the history
Fixes a subtle bug when grouping with subsampled max sequences and
force-included strains that could potentially be considered for
subsampling. In the first pass through the metadata, we remove
force-included strains from consideration for subsampling. However, we
do not similarly remove these strains in the second pass through the
metadata. If these strains have not been filtered earlier, we consider
them for subsampling in the second pass and look for a queue with their
group-by attributes. Since queue creation depends on the group
attributes we find in the first pass, it is possible for the group
attributes of the force-included strains to not have a queue.

This commit adds a functional test to recreate the bug (originally
discovered in the ncov workflow) and updates the logic in the second
pass to remove force-included strains prior to building subsampling
groups.
  • Loading branch information
huddlej committed Jul 6, 2022
1 parent 0d06558 commit 22b1685
Show file tree
Hide file tree
Showing 3 changed files with 23 additions and 0 deletions.
5 changes: 5 additions & 0 deletions augur/filter.py
Original file line number Diff line number Diff line change
Expand Up @@ -1547,6 +1547,11 @@ def run(args):
# during the first pass, but we want to minimize overall memory
# usage at the moment.
seq_keep = set(metadata.index.values) & valid_strains

# Prevent force-included strains from being considered in this
# second pass, as in the first pass.
seq_keep = seq_keep - all_sequences_to_include

group_by_strain, skipped_strains = get_groups_for_subsampling(
seq_keep,
metadata,
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
Setup

$ pushd "$TESTDIR" > /dev/null
$ source _setup.sh

Do not consider force-included strains for subsampling.
In this test, we force-include two old strains that are the only representatives of their month/year date group (December 2015).
We don't filter these strains, so they could be considered for subsampling, but Augur removes them from consideration because they have been force-included.

$ ${AUGUR} filter \
> --metadata filter/data/metadata.tsv \
> --include filter/data/include_old_strains.txt \
> --group-by month year \
> --subsample-max-sequences 10 \
> --output-metadata $TMP/metadata-filtered.tsv > /dev/null
$ rm -f $TMP/metadata-filtered.tsv
2 changes: 2 additions & 0 deletions tests/functional/filter/data/include_old_strains.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
PRVABC59
COL/FLR_00008/2015

0 comments on commit 22b1685

Please sign in to comment.