Apply filters and group-by logic to a metadata iterator #750
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #750      +/-   ##
==========================================
+ Coverage   31.52%   33.46%   +1.93%
==========================================
  Files          41       41
  Lines        5754     5893     +139
  Branches     1389     1431      +42
==========================================
+ Hits         1814     1972     +158
+ Misses       3865     3837      -28
- Partials       75       84       +9
```

Continue to review the full report at Codecov.
Reorganizes the filtering logic such that we first create a list of filters to apply through a `construct_filters` function and then apply those filters through a new `apply_filters` function that loops through each filter. This new function returns a tuple of three categories of strains: strains to keep, strains to exclude, and strains to force-include. It can be applied to a full metadata file or to individual chunks of metadata. The lists of excluded and force-included strains also include the reasons why these strains were filtered.

The rest of this commit updates the corresponding calling code in `run` to use these new functions. One intended benefit of this new approach to filtering is the ability to log the reason why each strain was filtered or force-included to an additional log output file, `--output-log`.

This commit also fixes a subtle bug in the `filter_by_date` function, which accepted a user-defined date column but did not pass that column to the `get_numerical_dates` function.
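To make the shape of this refactor concrete, here is a minimal sketch of the `construct_filters`/`apply_filters` pattern described above. The example filter, argument names, and internal data structures are illustrative rather than augur's actual implementation, and the force-include category is omitted for brevity:

```python
import pandas as pd

def filter_by_min_length(metadata, min_length):
    # Example filter: return the set of strains that pass.
    return set(metadata[metadata["length"] >= min_length].index)

def construct_filters(min_length=None):
    # Build a list of (function, kwargs) pairs from user-provided arguments.
    exclude_by = []
    if min_length is not None:
        exclude_by.append((filter_by_min_length, {"min_length": min_length}))
    return exclude_by

def apply_filters(metadata, exclude_by):
    # Apply each filter to a metadata chunk (a DataFrame indexed by strain),
    # returning strains to keep plus excluded strains with their reasons.
    strains_to_keep = set(metadata.index)
    strains_to_exclude = []
    for function, kwargs in exclude_by:
        passed = function(metadata, **kwargs)
        for strain in strains_to_keep - passed:
            strains_to_exclude.append(
                {"strain": strain, "filter": function.__name__, "kwargs": kwargs}
            )
        strains_to_keep &= passed
    return strains_to_keep, strains_to_exclude

metadata = pd.DataFrame(
    {"length": [29000, 5000]}, index=pd.Index(["A", "B"], name="strain")
)
keep, excluded = apply_filters(metadata, construct_filters(min_length=10000))
# keep == {"A"}; excluded records strain "B" with the filter name and kwargs.
```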
Instead of loading all metadata into memory at once, iterate through fixed-size chunks of data, applying filters to each chunk and either grouping data into priority queues or streaming output to disk, as needed. This approach generally follows the original pseudo-code solution [1]. Users can override the default chunk size with the new `--metadata-chunksize` argument to tune the amount of memory used by a given execution of the filter command. A larger chunk size uses more memory but may also run slightly faster.

One side effect of this implementation is that it allows us to log the reason why each strain was filtered or force-included through a new `--output-log` argument. One of the output columns of the log file is a kwargs column that tracks the arguments passed to a given filter. This column is structured text in JSON format, which allows for more sophisticated reporting by specific keys. Along with this change, we apply the include/exclude logic from files per file, so we can track which specific file was responsible for including or filtering each strain.

Note that we don't use a context manager for CSV reading here. In version 1.2, pandas.read_csv was updated to act as a context manager when `chunksize` is passed, but this same version raised the minimum supported Python version to 3.7. As a result, pandas for Python 3.6 does not support the context manager (`with`) usage. Here, we always iterate through the `TextFileReader` object instead of using the context manager, an approach that works in all Python versions.

Finally, this commit changes the subsample seed argument type to `int` instead of string-or-int to match numpy's requirement for its random generator seed [2]. We do not pass a seed value to the numpy random generator immediately prior to Poisson sampling, or the generator will always sample the same values for a given mean (i.e., all queues will have the same size). Instead, we use the random seed to generate random priorities when none are provided by the user.

Fixes #424

[1] #699 (comment)
[2] https://numpy.org/doc/stable/reference/random/generator.html
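A minimal sketch of the chunked iteration and seeding behavior described above; the file name, chunk size, and seed here are illustrative:

```python
import numpy as np
import pandas as pd

# Iterating the TextFileReader directly works on every supported Python
# version; the `with pd.read_csv(...) as reader:` form requires
# pandas >= 1.2, which is unavailable for Python 3.6.
reader = pd.read_csv("metadata.tsv", sep="\t", index_col="strain",
                     chunksize=10000)

# Seed the generator once up front rather than immediately before each
# Poisson draw; re-seeding per draw would return the same values for a
# given mean, making all queues the same size.
random_generator = np.random.default_rng(seed=42)

for chunk in reader:
    # Use the seeded generator for random priorities when the user
    # provides none.
    priorities = dict(zip(chunk.index, random_generator.random(len(chunk))))
```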
Adds the ability to filter VCFs by a sequence index, to standardize the filtering interface for both FASTA and VCF inputs. This works by adding a new `index_vcf` function to the `index` module along with support for indexing VCFs with `augur index`. This function only reports the strain names in the given VCF. We then update the logic of the `filter` module to build a sequence index when the input is a VCF file and use this to filter the corresponding metadata. This approach allows us to filter VCF "sequences" without adding a special VCF-specific filter.
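For illustration, a simplified function in the spirit of `index_vcf` might pull strain names from the `#CHROM` header line of a VCF; augur's actual implementation and output format may differ:

```python
import gzip

def vcf_strain_names(vcf_path):
    # Sample (strain) names follow the nine fixed columns
    # (#CHROM through FORMAT) on the header line of a VCF.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#CHROM"):
                return line.rstrip("\n").split("\t")[9:]
    return []
```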
Allow calling code to specify its own valid index columns. This should enable using `read_metadata` to parse GISAID metadata, for example, or any other metadata with an identifier stored in a column other than Augur's defaults of "strain" and "name".
A quick attempt to support GISAID metadata prior to sanitizing, this adds a way to pass names of columns to use for an index (e.g., "strain", "name", or "Virus name").
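A sketch of what this interface could look like; the function name `read_metadata` comes from the commits above, while the delimiter handling and defaults here are simplified assumptions:

```python
import pandas as pd

def read_metadata(path, id_columns=("strain", "name")):
    # Index by the first of the given id columns found in the metadata.
    columns = pd.read_csv(path, sep="\t", nrows=0).columns
    for id_column in id_columns:
        if id_column in columns:
            return pd.read_csv(path, sep="\t", index_col=id_column)
    raise Exception(
        f"None of the possible id columns ({id_columns!r}) were found "
        f"in the metadata's columns {tuple(columns)!r}"
    )

# GISAID metadata stores its identifier in a "Virus name" column:
# read_metadata("gisaid_metadata.tsv", id_columns=("strain", "Virus name"))
```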
@tsibley This PR supersedes #749 but adds the remainder of the work required to fully reduce memory usage. I tried to group commits into logical chunks (rebasing), but a couple are unavoidably large. Nearly all commits could be squashed after review, but I left a couple unsquashed that might be contentious or at least worth discussing separately. It might be easiest to walk through this in person like we did with previous PRs, but if you prefer to scan it on your own that's cool, too. I'll also post some benchmarking results here later when I've finalized those.
Some thoughts from chatting with Tom today:
Skip building a sequence index when we know the user is excluding all strains. The only ways to add strains back in with this flag are the `--include` or `--include-where` options, so we know we don't need a sequence index to apply any additional filters.
When sequences are provided, we now filter by the strains present in those inputs even if users have not requested sequence outputs. This simplifies the decision about whether to build a sequence index to the case where a sequence index isn't provided but sequences are. We still handle the exception where the `--exclude-all` flag skips the sequence index building step.
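Condensed into code, the decision reads roughly like this hypothetical helper (names are illustrative):

```python
def needs_sequence_index(sequence_index, sequences, exclude_all):
    # Build an index only when sequences are given without a precomputed
    # index, and the user isn't excluding all strains anyway.
    return sequence_index is None and sequences is not None and not exclude_all
```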
Replace open-ended kwargs with specific keyword arguments for `read_metadata` and name these arguments (and the corresponding augur filter CLI arguments) to be less pandas-specific and closer to what we'd call these variables elsewhere in Augur (e.g., "chunksize" becomes "chunk_size"). As part of this renaming, this commit also fixes a bug in the dtype assignment for the pandas CSV reader by only specifying a dtype for the index column we end up using. This commit also adds a couple of simple doctests for `read_metadata` to clarify its expected behavior under common use cases.
Listing both the given id columns and the found metadata columns makes it easier for someone to debug directly from the error message. Switches from a KeyError to a plain Exception, since KeyError values should be just the key name itself. Includes the full error message in the doctest as an example of what to expect (and to make sure that our interpolations are working as intended).
It's worthwhile to test that exceptions are actually formatted as intended.
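For example, a doctest can pin down the exact interpolated message. This is a hypothetical helper; the message format is an assumption based on the description above:

```python
def get_index_column(id_columns, metadata_columns):
    """Return the first id column present in the metadata's columns.

    >>> get_index_column(("strain", "name"), ("country", "date"))
    Traceback (most recent call last):
      ...
    Exception: None of the possible id columns (('strain', 'name')) were found in the metadata's columns ('country', 'date')
    """
    for column in id_columns:
        if column in metadata_columns:
            return column
    raise Exception(
        f"None of the possible id columns ({id_columns!r}) were found "
        f"in the metadata's columns {tuple(metadata_columns)!r}"
    )
```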
New changes look better, I agree. I had one small comment about an error message, but instead of leaving a comment, I pushed a commit (0be8869) with my proposal. I also pushed a tangential change (562b0e4) that I noticed while doing the error message work.
I'll also note what I already noted in Slack: One thing that jumped out is that I'd have broken e975fef up into three commits to make each distinct change smaller. One tell I use when thinking about how to chunk up commits is if I'm tempted to use "also" when describing the changes. That commit uses "also" twice, to introduce two additional changes. :-)
We now filter by observed sequences by always creating a sequence index when sequence data are provided.
Description of proposed changes
This PR rewrites the logic for filtering records to avoid loading all metadata into memory at once. To achieve this goal, we filter metadata in one or two passes through the full data, iterating through fixed-size chunks so that only a bounded amount of data is in memory at any given time. In the best-case scenario (filtering only, or grouping by columns with a predefined number of sequences per group), we only need one pass through the metadata. During this pass, we can stream records that pass all filters to disk or store the highest-priority records in a fixed-size priority queue to be written later.
In the worst-case scenario (grouping by columns with a maximum number of total sequences), we need a first pass to calculate the number of sequences per group and a second pass to select the highest-priority sequences in each group.
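The fixed-size priority queue can be sketched as a bounded min-heap that keeps the N highest-priority records seen across all chunks; this is illustrative, and augur's actual queue implementation may differ:

```python
import heapq
import itertools

class PriorityQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self.counter = itertools.count()  # tie-breaker for equal priorities
        self.heap = []

    def add(self, item, priority):
        # Push the new record, then evict the lowest-priority record
        # whenever the queue exceeds its fixed size.
        heapq.heappush(self.heap, (priority, next(self.counter), item))
        if len(self.heap) > self.max_size:
            heapq.heappop(self.heap)

    def get_items(self):
        # Highest-priority records first.
        return [item for _, _, item in sorted(self.heap, reverse=True)]

queue = PriorityQueue(max_size=2)
for strain, priority in [("A", 0.1), ("B", 0.9), ("C", 0.5)]:
    queue.add(strain, priority)
print(queue.get_items())  # ['B', 'C']
```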
In addition to this major change in how the filter logic works, this PR also makes the following changes:
- Adds a new `--output-log` argument that accepts a tab-delimited file that will contain each filtered strain and its reason for being filtered (or force-included)
- Adds a new `--metadata-chunksize` argument to control the number of metadata records loaded into memory at a time. This argument allows users to control memory use by Augur to match their compute environment.
- Uses probabilistic sampling only when `--probabilistic-sampling` is `True` and the non-probabilistic number of sequences per group cannot be calculated.
- Adds support for indexing VCFs through `augur index` and from the Augur Python API with a new `index_vcf` function. These interfaces allow VCFs to be treated the same as FASTA files when filtering by sequence presence/absence.
- Writes the names of strains that pass all filters to the file given by `--output-strains`.

Related issue(s)
Fixes #699
Fixes #424
TODO

- Rename `include_by_query` to `include_by_include_where`

Maybe for later PRs:

- Move logic in `run` that could be in its own functions (e.g., setup grouping, write metadata, etc.)

Steps for later PRs:

- Replace `construct_filters` with factory functions that generate filters and integrate with argparse types

Testing