
Apply filters and group-by logic to a metadata iterator #750

Merged 12 commits into master on Aug 12, 2021
Conversation

@huddlej (Contributor) commented Jul 23, 2021

Description of proposed changes

This PR rewrites the logic for filtering records to avoid loading all metadata into memory at once. To achieve this goal, we filter metadata in one or two passes through the full data, keeping only a fixed-size chunk of records in memory at any given time. In the best case scenario (filtering only, or grouping by columns with a predefined number of sequences per group), we only need one pass through the metadata. During this pass, we can stream records that pass all filters to disk or store the highest-priority records in a fixed-size priority queue to be written later.

In the worst case scenario (grouping by columns with a maximum number of total sequences), we need the first pass to calculate the number of sequences per group and a second pass to select the highest priority sequences in each group.
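The fixed-size priority queue idea above can be sketched with Python's heapq. This is an illustrative stand-in for the approach described, not the PR's actual implementation, and the class and method names are made up:

```python
import heapq

class FixedSizePriorityQueue:
    """Keep only the max_size highest-priority records seen so far."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.heap = []
        self.counter = 0  # tie-breaker so records themselves are never compared

    def add(self, record, priority):
        # Push onto a min-heap; the lowest-priority record sits at the root.
        heapq.heappush(self.heap, (priority, self.counter, record))
        self.counter += 1
        if len(self.heap) > self.max_size:
            # Evict the lowest-priority record to keep memory use bounded.
            heapq.heappop(self.heap)

    def get_records(self):
        return [record for _, _, record in self.heap]
```

Because the queue never holds more than max_size items, memory use stays constant no matter how many records stream through it.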

In addition to this major change in how the filter logic works, this PR also makes the following changes:

  • Adds an --output-log argument that accepts a path to a tab-delimited file listing each filtered strain and its reason for being filtered (or force-included)
  • Adds a --metadata-chunksize argument to control the number of metadata records loaded into memory at a time. This argument allows users to control memory use by Augur to match their compute environment.
  • Defaults to non-probabilistic subsampling when possible and only switches to probabilistic sampling when --probabilistic-sampling is enabled and the non-probabilistic number of sequences per group cannot be calculated.
  • Adds support for indexing VCF files with augur index and from the Augur Python API with a new index_vcf function. These interfaces allow VCFs to be treated the same as FASTA files when filtering by sequence presence/absence.
  • Removes a check for strains actually present in input sequences (when provided) and relies entirely on the sequence index to filter metadata and strain outputs (this is a breaking change from previous behavior but only affects the use case when a provided sequence index is out of date relative to the provided sequences).
  • Removes ordering of strain names prior to output via --output-strains.

Related issue(s)

Fixes #699
Fixes #424

TODO

Maybe for later PRs:

  • Clean up and refactor code in run that could be in its own functions (e.g., setup grouping, write metadata, etc.)
  • Standardize naming of "strains", "sequences", and "records"

Steps for later PRs:

Testing

  • Tested by CI
  • Tests for probabilistic sampling
  • Tests for subsampling by priority
  • Test with full ncov metadata
  • Test with full Nextstrain build

@huddlej huddlej changed the title Filter chunks Apply filters and group-by logic to a metadata iterator Jul 23, 2021
codecov bot commented Jul 23, 2021

Codecov Report

Merging #750 (bc076d4) into master (653079e) will increase coverage by 1.93%.
The diff coverage is 58.52%.


@@            Coverage Diff             @@
##           master     #750      +/-   ##
==========================================
+ Coverage   31.52%   33.46%   +1.93%     
==========================================
  Files          41       41              
  Lines        5754     5893     +139     
  Branches     1389     1431      +42     
==========================================
+ Hits         1814     1972     +158     
+ Misses       3865     3837      -28     
- Partials       75       84       +9     
Impacted Files Coverage Δ
augur/index.py 64.86% <10.00%> (+12.23%) ⬆️
augur/filter.py 67.72% <58.63%> (+15.64%) ⬆️
augur/utils.py 41.59% <91.66%> (+1.41%) ⬆️
augur/io.py 100.00% <100.00%> (ø)
augur/export_v2.py 11.04% <0.00%> (-0.04%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

huddlej added 6 commits July 30, 2021 16:27
Reorganizes filtering logic such that we first create a list of filters
to apply through a `construct_filters` function and then apply those
filters through a new `apply_filters` function that loops through each
filter. This new function returns a tuple of three categories of strains:
strains to keep, strains to exclude, and strains to force-include.
This function can be applied to a full metadata file or individual
chunks of metadata. The lists of excluded and force-included strains
also include the reasons why these strains were filtered.

The rest of this commit updates the corresponding calling code in `run`
to use these new functions. One intended benefit of this new approach to
filtering is the ability to log the reason why each strain was filtered
or force-included as an additional log output file, `--output-log`.

This commit also fixes a subtle bug in the `filter_by_date` function
where the function accepts a user-defined date column but was not
passing this column to the `get_numerical_dates` function.
Instead of loading all metadata into memory at once, iterate through
fixed-size chunks of data, applying filters to these data and either
grouping data into priority queues or streaming output to disk, as
needed. This approach generally follows the original pseudo-code
solution [1]. Users can override the default chunk size with the new
`--metadata-chunksize` argument to tune the amount of memory used by a
given execution of the filter command. A larger chunk size uses more
memory but may also run slightly faster.

One side-effect of this implementation is that it allows us to log the
reason why each strain was filtered or force-included in a new
`--output-log` argument. One of the output columns of the log file is a
kwargs column that tracks the argument passed to a given filter. This
column is structured text in JSON format which allows for more
sophisticated reporting by specific keys.
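That log output might look like the following sketch; the column names and layout here are illustrative, not the exact --output-log schema:

```python
import json

def write_filter_log(rows, handle):
    """Write one tab-delimited row per filtered or force-included strain.

    Hypothetical sketch of the log format described above: each row names
    the strain, the filter that acted on it, and that filter's arguments
    serialized as JSON for structured downstream reporting.
    """
    handle.write("strain\tfilter\tkwargs\n")
    for strain, filter_name, kwargs in rows:
        handle.write(f"{strain}\t{filter_name}\t{json.dumps(kwargs)}\n")
```

Keeping the kwargs column as JSON means a report can group filtered strains by, say, a specific minimum date rather than only by filter name.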

Along with this change, we apply the include/exclude logic from files
per file so we can track which specific file was responsible for
including or filtering each strain.

Note that we don't use a context manager for CSV reading here. In version
1.2, pandas.read_csv was updated to act as a context manager when
`chunksize` is passed but this same version increased the minimum Python
version supported to 3.7. As a result, pandas for Python 3.6 does not
support the context manager `with` usage. Here, we always iterate
through the `TextFileReader` object instead of using the context
manager, an approach that works in all Python versions.
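The always-iterate pattern looks like the following sketch, with made-up metadata (assumes pandas is installed):

```python
from io import StringIO

import pandas as pd

# With `chunksize` set, read_csv returns a TextFileReader. We iterate it
# directly instead of entering it with a `with` statement, so the code works
# on pandas versions both before and after 1.2.
metadata = StringIO(
    "strain\tdate\n"
    "A\t2020-01-01\n"
    "B\t2020-02-01\n"
    "C\t2020-03-01\n"
)
reader = pd.read_csv(metadata, sep="\t", chunksize=2)

valid_strains = set()
for chunk in reader:
    # Apply filters chunk by chunk; here, keep strains sampled after January.
    passed = chunk[chunk["date"] > "2020-01-31"]
    valid_strains.update(passed["strain"])
```

Only one chunk (here, two records) is in memory at a time, regardless of the total size of the metadata.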

Finally, this commit changes the subsample seed argument type to an
`int` instead of string or int to match numpy's requirement for its
random generator seed [2]. We do not re-seed the numpy random generator
immediately prior to Poisson sampling, or the generator would always
sample the same values for a given mean (i.e., all queues would have the
same size). Instead, we use the random seed to generate random priorities
when none are provided by the user.
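A minimal sketch of that seeding behavior (illustrative values; assumes numpy is installed):

```python
import numpy as np

# Seed the generator once from the user's integer seed, then draw one
# Poisson sample per group for the queue sizes. Re-seeding immediately
# before each draw would make every queue the same size.
seed = 314159  # illustrative; in Augur this comes from the subsample seed
rng = np.random.default_rng(seed)

sequences_per_group = 1.5  # fractional target, hence probabilistic sampling
groups = ["Asia", "Europe", "Africa"]
queue_sizes = rng.poisson(sequences_per_group, size=len(groups))
```

The sizes vary per group for a single run but are reproducible across runs with the same seed.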

Fixes #424

[1] #699 (comment)
[2] https://numpy.org/doc/stable/reference/random/generator.html
Adds the ability to filter VCFs by a sequence index, to standardize the
filtering interface for both FASTA and VCF inputs. This works by adding
a new `index_vcf` function to the `index` module along with support for
indexing VCFs with `augur index`. This function only reports the strain
names in the given VCF.
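Since a VCF's sample (strain) names all appear on its `#CHROM` header line, the core of such an index can be sketched as follows. This is a simplification with a hypothetical function name; it assumes an uncompressed VCF already split into lines:

```python
def vcf_strain_names(lines):
    """Return the sample (strain) names from a VCF's header lines.

    Sample columns follow the nine fixed VCF columns
    (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT).
    """
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            return columns[9:]
    return []
```

Reporting only strain names is enough to let the filter logic treat a VCF like a FASTA sequence index for presence/absence checks.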

We then update the logic of the `filter` module to build a sequence
index when the input is a VCF file and use this to filter the
corresponding metadata. This approach allows us to filter VCF
"sequences" without adding a special VCF-specific filter.
Allow calling code to specify its own valid index columns. This should
enable using `read_metadata` to parse GISAID metadata, for example, or
any other metadata with an identifier stored in a column other than
Augur's defaults of "strain" and "name".
A quick attempt to support GISAID metadata prior to sanitizing, this
adds a way to pass names of columns to use for an index (e.g., "strain",
"name", or "Virus name").
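The lookup described here might look like the following sketch; the function name and error wording are hypothetical:

```python
def find_id_column(metadata_columns, id_columns=("strain", "name")):
    """Return the first caller-provided id column present in the metadata.

    Raises an informative error naming both the candidate id columns and
    the columns actually found, to make debugging from the message easy.
    """
    for column in id_columns:
        if column in metadata_columns:
            return column
    raise Exception(
        f"None of the possible id columns ({id_columns!r}) were found "
        f"in the metadata's columns {tuple(metadata_columns)!r}"
    )
```

With this approach, GISAID-style metadata works by passing, e.g., `id_columns=("strain", "name", "Virus name")`.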
@huddlej huddlej marked this pull request as ready for review July 31, 2021 01:03
@huddlej huddlej requested a review from tsibley July 31, 2021 01:04
@huddlej
Copy link
Contributor Author

huddlej commented Jul 31, 2021

@tsibley This PR supersedes #749 but adds the remainder of the work required to fully reduce memory usage. I tried to group commits into logical chunks (rebasing), but a couple are unavoidably large. Nearly all commits could be squashed after review, but I left a couple unsquashed that might be contentious or at least worth discussing separately.

It might be easiest to walk through this in person like we did with previous PRs, but if you prefer to scan it on your own that's cool, too.

I'll also post some benchmarking results here later when I've finalized those.

@huddlej
Copy link
Contributor Author

huddlej commented Aug 5, 2021

Some thoughts from chatting with Tom today:

  • The current implementation of two passes through metadata assumes the input can be re-read, but this will not be true for streaming input. To support streaming input, we should stream the metadata to a temporary local file in the first pass and read from this file in the second pass.
    • Since we don't immediately need to support streaming inputs, we can keep the current implementation and add that support in a later PR.
    • The sanitize metadata script in the ncov workflow will need to use a similar spool-to-disk approach because it transforms and deduplicates records. I will test out a generic implementation of the required logic in that script and then port it into augur filter when we've solidified the implementation in ncov. (This has been our pattern for most new functionality in augur lately anyway.)
  • We should consider what interface (i.e., argument/flag names) we want for Augur and then map these names to underlying technology like pandas instead of echoing the names used by pandas. This gives us flexibility to swap out pandas for something else eventually (at least, in some ways; we still refer to .loc, .index.values, etc. everywhere) and also allows us to name the interface in a way that matches our users' preferred nomenclature. For example, instead of referring to "metadata index columns" we could refer to "metadata unique id columns".
  • We do not need to add support for indexing VCFs right now, so the current approach in this PR is more than what's necessary to address the original memory issue. However, we should discuss as a group whether we want to support a "sequence index" for VCFs (something that is technically possible, given the original reference FASTA) because current users would benefit from it or we might in the future (e.g., if we move the ncov workflow to VCF from FASTA MSAs someday).

huddlej and others added 5 commits August 6, 2021 10:56
Skip building a sequence index when we know the user is excluding all
strains. The only way to add strains back in with this flag is via the
`--include` or `--include-where` options, so we know we don't need a
sequence index to apply any additional filters.
When sequences are provided, we have filtered by the strains present in
these inputs even if users have not requested sequence outputs. This
simplifies the decision about whether to build a sequence index to the
case when a sequence index isn't provided but sequences are. We still
handle the exception where the `--exclude-all` flag skips the sequence
index building step.
Replace open-ended kwargs with specific keyword arguments for
`read_metadata` and name these arguments (and the corresponding augur
filter CLI arguments) to be less pandas-specific and closer to what we'd
call these variables elsewhere in Augur (e.g., "chunksize" becomes
"chunk_size").

As part of this renaming, this commit also fixes a bug in the dtype
assignment for the pandas CSV reader by only specifying a dtype for the
index column we end up using.

This commit also adds a couple of simple doctests for `read_metadata` to
clarify its expected behavior under common use cases.
Listing both the given id columns and the found metadata columns makes
it easier for someone to debug directly from the error message.

Switches from a KeyError to plain Exception since KeyError values should
be just the key name itself.  Includes the full error message in the
doctest as an example of what to expect (and to make sure that our
interpolations are working as intended).
It's worthwhile to test exceptions are actually formatted as intended.
@tsibley (Member) left a comment
New changes look better, I agree. I had one small comment about an error message, but instead of leaving a comment, I pushed a commit (0be8869) with my proposal. I also pushed a tangential change (562b0e4) that I noticed while doing the error message work.

I'll also note what I already noted in Slack: One thing that jumped out is that I'd have broken e975fef up into three commits to make each distinct change smaller. One tell I use when thinking about how to chunk up commits is if I'm tempted to use "also" when describing the changes. That commit uses "also" twice, to introduce two additional changes. :-)

We now filter by observed sequences by always creating a sequence index
when sequence data are provided.
@huddlej huddlej merged commit 3931d5d into master Aug 12, 2021
@huddlej huddlej deleted the filter-chunks branch August 12, 2021 16:46

Successfully merging this pull request may close these issues.

  • Reduce metadata memory usage in augur filter
  • Record Filtered and Pruned Sequences
3 participants