Apply filters and group-by logic to a metadata iterator #750
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #750      +/-   ##
==========================================
+ Coverage   31.52%   33.46%   +1.93%
==========================================
  Files          41       41
  Lines        5754     5893     +139
  Branches     1389     1431      +42
==========================================
+ Hits         1814     1972     +158
+ Misses       3865     3837      -28
- Partials       75       84       +9
```

Continue to review the full report at Codecov.
Reorganizes the filtering logic such that we first create a list of filters to apply through a `construct_filters` function and then apply those filters through a new `apply_filters` function that loops through each filter. This new function returns a tuple of three categories of strains: strains to keep, strains to exclude, and strains to force-include. It can be applied to a full metadata file or to individual chunks of metadata. The lists of excluded and force-included strains also include the reasons why these strains were filtered.

The rest of this commit updates the corresponding calling code in `run` to use these new functions. One intended benefit of this new approach to filtering is the ability to log the reason why each strain was filtered or force-included to an additional log output file, `--output-log`.

This commit also fixes a subtle bug in the `filter_by_date` function, which accepted a user-defined date column but did not pass that column to the `get_numerical_dates` function.
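To make the shape of this refactor concrete, here is a minimal sketch of the `construct_filters`/`apply_filters` pattern described above. The example filter, argument names, and internal data structures are illustrative rather than augur's actual implementation, and the force-include category is omitted for brevity:

```python
import pandas as pd

def filter_by_min_length(metadata, min_length):
    # Example filter: return the set of strains that pass.
    return set(metadata[metadata["length"] >= min_length].index)

def construct_filters(min_length=None):
    # Build a list of (function, kwargs) pairs from user-provided arguments.
    exclude_by = []
    if min_length is not None:
        exclude_by.append((filter_by_min_length, {"min_length": min_length}))
    return exclude_by

def apply_filters(metadata, exclude_by):
    # Apply each filter to a metadata chunk (a DataFrame indexed by strain),
    # returning strains to keep plus excluded strains with their reasons.
    strains_to_keep = set(metadata.index)
    strains_to_exclude = []
    for function, kwargs in exclude_by:
        passed = function(metadata, **kwargs)
        for strain in strains_to_keep - passed:
            strains_to_exclude.append(
                {"strain": strain, "filter": function.__name__, "kwargs": kwargs}
            )
        strains_to_keep &= passed
    return strains_to_keep, strains_to_exclude

metadata = pd.DataFrame(
    {"length": [29000, 5000]}, index=pd.Index(["A", "B"], name="strain")
)
keep, excluded = apply_filters(metadata, construct_filters(min_length=10000))
# keep == {"A"}; excluded records strain "B" with the filter name and kwargs.
```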
Instead of loading all metadata into memory at once, iterate through fixed-size chunks of data, applying filters to each chunk and either grouping data into priority queues or streaming output to disk, as needed. This approach generally follows the original pseudo-code solution [1]. Users can override the default chunk size with the new `--metadata-chunksize` argument to tune the amount of memory used by a given execution of the filter command. A larger chunk size uses more memory but may also run slightly faster.

One side effect of this implementation is that it allows us to log the reason why each strain was filtered or force-included through a new `--output-log` argument. One of the output columns of the log file is a kwargs column that tracks the arguments passed to a given filter. This column is structured text in JSON format, which allows for more sophisticated reporting by specific keys. Along with this change, we apply the include/exclude logic from files per file, so we can track which specific file was responsible for including or filtering each strain.

Note that we don't use a context manager for CSV reading here. In version 1.2, pandas.read_csv was updated to act as a context manager when `chunksize` is passed, but this same version raised the minimum supported Python version to 3.7. As a result, pandas for Python 3.6 does not support the context manager (`with`) usage. Here, we always iterate through the `TextFileReader` object instead of using the context manager, an approach that works in all Python versions.

Finally, this commit changes the subsample seed argument type to `int` instead of string-or-int to match numpy's requirement for its random generator seed [2]. We do not pass a seed value to the numpy random generator immediately prior to Poisson sampling, or the generator will always sample the same values for a given mean (i.e., all queues will have the same size). Instead, we use the random seed to generate random priorities when none are provided by the user.

Fixes #424

[1] #699 (comment)
[2] https://numpy.org/doc/stable/reference/random/generator.html
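A minimal sketch of the chunked iteration and seeding behavior described above; the file name, chunk size, and seed here are illustrative:

```python
import numpy as np
import pandas as pd

# Iterating the TextFileReader directly works on every supported Python
# version; the `with pd.read_csv(...) as reader:` form requires
# pandas >= 1.2, which is unavailable for Python 3.6.
reader = pd.read_csv("metadata.tsv", sep="\t", index_col="strain",
                     chunksize=10000)

# Seed the generator once up front rather than immediately before each
# Poisson draw; re-seeding per draw would return the same values for a
# given mean, making all queues the same size.
random_generator = np.random.default_rng(seed=42)

for chunk in reader:
    # Use the seeded generator for random priorities when the user
    # provides none.
    priorities = dict(zip(chunk.index, random_generator.random(len(chunk))))
```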
Adds the ability to filter VCFs by a sequence index, to standardize the filtering interface for both FASTA and VCF inputs. This works by adding a new `index_vcf` function to the `index` module along with support for indexing VCFs with `augur index`. This function only reports the strain names in the given VCF. We then update the logic of the `filter` module to build a sequence index when the input is a VCF file and use this to filter the corresponding metadata. This approach allows us to filter VCF "sequences" without adding a special VCF-specific filter.
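For illustration, a simplified function in the spirit of `index_vcf` might pull strain names from the `#CHROM` header line of a VCF; augur's actual implementation and output format may differ:

```python
import gzip

def vcf_strain_names(vcf_path):
    # Sample (strain) names follow the nine fixed columns
    # (#CHROM through FORMAT) on the header line of a VCF.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#CHROM"):
                return line.rstrip("\n").split("\t")[9:]
    return []
```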
Allow calling code to specify its own valid index columns. This should enable using `read_metadata` to parse GISAID metadata, for example, or any other metadata with an identifier stored in a column other than Augur's defaults of "strain" and "name".
A quick attempt to support GISAID metadata prior to sanitizing, this adds a way to pass names of columns to use for an index (e.g., "strain", "name", or "Virus name").
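A sketch of what this interface could look like; the function name `read_metadata` comes from the commits above, while the delimiter handling and defaults here are simplified assumptions:

```python
import pandas as pd

def read_metadata(path, id_columns=("strain", "name")):
    # Index by the first of the given id columns found in the metadata.
    columns = pd.read_csv(path, sep="\t", nrows=0).columns
    for id_column in id_columns:
        if id_column in columns:
            return pd.read_csv(path, sep="\t", index_col=id_column)
    raise Exception(
        f"None of the possible id columns ({id_columns!r}) were found "
        f"in the metadata's columns {tuple(columns)!r}"
    )

# GISAID metadata stores its identifier in a "Virus name" column:
# read_metadata("gisaid_metadata.tsv", id_columns=("strain", "Virus name"))
```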
@tsibley This PR supersedes #749 but adds the remainder of the work required to fully reduce memory usage. I tried to group commits into logical chunks (rebasing), but a couple are unavoidably large. Nearly all commits could be squashed after review, but I left a couple unsquashed that might be contentious or at least worth discussing separately. It might be easiest to walk through this in person like we did with previous PRs, but if you prefer to scan it on your own that's cool, too. I'll also post some benchmarking results here later when I've finalized those.
Some thoughts from chatting with Tom today:
Skip building a sequence index when we know the user is excluding all strains. The only ways to add strains back in with this flag are the `--include` or `--include-where` options, so we know we don't need a sequence index to apply any additional filters.
When sequences are provided, we now filter by the strains present in those inputs even if users have not requested sequence outputs. This simplifies the decision about whether to build a sequence index to the case where a sequence index isn't provided but sequences are. We still handle the exception where the `--exclude-all` flag skips the sequence index building step.
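Condensed into code, the decision reads roughly like this hypothetical helper (names are illustrative):

```python
def needs_sequence_index(sequence_index, sequences, exclude_all):
    # Build an index only when sequences are given without a precomputed
    # index, and the user isn't excluding all strains anyway.
    return sequence_index is None and sequences is not None and not exclude_all
```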
Replace open-ended kwargs with specific keyword arguments for `read_metadata` and name these arguments (and the corresponding augur filter CLI arguments) to be less pandas-specific and closer to what we'd call these variables elsewhere in Augur (e.g., "chunksize" becomes "chunk_size"). As part of this renaming, this commit also fixes a bug in the dtype assignment for the pandas CSV reader by only specifying a dtype for the index column we end up using. This commit also adds a couple of simple doctests for `read_metadata` to clarify its expected behavior under common use cases.
Listing both the given id columns and the found metadata columns makes it easier for someone to debug directly from the error message. Switches from a KeyError to a plain Exception, since KeyError values should be just the key name itself. Includes the full error message in the doctest as an example of what to expect (and to make sure that our interpolations are working as intended).
It's worthwhile to test that exceptions are actually formatted as intended.
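For example, a doctest can pin down the exact interpolated message. This is a hypothetical helper; the message format is an assumption based on the description above:

```python
def get_index_column(id_columns, metadata_columns):
    """Return the first id column present in the metadata's columns.

    >>> get_index_column(("strain", "name"), ("country", "date"))
    Traceback (most recent call last):
      ...
    Exception: None of the possible id columns (('strain', 'name')) were found in the metadata's columns ('country', 'date')
    """
    for column in id_columns:
        if column in metadata_columns:
            return column
    raise Exception(
        f"None of the possible id columns ({id_columns!r}) were found "
        f"in the metadata's columns {tuple(metadata_columns)!r}"
    )
```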
New changes look better, I agree. I had one small comment about an error message, but instead of leaving a comment, I pushed a commit (0be8869) with my proposal. I also pushed a tangential change (562b0e4) that I noticed while doing the error message work.
I'll also note what I already noted in Slack: One thing that jumped out is that I'd have broken e975fef up into three commits to make each distinct change smaller. One tell I use when thinking about how to chunk up commits is if I'm tempted to use "also" when describing the changes. That commit uses "also" twice, to introduce two additional changes. :-)
We now filter by observed sequences by always creating a sequence index when sequence data are provided.
Description of proposed changes
This PR rewrites the logic for filtering records to avoid loading all metadata into memory at once. To achieve this goal, we filter metadata in one or two passes through the full data, iterating through fixed-size chunks so that only a bounded amount of data is in memory at any given time. In the best-case scenario (filtering only, or grouping by columns with a predefined number of sequences per group), we only need one pass through the metadata. During this pass, we can stream records that pass all filters to disk or store the highest-priority records in a fixed-size priority queue to be written later.
In the worst-case scenario (grouping by columns with a maximum number of total sequences), we need a first pass to calculate the number of sequences per group and a second pass to select the highest-priority sequences in each group.
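The fixed-size priority queue can be sketched as a bounded min-heap that keeps the N highest-priority records seen across all chunks; this is illustrative, and augur's actual queue implementation may differ:

```python
import heapq
import itertools

class PriorityQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self.counter = itertools.count()  # tie-breaker for equal priorities
        self.heap = []

    def add(self, item, priority):
        # Push the new record, then evict the lowest-priority record
        # whenever the queue exceeds its fixed size.
        heapq.heappush(self.heap, (priority, next(self.counter), item))
        if len(self.heap) > self.max_size:
            heapq.heappop(self.heap)

    def get_items(self):
        # Highest-priority records first.
        return [item for _, _, item in sorted(self.heap, reverse=True)]

queue = PriorityQueue(max_size=2)
for strain, priority in [("A", 0.1), ("B", 0.9), ("C", 0.5)]:
    queue.add(strain, priority)
print(queue.get_items())  # ['B', 'C']
```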
In addition to this major change in how the filter logic works, this PR also makes the following changes:
- Adds a new `--output-log` argument that accepts a tab-delimited file that will contain each filtered strain and its reason for being filtered (or force-included)
- Adds a new `--metadata-chunksize` argument to control the number of metadata records loaded into memory at a time. This argument allows users to control memory use by Augur to match their compute environment.
- Uses probabilistic sampling only when `--probabilistic-sampling` is `True` and the non-probabilistic number of sequences per group cannot be calculated.
- Adds support for indexing VCFs through `augur index` and from the Augur Python API with a new `index_vcf` function. These interfaces allow VCFs to be treated the same as FASTA files when filtering by sequence presence/absence.
- Writes the names of strains that pass all filters to the file given by `--output-strains`.

Related issue(s)
Fixes #699
Fixes #424
TODO

- Rename `include_by_query` to `include_by_include_where`

Maybe for later PRs:

- Move logic in `run` that could be in its own functions (e.g., setup grouping, write metadata, etc.)

Steps for later PRs:

- Replace `construct_filters` with factory functions that generate filters and integrate with argparse types

Testing