Reduce metadata memory usage in augur filter #699
That sounds like a great plan. I'm happy to see how widespread
After some more reflection, the main issue with reading all metadata into memory occurs when users apply grouping and subsampling options. Given this main issue, the scope of the current issue can be reduced to improving the memory usage of metadata in augur filter.
Given the new scope of the issue, though, we do not necessarily need to implement a new interface for accessing metadata across augur; we only need to support more efficient processing in augur filter. I think we can accomplish this goal by refactoring augur filter to iterate over chunks of metadata records (N=1 or more records per chunk). This iteration needs to account for three major use cases: filtering without grouping, grouping with random subsampling, and grouping with priority-based subsampling.
Several technical solutions could work:

- Write valid strains and their priorities out to a temporary file for each group and process these files after the first pass through the metadata. This solution can effectively require two passes through the metadata if filters do not remove many records.
- Write records to a database (e.g., sqlite) indexed by group and strain and, in the second pass, select the top records from each group ordered by priority.
- Maintain a fixed-size min heap per group in memory such that each newly accepted strain is added to the corresponding heap if it has a higher priority than any previously accepted strain. After the first pass, each heap will contain no more than N (highest priority or random) strains per group, and augur filter can emit these records to the requested outputs.
One possible pseudo-code version of the proposed algorithm looks like this:
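As a rough sketch of what that chunked, heap-based first pass could look like (the function, argument, and column names here are illustrative assumptions, not actual augur internals):

```python
import heapq
import itertools

import pandas as pd


def first_pass(metadata_path, group_by, max_per_group, priorities, chunk_size=10000):
    """Stream metadata in chunks, apply filters, and keep at most `max_per_group`
    strains per group in a fixed-size min heap keyed by priority.
    Illustrative sketch only; names do not match actual augur internals.
    """
    heaps = {}                      # group key -> min heap of (priority, tiebreaker, strain)
    tiebreaker = itertools.count()  # avoids comparing strain names when priorities tie

    for chunk in pd.read_csv(metadata_path, sep="\t", chunksize=chunk_size):
        # Placeholder filter: drop records that are missing a date.
        chunk = chunk.dropna(subset=["date"])

        for record in chunk.itertuples():
            group = tuple(getattr(record, column) for column in group_by)
            entry = (priorities.get(record.strain, 0.0), next(tiebreaker), record.strain)

            heap = heaps.setdefault(group, [])
            if len(heap) < max_per_group:
                heapq.heappush(heap, entry)
            else:
                # Replace the lowest-priority strain when this one ranks higher.
                heapq.heappushpop(heap, entry)

    # Each heap now holds at most the top `max_per_group` strains per group.
    return {
        group: [strain for _, _, strain in sorted(heap, reverse=True)]
        for group, heap in heaps.items()
    }
```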
Implicit in the pseudo-code implementation above is that the current multiple lines of filter logic in filter.py would be refactored into their own functions. Also implicit is that the probabilistic subsampling logic could be refactored into its own function and potentially removed from the user interface (i.e., we could only use probabilistic sampling when it is necessary).
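If that probabilistic subsampling logic were pulled out on its own, it might look roughly like this (a sketch under assumed names, not the actual augur function):

```python
import numpy as np


def probabilistic_group_sizes(groups, target_total, rng=None):
    """Illustrative sketch: draw a Poisson-distributed number of sequences per
    group so that the expected total roughly matches the requested number."""
    rng = rng or np.random.default_rng()
    mean_per_group = target_total / len(groups)
    return dict(zip(groups, rng.poisson(mean_per_group, size=len(groups))))


# Example usage with hypothetical group names:
# probabilistic_group_sizes(["europe", "asia", "africa"], target_total=100)
```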
We can split the bigger issue of memory used by augur filter into the following smaller tasks:
@huddlej Overall impression is that this plan sounds excellent to me. 👍 for focusing the scope here and breaking it down into incremental steps. In terms of implementation, I think I'm partial to the in-memory heap or on-disk SQLite approach. Not sure which of the two, but I would be happy to discuss more in person!
This is nearly at the point described by the corresponding pseudo-code [1], but it is still broken. The major remaining work includes the second pass through the metadata, processing of the heaps, and updating the filter reporting to use more programmatic output.

[1] #699 (comment)
Instead of loading all metadata into memory at once, iterate through fixed-size chunks of data, applying filters to these data and either grouping data into heaps or streaming output to disk, as needed. This approach generally follows the original pseudo-code solution [1].

One side effect of this implementation is that it allows us to log the reason why each strain was filtered or force-included in a new `--output-log` argument. One of the output columns of the log file is a kwargs column that tracks the arguments passed to a given filter. This column is structured text in JSON format, which allows for more sophisticated reporting by specific keys. Along with this change, we apply the include/exclude logic from files on a per-file basis so we can track which specific file was responsible for including or filtering each strain.

Fixes #424

[1] #699 (comment)
Instead of loading all metadata into memory at once, iterate through fixed-size chunks of data, applying filters to these data and either grouping data into priority queues or streaming output to disk, as needed. This approach generally follows the original pseudo-code solution [1]. Users can override the default chunk size with the new `--metadata-chunksize` argument to tune the amount of memory used by a given execution of the filter command. A larger chunk size uses more memory but may also run slightly faster.

One side effect of this implementation is that it allows us to log the reason why each strain was filtered or force-included in a new `--output-log` argument. One of the output columns of the log file is a kwargs column that tracks the arguments passed to a given filter. This column is structured text in JSON format, which allows for more sophisticated reporting by specific keys. Along with this change, we apply the include/exclude logic from files on a per-file basis so we can track which specific file was responsible for including or filtering each strain.

Note that we don't use a context manager for CSV reading here. In version 1.2, pandas.read_csv was updated to act as a context manager when `chunksize` is passed, but that same version increased the minimum supported Python version to 3.7. As a result, pandas for Python 3.6 does not support the context manager (`with`) usage. Here, we always iterate through the `TextFileReader` object instead of using the context manager, an approach that works in all supported Python versions.

Finally, this commit changes the subsample seed argument type to an `int` instead of a string or int, to match numpy's requirement for its random generator seed [2]. We do not pass a seed value to the numpy random generator immediately prior to Poisson sampling, or the generator would always sample the same values for a given mean (i.e., all queues would have the same size). Instead, we use the random seed for generating random priorities when none are provided by the user.

Fixes #424

[1] #699 (comment)
[2] https://numpy.org/doc/stable/reference/random/generator.html
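A minimal sketch of the reading and seeding pattern described here (the file name, chunk size, and column handling are placeholder assumptions, not the actual augur code):

```python
import numpy as np
import pandas as pd

# Read metadata in fixed-size chunks. pandas returns a TextFileReader that we
# iterate over directly instead of using it as a context manager, so the same
# code also works with pandas builds that still support Python 3.6.
metadata_reader = pd.read_csv("metadata.tsv", sep="\t", chunksize=10_000)

# Seed the generator once with an int. Reusing this single generator for all
# random draws avoids producing identical values from repeated reseeding.
rng = np.random.default_rng(42)

total_records = 0
for chunk in metadata_reader:
    total_records += len(chunk)
    # Example: assign random priorities for this chunk when the user has not
    # provided a priorities file.
    chunk = chunk.assign(priority=rng.random(len(chunk)))

print(f"Processed {total_records} metadata records in chunks.")
```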
Current Behavior
While profiling memory usage for Nextstrain's SARS-CoV-2 workflow, I was surprised to find that `augur filter` rules consumed >2 GB of memory per job. Through memory profiling with the memory_profiler package, I found that the majority of this memory usage was caused by reading in the ~165 MB metadata input with the `read_metadata` function in `utils.py`. This puts the in-memory size of metadata at ~13 times the on-disk size.

In comparison, workflow scripts that read the same metadata file directly into a pandas DataFrame (e.g., `scripts/assign-colors.py`) only consumed ~400 MB of memory per job, or 2-3 times the on-disk size of the data. These results indicate that the pandas DataFrame representation is considerably more memory efficient than the alternate Python `dict` representation returned by the `read_metadata` function.

Expected behavior
Augur should not use as much memory to represent metadata as it does, especially when a more memory-efficient representation (a pandas DataFrame) is readily available.
How to reproduce
Run `augur filter` with metadata from a recent ncov workflow run.

Possible solution
Since Augur already loads metadata into a DataFrame before finally converting it to a `dict`, my recommendation is to always use a DataFrame to represent metadata internally. This solution would cause a breaking change to the current API, such that all consumers of the `read_metadata` function would need to be updated to expect a different data structure.

However, this data structure provides a more flexible interface (most of the operations we apply in `filter.py` could be rewritten as simple DataFrame operations) and more potential to scale with larger datasets in the future through more efficient data types, chunking, or replacement with an alternate framework like Dask that implements a similar API.
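As a rough illustration of how the DataFrame route opens up those options (the column names and dtypes below are assumptions for ncov-style metadata, not the actual augur implementation):

```python
import pandas as pd

# Read metadata straight into a DataFrame with memory-friendly dtypes;
# low-cardinality columns such as geography fields compress well as categories.
metadata = pd.read_csv(
    "metadata.tsv",
    sep="\t",
    dtype={"region": "category", "country": "category", "division": "category"},
    index_col="strain",
)

# Compare the in-memory footprint against the ~165 MB on-disk size.
print(metadata.memory_usage(deep=True).sum() / 1e6, "MB in memory")
```

The same `read_csv` call also accepts a `chunksize` argument when even one DataFrame at a time is preferable to holding the full table in memory.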