[filter] Index input sequences and stream output sequences #627
Description of proposed changes
Reduces the memory needed by augur filter to a small, roughly constant
amount. Previously, all input sequences were loaded into memory once when
reading and then again when writing sequences out to disk. The primary
improvements here are:
- Use a BioPython index data structure [1] that tracks where each
  sequence is on disk but does not load sequences into memory. This
  structure acts the same as the dictionary structure we originally used
  except sequences are loaded lazily when they are requested.
- Use an iterator to write sequences back to disk. BioPython's SeqIO
  write method accepts an iterator that allows us to stream sequences back
  to disk without first loading them into memory as a list of sequence
  objects.
[1] http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec66
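
To illustrate the two ideas above, here is a minimal standard-library sketch of an on-disk offset index combined with a generator that streams records to the output file. It is not the code from this PR and not BioPython's implementation: in the PR these roles are played by BioPython's index structure and SeqIO write method, and the function names below (`build_offset_index`, `read_record`, `iter_records`, `write_filtered`) are hypothetical, chosen only for this example. It assumes a plain, uncompressed FASTA file.

```python
import os
import tempfile


def build_offset_index(fasta_path):
    """Map each sequence ID to the byte offset of its '>' header line.

    Only IDs and offsets are kept in memory, never the sequences themselves.
    """
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
    return index


def read_record(handle, offset):
    """Return the raw bytes of the single record that starts at `offset`."""
    handle.seek(offset)
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    record = b"".join(lines)
    # Guard against a final record that lacks a trailing newline.
    return record if record.endswith(b"\n") else record + b"\n"


def iter_records(fasta_path, index, keep_ids):
    """Lazily yield one record at a time for the requested IDs."""
    with open(fasta_path, "rb") as handle:
        for seq_id in keep_ids:
            yield read_record(handle, index[seq_id])


def write_filtered(fasta_path, output_path, keep_ids):
    """Stream the selected records to disk; return how many were written."""
    index = build_offset_index(fasta_path)
    written = 0
    with open(output_path, "wb") as out:
        for record in iter_records(fasta_path, index, keep_ids):
            out.write(record)
            written += 1
    return written


# Small demonstration on a temporary three-record FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "in.fasta")
    target = os.path.join(tmp, "out.fasta")
    with open(source, "wb") as fh:
        fh.write(b">a\nACGT\nAC\n>b desc\nGGGG\n>c\nTTTT\n")
    count = write_filtered(source, target, ["c", "a"])
    with open(target, "rb") as fh:
        result = fh.read()

print(count)
print(result.decode())
```

At any moment only the ID-to-offset dictionary and a single record are held in memory, so peak usage depends on the number of IDs and the longest record rather than on the total size of the input.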
Testing
This code has been tested with a recent ncov dataset to confirm that the same number of sequences is output for a given input and set of filters. It also passes all functional and unit tests.