[filter] Index input sequences and stream output sequences #627

huddlej · 2020-11-05T22:27:52Z

Description of proposed changes

Reduces the memory needed by augur filter to a smaller constant value.
Previously, all input sequences were loaded into memory once and then
again when writing sequences out to disk. The primary improvements here are:

1. Use a BioPython index data structure [1] that tracks where each
   sequence is on disk but does not load sequences into memory. This
   structure acts the same as the dictionary structure we originally used
   except sequences are loaded lazily when they are requested.
2. Use an iterator to write sequences back to disk. BioPython's SeqIO
   write method accepts an iterator that allows us to stream sequences back
   to disk without first loading them into memory as a list of sequence
   objects.
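
For illustration only, here is a standard-library sketch of the idea behind both changes. It is not the code in this PR and not BioPython's implementation; the function names (`build_offset_index`, `read_record`, `iter_records`, `write_records`) are hypothetical. It assumes a plain, uncompressed FASTA file and records one byte offset per sequence name, then streams the selected records to the output through a generator so that only one record is held in memory at a time.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each FASTA record name to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The name is the first whitespace-delimited token after ">".
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
    return index


def read_record(handle, offset):
    """Read the raw bytes of one record starting at an indexed offset."""
    handle.seek(offset)
    lines = [handle.readline()]  # header line
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break  # end of file or start of the next record
        lines.append(line)
    return b"".join(lines)


def iter_records(path, index, keep):
    """Lazily yield raw records for the names in `keep`, one at a time."""
    with open(path, "rb") as handle:
        for name in keep:
            if name in index:
                yield read_record(handle, index[name])


def write_records(records, output_path):
    """Stream records from any iterable to disk; return how many were written."""
    count = 0
    with open(output_path, "wb") as out:
        for record in records:
            out.write(record)
            if not record.endswith(b"\n"):
                out.write(b"\n")
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "sequences.fasta")
        filtered = os.path.join(tmp, "filtered.fasta")
        with open(source, "wb") as handle:
            handle.write(b">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n")

        index = build_offset_index(source)  # names and offsets only
        kept = write_records(iter_records(source, index, ["seq3", "seq1"]), filtered)
        print(kept, "sequences written")
```

In BioPython terms, `SeqIO.index` plays the role of the offset index and a generator passed to `SeqIO.write` plays the role of `iter_records`.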

[1] http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec66

Testing

This code has been tested with a recent ncov dataset to confirm that the same number of sequences is output for a given input and set of filters. It also passes all functional and unit tests.

codecov · 2020-11-05T22:32:59Z

Codecov Report

Merging #627 into master will increase coverage by 0.02%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master     #627      +/-   ##
==========================================
+ Coverage   28.53%   28.56%   +0.02%     
==========================================
  Files          39       39              
  Lines        5365     5367       +2     
  Branches     1319     1320       +1     
==========================================
+ Hits         1531     1533       +2     
  Misses       3778     3778              
  Partials       56       56

| Impacted Files | Coverage Δ | |
|---|---|---|
| augur/filter.py | `45.16% <83.33%> (+0.35%)` | ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a712db...9febe77. Read the comment docs.

rneher

I was initially worried that this would come at a big performance cost, but this doesn't seem to be the case.

huddlej · 2020-11-06T16:39:31Z

Thank you for testing this out, @rneher! I'll merge now.

huddlej requested review from emmahodcroft, rneher and jameshadfield November 5, 2020 22:32

huddlej added this to the Next release 10.x.x milestone Nov 5, 2020

rneher approved these changes Nov 6, 2020

huddlej merged commit 5c1aacc into master Nov 6, 2020

huddlej deleted the stream-filter-seqs branch November 6, 2020 16:39
