Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add read/write sequence interface with support for compressed sequences #652

Merged
merged 7 commits into from
Mar 18, 2021

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Dec 31, 2020

Description of proposed changes

This PR adds a new io.py module with three new functions:

  • open_file: a context manager that transparently supports reading/writing:
    • compressed files using xopen to infer compression from the filename
    • file paths or file handles
  • read_sequences: a function that streams sequence records from one or more input files of the same format (FASTA, GenBank, etc.). Uses open_file to support compressed inputs.
  • write_sequences: a function that streams sequence records to a single filename or file handle. Uses open_file to support compressed output.

This PR also refactors the following Augur subcommands to use these new functions:

  • parse
  • index
  • filter
  • mask

Python API example

# Read sequences from multiple files with a generator.
sequences = read_sequences(*args.sequences)

# Open a file to write to. If `args.output` ends with
# ".gz", for example, its contents will compressed.
observed_sequence_strains = set()
with open_file(args.output, "wt") as output_handle:
    for sequence in sequences:
        # Track all the strains we've written.
        observed_sequence_strains.add(sequence.id)

        # Write one record at a time to the handle.
        write_sequences(sequence, output_handle)

Command line interface examples

The following command reads in H3N2 HA sequences from a gzip-compressed file and writes out the parsed sequences to an LZMA-compressed file.

augur parse \
  --sequences h3n2_ha.fasta.gz \
  --output-sequences sequences.fasta.xz \
  --output-metadata metadata.tsv \
  --fields strain virus accession date region country division location passage originating_lab submitting_lab age gender

The following command reads LZMA-compressed sequences from the previous command and writes out a gzip-compressed sequence index.

augur index \
    --sequences sequences.fasta.xz \
    --output sequence_index.tsv.gz

Related issue(s)

See the ZenHub Epic for a list of all related issues.
Fixes #644
See #637 for details about how Augur reads and write sequences.
See #645 for the original proposed interface for the new read/write functions.

Testing

This PR adds functional and unit tests for all new and refactored code.

@codecov
Copy link

codecov bot commented Dec 31, 2020

Codecov Report

Merging #652 (bbf963f) into master (8df4b4d) will increase coverage by 1.05%.
The diff coverage is 77.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #652      +/-   ##
==========================================
+ Coverage   30.52%   31.57%   +1.05%     
==========================================
  Files          40       41       +1     
  Lines        5615     5779     +164     
  Branches     1363     1436      +73     
==========================================
+ Hits         1714     1825     +111     
- Misses       3830     3848      +18     
- Partials       71      106      +35     
Impacted Files Coverage Δ
augur/index.py 80.70% <33.33%> (-1.45%) ⬇️
augur/parse.py 62.50% <47.82%> (+12.50%) ⬆️
augur/filter.py 48.78% <100.00%> (+2.93%) ⬆️
augur/io.py 100.00% <100.00%> (ø)
augur/mask.py 100.00% <100.00%> (ø)
augur/utils.py 37.76% <0.00%> (-0.61%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8df4b4d...bbf963f. Read the comment docs.

@huddlej huddlej requested a review from tsibley December 31, 2020 00:52
@huddlej huddlej assigned jameshadfield and rneher and unassigned jameshadfield and rneher Dec 31, 2020
@huddlej huddlej marked this pull request as ready for review December 31, 2020 00:55
@huddlej huddlej self-assigned this Jan 16, 2021
Copy link
Contributor

@kairstenfay kairstenfay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By request, I took a pass at reviewing your patches here. They're perhaps not the most insightful comments but here's what I have for today :)

augur/io.py Outdated Show resolved Hide resolved
augur/mask.py Show resolved Hide resolved
augur/mask.py Show resolved Hide resolved
augur/parse.py Show resolved Hide resolved
tests/test_io.py Outdated Show resolved Hide resolved
@huddlej
Copy link
Contributor Author

huddlej commented Jan 21, 2021

Thank you for the review, @kairstenfay! Do you have any general thoughts on the read/write interfaces? The more I've thought about it, the more I don't like passing the file handle around to write_sequences. These two functions should probably be classes like a SequenceReader and SequenceWriter. This implementation would allow the latter class to track the handle as an attribute. It could also allow the sequence reader to provide fancier functions like streaming out unique sequences, etc.

@kairstenfay
Copy link
Contributor

Thank you for the review, @kairstenfay! Do you have any general thoughts on the read/write interfaces? The more I've thought about it, the more I don't like passing the file handle around to write_sequences. These two functions should probably be classes like a SequenceReader and SequenceWriter. This implementation would allow the latter class to track the handle as an attribute. It could also allow the sequence reader to provide fancier functions like streaming out unique sequences, etc.

I don't have a lot of context or experience using augur, so I hesitate to comment on the interfaces. However, I did think it was a bit odd to pass around the file handler arguments to the write_sequences function. With my superficial understanding, creating SequenceReader & SequenceWriter classes sounds like a good approach to me.

@rneher
Copy link
Member

rneher commented Jan 30, 2021

This looks good to me John. I installed this on our cluster. Builds on Monday will use it (and I'll pass in the raw gz files at the top). A bunch of ncov scripts would need adjusting, so I am not changing everything to compressed just yet.

@huddlej
Copy link
Contributor Author

huddlej commented Feb 1, 2021

Thank you for trying this out, @rneher. I'm interested to know what issues you encounter. :)

From the engineering perspective, I'd to make a quick attempt at implementing these two new functions as classes, as discussed with @kairstenfay above. Since my hope is to make this part of the minimal public-facing Python API that we've discussed, I want to make sure this interface is as good as it can be from the start.

The last little lift to finishing this work is to update all augur commands that read or write sequences to use this new interface. I'd prefer not to make this change piece-meal, so users don't get different experiences with different commands.

@rneher
Copy link
Member

rneher commented Feb 2, 2021

I merged master into it on scicore and I am running the builds locally w/o problem. But I have only tried compressed files in filter.

@huddlej huddlej force-pushed the add-sequence-interface branch from 94179a2 to 57ad120 Compare March 10, 2021 07:05
huddlej added 4 commits March 10, 2021 10:38
Adds tests and code for new `open_file`, `read_sequences`, and
`write_sequences` functions loosely based on a proposed API [1]. These
functions transparently handle compressed inputs and outputs using the
xopen library.

The `open_file` function is a context manager that lightly wraps the
`xopen` function and also supports either path strings or existing IO
buffers. Both the read and write functions use this context manager to
open files. This manager enables the common use case of writing to the
same handle many times inside a for loop, by replacing the standard
`open` call with `open_file`. Doing so, we maintain a Pythonic interface
that also supports compressed file formats and path-or-buffer inputs.
This context manager also enables input and output of any other file
type in compressed formats (e.g., metadata, sequence indices, etc.).

Note that the `read_sequences` and `write_sequences` functions do not
infer the format of sequence files (e.g., FASTA, GenBank, etc.).
Inferring file formats requires peeking at the first record in each
given input, but peeking is not supported by piped inputs that we want
to support (e.g., piped gzip inputs from xopen). There are also no
internal use cases for Augur to read multiple sequences of different
formats, so I can't currently justify the complexity required to support
type inference. Instead, I opted for the same approach used by BioPython
where the calling code must know the type of input file being passed.
This isn't an unreasonable expectation for Augur's internal code. I also
considered inferring file type by filename extensions like xopen infers
compression modes. Filename extensions are less standardized across
bioinformatics than we would like for this type of inference to work
robustly.

Tests ignore BioPython and pycov warnings to minimize warning fatigue
for issues we cannot address during test-driven development.

[1] #645
Adds support to augur index for compressed sequence inputs and index
outputs.
Adds tests for augur parse and mask and then refactors these modules to
use the new read/write interface.

For augur parse, the refactor moves from an original for loop into its
own `parse_sequence` function, adds tests for this new function, and
updates the body of the `run` function to use this function inside the
for loop. This commit also replaces the Bio.SeqIO read and write
functions with the new `read_sequences` and `write_sequences` functions.
These functions support compressed input and output files based on the
filename extensions.

For augur mask, the refactor moves logic for masking individual
sequences into its own function and replaces Bio.SeqIO calls with new
`read_sequences` and `write_sequences` functions. The refactoring of the
`mask_sequence` function allows us to easily define a generator for the
output sequences to write and make a single call to `write_sequences`.
@huddlej huddlej force-pushed the add-sequence-interface branch from 57ad120 to 071023d Compare March 10, 2021 18:52
huddlej added 2 commits March 15, 2021 16:54
Documents which steps of a standard build support compressed
inputs/outputs by adding a copy of the Zika build test and corresponding
expected compressed inputs/outputs.
@jameshadfield
Copy link
Member

jameshadfield commented Mar 18, 2021

Thanks @huddlej -- I spent some time looking through this and like the interface.

We have an open_file function in utils which we no longer use, should we add a deprecation warning to this?

Running nCoV with (xz) compressed metadata worked 💯. This actually surprised me as it means read_metadata handles xz files already!


P.S. for future work when we allow augur tree to take compressed input, be aware that VCF inputs, which need a fasta reference, use the TreeTime helper function read_vcf rather than our own helper functions.

@huddlej
Copy link
Contributor Author

huddlej commented Mar 18, 2021

Thank you for looking through this, @jameshadfield!

We have an open_file function in utils which we no longer use, should we add a deprecation warning to this?

The old open_file function was added last March as part of a mask.py refactor and it has only ever been used in that module. We still use this older function in mask.py to open VCF files.

The old function has some functionality that the new function does not including:

  1. a check for opening gzipped files in "text" mode. The xopen package implements the "text" mode check already, so we don't need to implement that anymore.
  2. an explicit request for UTF-8 encoding. Based on @tsibley's excellent commit message when this UTF-8 encoding was added, we should continue to use this as an explicit default and still allow the calling code to override the default.

As far as how we handle this older function, I'd prefer to replace all internal references to that function with the new function now and then handle deprecation/migration of these types of I/O functions from utils.py to io.py in a later PR.

Running nCoV with (xz) compressed metadata worked 💯. This actually surprised me as it means read_metadata handles xz files already!

Yeah! We get this functionality "for free" by using pandas to read in the metadata. pandas has an interesting I/O library, too, where they effectively implement their own version of the xopen module (I would guess the pandas code predates the xopen module even though xopen was inspired by Heng Li's function of the same name from 2014!).

for future work when we allow augur tree to take compressed input, be aware that VCF inputs, which need a fasta reference, use the TreeTime helper function read_vcf rather than our own helper functions.

That's a good point. I'd vote to implement our own wrapper around any TreeTime functionality for this kind of I/O, so we can more easily replace/update the backend in the future.

@huddlej
Copy link
Contributor Author

huddlej commented Mar 18, 2021

It turns out xopen does not support passing arguments like encoding through to the internal open function call. This may not be an issue for sequence data, since BioPython appears to do its own string decoding, but it could be an issue for other data types. I'm not sure if this is a deal-breaker or not...

Replaces calls to a similar `open_file` function from `utils.py` with
the new function in `io.py`. Updates the functional tests for the mask
module to confirm that compressed VCF inputs work with the old and new
function alike.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support compressed sequences
4 participants