Add read/write sequence interface with support for compressed sequences #652
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #652      +/-   ##
==========================================
+ Coverage   30.52%   31.57%   +1.05%
==========================================
  Files          40       41       +1
  Lines        5615     5779     +164
  Branches     1363     1436      +73
==========================================
+ Hits         1714     1825     +111
- Misses       3830     3848      +18
- Partials       71      106      +35
```
Continue to review full report at Codecov.
By request, I took a pass at reviewing your patches here. They're perhaps not the most insightful comments but here's what I have for today :)
Thank you for the review, @kairstenfay! Do you have any general thoughts on the read/write interfaces? The more I've thought about it, the more I don't like passing the file handle around to `write_sequences`.
I don't have a lot of context or experience using augur, so I hesitate to comment on the interfaces. However, I did think it was a bit odd to pass around the file handle arguments to the new functions.
This looks good to me, John. I installed this on our cluster. Builds on Monday will use it (and I'll pass in the raw gz files at the top). A bunch of ncov scripts would need adjusting, so I am not changing everything to compressed just yet.
Thank you for trying this out, @rneher. I'm interested to know what issues you encounter. :) From the engineering perspective, I'd like to make a quick attempt at implementing these two new functions as classes, as discussed with @kairstenfay above. Since my hope is to make this part of the minimal public-facing Python API that we've discussed, I want to make sure this interface is as good as it can be from the start. The last little lift to finishing this work is to update all augur commands that read or write sequences to use this new interface. I'd prefer not to make this change piecemeal, so users don't get different experiences with different commands.
I merged master into it on scicore and I am running the builds locally w/o problem. But I have only tried compressed files in filter.
Force-pushed from 94179a2 to 57ad120.
Adds tests and code for new `open_file`, `read_sequences`, and `write_sequences` functions loosely based on a proposed API [1]. These functions transparently handle compressed inputs and outputs using the xopen library.

The `open_file` function is a context manager that lightly wraps the `xopen` function and also supports either path strings or existing IO buffers. Both the read and write functions use this context manager to open files. This manager enables the common use case of writing to the same handle many times inside a for loop, by replacing the standard `open` call with `open_file`. Doing so, we maintain a Pythonic interface that also supports compressed file formats and path-or-buffer inputs. This context manager also enables input and output of any other file type in compressed formats (e.g., metadata, sequence indices, etc.).

Note that the `read_sequences` and `write_sequences` functions do not infer the format of sequence files (e.g., FASTA, GenBank, etc.). Inferring file formats requires peeking at the first record in each given input, but peeking is not supported by piped inputs that we want to support (e.g., piped gzip inputs from xopen). There are also no internal use cases for Augur to read multiple sequences of different formats, so I can't currently justify the complexity required to support type inference. Instead, I opted for the same approach used by BioPython where the calling code must know the type of input file being passed. This isn't an unreasonable expectation for Augur's internal code. I also considered inferring file type from filename extensions, like xopen infers compression modes, but filename extensions are less standardized across bioinformatics than we would like for this type of inference to work robustly.

Tests ignore BioPython and pycov warnings to minimize warning fatigue for issues we cannot address during test-driven development.

[1] #645
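Below is a minimal sketch of the context manager described above, assuming xopen's standard call signature and the path-or-buffer behavior from the commit message; the actual implementation in `io.py` may differ:

```python
import os
from contextlib import contextmanager

from xopen import xopen


@contextmanager
def open_file(path_or_buffer, mode="r", **kwargs):
    """Open a path with xopen, inferring compression from the filename
    extension, or pass an existing IO buffer through unchanged."""
    if isinstance(path_or_buffer, (str, os.PathLike)):
        with xopen(path_or_buffer, mode, **kwargs) as handle:
            yield handle
    else:
        # Assume an already-open buffer; the caller controls its lifetime.
        yield path_or_buffer


# Usage mirrors the standard `open` pattern, including the common case of
# writing many times to the same (possibly compressed) handle in a loop.
with open_file("records.txt.gz", "w") as handle:
    for record in ["a", "b", "c"]:
        handle.write(record + "\n")
```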
Adds support to augur index for compressed sequence inputs and index outputs.
Adds tests for augur parse and mask and then refactors these modules to use the new read/write interface.

For augur parse, the refactor moves logic from an original for loop into its own `parse_sequence` function, adds tests for this new function, and updates the body of the `run` function to use this function inside the for loop. This commit also replaces the Bio.SeqIO read and write functions with the new `read_sequences` and `write_sequences` functions. These functions support compressed input and output files based on the filename extensions.

For augur mask, the refactor moves logic for masking individual sequences into its own function and replaces Bio.SeqIO calls with the new `read_sequences` and `write_sequences` functions. The refactoring of the `mask_sequence` function allows us to easily define a generator for the output sequences to write and make a single call to `write_sequences`.
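A rough sketch of that generator pattern; the `mask_sequence` internals, filenames, and mask sites here are assumptions for illustration, not the PR's exact code:

```python
from Bio.Seq import Seq

from augur.io import read_sequences, write_sequences


def mask_sequence(sequence, mask_sites):
    """Return the record with the given 0-based sites replaced by N."""
    bases = list(str(sequence.seq))
    for site in mask_sites:
        bases[site] = "N"
    sequence.seq = Seq("".join(bases))
    return sequence


# Lazily mask each input record and write all results with a single call.
mask_sites = [0, 1, 2]  # hypothetical sites to mask
masked_sequences = (
    mask_sequence(sequence, mask_sites)
    for sequence in read_sequences("sequences.fasta.gz")
)
write_sequences(masked_sequences, "masked.fasta.xz")
```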
Force-pushed from 57ad120 to 071023d.
Documents which steps of a standard build support compressed inputs/outputs by adding a copy of the Zika build test and corresponding expected compressed inputs/outputs.
Thanks @huddlej -- I spent some time looking through this and like the interface. We have an existing `open_file` function in `utils.py`; what's the plan for that one? Running nCoV with (xz) compressed metadata worked 💯. This actually surprised me, as it means the metadata reading already handles compression. P.S. For future work, when we allow compressed inputs in more places, note that some of the relevant I/O happens inside TreeTime.
Thank you for looking through this, @jameshadfield!
The old `open_file` function in `utils.py` has some functionality that the new function does not.
As far as how we handle this older function, I'd prefer to replace all internal references to it with the new function now and then handle deprecation/migration of these types of I/O functions from `utils.py` later.
Yeah! We get this functionality "for free" by using pandas to read in the metadata. pandas has an interesting I/O library, too, where they effectively implement their own version of the xopen module (I would guess the pandas code predates the xopen module even though xopen was inspired by Heng Li's function of the same name from 2014!).
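For illustration: pandas' default `compression="infer"` picks the codec from the filename extension, so compressed metadata loads with no extra arguments (the filename here is hypothetical):

```python
import pandas as pd

# xz-, gz-, bz2-, or zip-compressed files all load the same way, because
# read_csv defaults to compression="infer" based on the extension.
metadata = pd.read_csv("metadata.tsv.xz", sep="\t")
```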
That's a good point. I'd vote to implement our own wrapper around any TreeTime functionality for this kind of I/O, so we can more easily replace/update the backend in the future.
It turns out xopen does not support passing arguments like …
Replaces calls to a similar `open_file` function from `utils.py` with the new function in `io.py`. Updates the functional tests for the mask module to confirm that compressed VCF inputs work with the old and new function alike.
Description of proposed changes
This PR adds a new `io.py` module with three new functions:

- `open_file`: a context manager that transparently supports reading/writing compressed inputs and outputs.
- `read_sequences`: a function that streams sequence records from one or more input files of the same format (FASTA, GenBank, etc.). Uses `open_file` to support compressed inputs.
- `write_sequences`: a function that streams sequence records to a single filename or file handle. Uses `open_file` to support compressed output.

This PR also refactors the following Augur subcommands to use these new functions:

- `parse`
- `index`
- `filter`
- `mask`
Python API example
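A minimal sketch of the API in use; the exact call signatures (and the default FASTA format) are assumptions based on the descriptions above:

```python
from augur.io import open_file, read_sequences, write_sequences

# Stream records from a gzip-compressed FASTA file. Compression is
# inferred from the filename extension; the sequence format is not.
sequences = read_sequences("sequences.fasta.gz")

# Stream the records back out to an LZMA-compressed file.
write_sequences(sequences, "sequences.fasta.xz")

# open_file also works for non-sequence files, e.g., a compressed index.
with open_file("sequence_index.tsv.gz", "w") as handle:
    handle.write("strain\tlength\n")
```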
Command line interface examples
The following command reads in H3N2 HA sequences from a gzip-compressed file and writes out the parsed sequences to an LZMA-compressed file.
The following command reads LZMA-compressed sequences from the previous command and writes out a gzip-compressed sequence index.
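For example (the filenames and parse fields are illustrative; the flags follow the existing `augur parse` and `augur index` interfaces):

```sh
# Parse gzip-compressed H3N2 HA sequences, writing LZMA-compressed
# sequences and a plain-text metadata table.
augur parse \
    --sequences h3n2_ha.fasta.gz \
    --output-sequences sequences.fasta.xz \
    --output-metadata metadata.tsv \
    --fields strain date region

# Read the LZMA-compressed sequences from the previous command and
# write a gzip-compressed sequence index.
augur index \
    --sequences sequences.fasta.xz \
    --output sequence_index.tsv.gz
```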
Related issue(s)
See the ZenHub Epic for a list of all related issues.
Fixes #644
See #637 for details about how Augur reads and writes sequences.
See #645 for the original proposed interface for the new read/write functions.
Testing
This PR adds functional and unit tests for all new and refactored code.