Define tests for sequence read/write interface #645
Comments
huddlej added a commit that referenced this issue on Dec 30, 2020
Adds tests for new `read_sequences` method based on proposed API [1]. Ignores BioPython and pycov warnings during unit tests to minimize output during test-driven development. Adds code to make most tests pass. [1] #645
huddlej added a commit that referenced this issue on Mar 10, 2021
Adds tests for new `read_sequences` method based on proposed API [1]. Ignores BioPython and pycov warnings during unit tests to minimize output during test-driven development. Adds code to make most tests pass. [1] #645
huddlej added a commit that referenced this issue on Mar 10, 2021
Adds tests and code for new `open_file`, `read_sequences`, and `write_sequences` functions loosely based on a proposed API [1]. These functions transparently handle compressed inputs and outputs using the xopen library.

The `open_file` function is a context manager that lightly wraps the `xopen` function and also supports either path strings or existing IO buffers. Both the read and write functions use this context manager to open files. This manager enables the common use case of writing to the same handle many times inside a for loop, by replacing the standard `open` call with `open_file`. Doing so, we maintain a Pythonic interface that also supports compressed file formats and path-or-buffer inputs. This context manager also enables input and output of any other file type in compressed formats (e.g., metadata, sequence indices, etc.).

Note that the `read_sequences` and `write_sequences` functions do not infer the format of sequence files (e.g., FASTA, GenBank, etc.). Inferring file formats requires peeking at the first record in each given input, but peeking is not supported by piped inputs that we want to support (e.g., piped gzip inputs from xopen). There are also no internal use cases for Augur to read multiple sequences of different formats, so I can't currently justify the complexity required to support type inference. Instead, I opted for the same approach used by BioPython where the calling code must know the type of input file being passed. This isn't an unreasonable expectation for Augur's internal code.

I also considered inferring file type by filename extensions like xopen infers compression modes. Filename extensions are less standardized across bioinformatics than we would like for this type of inference to work robustly.

Tests ignore BioPython and pycov warnings to minimize warning fatigue for issues we cannot address during test-driven development.

[1] #645
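The path-or-buffer behavior described in this commit message can be illustrated with a minimal sketch. This is not the code from the commit: to stay dependency-free it swaps in the standard library's `gzip` module for `xopen`, it only recognizes a `.gz` suffix, and the helper `demo_round_trip` is a hypothetical name used only for illustration.

```python
import gzip
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_file(path_or_buffer, mode="rt"):
    """Yield a text handle for a path, or pass an existing buffer through.

    Assumption: gzip stands in for xopen here, and compression is
    detected only from a ".gz" filename suffix.
    """
    if isinstance(path_or_buffer, (str, os.PathLike)):
        path = os.fspath(path_or_buffer)
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, mode) as handle:
            yield handle
    else:
        # The caller owns an existing buffer, so it is not closed here.
        yield path_or_buffer


def demo_round_trip(names):
    """Write names in a loop to a gzipped file, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "names.txt.gz")
        with open_file(path, "wt") as handle:
            for name in names:  # many writes to the same handle
                handle.write(name + "\n")
        with open_file(path) as handle:
            return [line.strip() for line in handle]


# An existing buffer is accepted in place of a path and stays open.
buffer = io.StringIO()
with open_file(buffer, "wt") as handle:
    handle.write("still open afterwards\n")
```

Calling code swaps `open(...)` for `open_file(...)` and otherwise keeps its loop unchanged, which is the use case the commit message highlights.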
Use cases
Only use cases 1-5 are addressed by the proposed interfaces below. Multiple sequence alignments and sequence annotations require different treatment since the former can require loading sequences into memory and the latter are a different type of data.
Existing functions
`SeqIO.parse`
`SeqIO.read`
`read_sequences`: augur/augur/align.py, lines 178 to 192 in efcc8be
`read_reference`: augur/augur/align.py, lines 222 to 231 in efcc8be
`read_alignment`: augur/augur/align.py, lines 201 to 205 in efcc8be
`load_alignments`: augur/augur/reconstruct_sequences.py, lines 50 to 55 in efcc8be
Ideal interface
`SeqIO.parse` is a nice generic interface for reading since it accepts a file handle and a given file type. Ideally, augur’s interface would abstract the work to open a file handle for the given filename (with or without compression) such that the remaining work of iterating through the handle can be passed to BioPython. Ideally, augur would also handle the logic of identifying the file type by its content instead of requiring users to specify the type (e.g., FASTA, GenBank, etc.).
Finally, at least one use case requires the ability to read through multiple files. The read interface should not be concerned with deduplication of those input sequences; that logic should be implemented elsewhere by a function that consumes the sequences iterator.
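The separation between reading and deduplication can be sketched as two generators. This is a hedged illustration, not the proposed API: it takes already-open text handles rather than filenames, it uses a small hand-written FASTA parser (`parse_fasta`) in place of `SeqIO.parse` to stay dependency-free, and `drop_duplicates` is a hypothetical consumer name.

```python
import io


def parse_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA text handle.

    Assumption: a stand-in for SeqIO.parse, handling plain FASTA only.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_sequences(*handles):
    """Lazily yield records from each handle in order, duplicates included."""
    for handle in handles:
        yield from parse_fasta(handle)


def drop_duplicates(records):
    """Separate consumer that keeps the first record seen for each name."""
    seen = set()
    for name, sequence in records:
        if name not in seen:
            seen.add(name)
            yield name, sequence


first = io.StringIO(">a\nACGT\n>b\nGG\nTT\n")
second = io.StringIO(">b\nCCCC\n>c\nAAAA\n")
records = list(drop_duplicates(read_sequences(first, second)))
```

Because `read_sequences` only yields records, any deduplication policy (first seen, last seen, raise an error) can be chosen by the consumer without touching the reader.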
The following code mocks up potential interfaces for reading and writing sequences with and without compression. The write interfaces only consider writing FASTA output, but we may want to add a `format` argument that allows us to write out other formats like GenBank.

The write interface options are more complicated because writing often happens inside a loop and involves calling the write function multiple times for the same output handle. I prefer the more restricted interface above that expects an iterable as input. This interface forces the calling code to prepare the input as an iterator and likely refactor code originally inside a loop into its own function.
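As a hedged illustration of the iterable-based write style, and not the mock-up from this issue, the sketch below writes FASTA only, represents records as plain `(name, sequence)` tuples, and writes to an already-open handle. The generator `filter_short` is a hypothetical example of logic that might otherwise sit inside a write loop.

```python
import io


def write_sequences(records, handle):
    """Write an iterable of (name, sequence) pairs as FASTA.

    Returns the number of records written. Assumption: FASTA only,
    with no `format` argument.
    """
    count = 0
    for name, sequence in records:
        handle.write(f">{name}\n{sequence}\n")
        count += 1
    return count


def filter_short(records, min_length):
    """Per-record logic refactored out of a write loop into a generator."""
    for name, sequence in records:
        if len(sequence) >= min_length:
            yield name, sequence


records = [("a", "ACGT"), ("b", "GG"), ("c", "AAAAA")]
output = io.StringIO()
written = write_sequences(filter_short(records, min_length=4), output)
```

The caller prepares an iterator once and makes a single write call, so the output handle is opened and closed in one place rather than threaded through a loop.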
We may want to consider implementing these new functions in a new `io` augur module and eventually migrating other I/O functions from the generic `utils` module to the more clearly defined `io` module.