
Support compressed alignments #696

Draft · wants to merge 7 commits into base: master
Conversation

@huddlej (Contributor) commented Mar 16, 2021

Description of proposed changes

Adds support for compressed inputs (reference files and alignment sequences) in augur align by refactoring existing code to use Augur's io module.

This is a work in progress that builds on #652 and still requires focused work to add support for compressed output files.

Testing

Adds functional tests to the Zika compressed build in tests/builds/zika_compressed.t.

huddlej added 7 commits March 10, 2021 10:38
Adds tests and code for new `open_file`, `read_sequences`, and
`write_sequences` functions loosely based on a proposed API [1]. These
functions transparently handle compressed inputs and outputs using the
xopen library.

The `open_file` function is a context manager that lightly wraps the
`xopen` function and also supports either path strings or existing IO
buffers. Both the read and write functions use this context manager to
open files. This manager enables the common use case of writing to the
same handle many times inside a for loop, by replacing the standard
`open` call with `open_file`. Doing so, we maintain a Pythonic interface
that also supports compressed file formats and path-or-buffer inputs.
This context manager also enables input and output of any other file
type in compressed formats (e.g., metadata, sequence indices, etc.).
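The path-or-buffer behavior described above can be sketched with a stdlib-only context manager. This is a simplified stand-in, not the actual Augur implementation: it uses gzip/lzma extension checks in place of the xopen dependency, which additionally handles other compression formats and piped subprocess decompression.

```python
import gzip
import lzma
from contextlib import contextmanager
from io import IOBase

@contextmanager
def open_file(path_or_buffer, mode="rt", **kwargs):
    """Open a path, transparently decompressing by extension, or pass
    an existing IO buffer through unchanged.

    Stdlib-only sketch of the idea; the real code delegates to xopen.
    """
    if isinstance(path_or_buffer, IOBase):
        # The caller owns the buffer, so do not close it here.
        yield path_or_buffer
        return

    path = str(path_or_buffer)
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".xz"):
        opener = lzma.open
    else:
        opener = open

    with opener(path, mode, **kwargs) as handle:
        yield handle
```

In a loop, replacing a standard `open(...)` call with `open_file(...)` keeps one handle open across many writes, whether the target is a plain file, a compressed file, or an existing buffer.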

Note that the `read_sequences` and `write_sequences` functions do not
infer the format of sequence files (e.g., FASTA, GenBank, etc.).
Inferring file formats requires peeking at the first record in each
given input, but peeking is not possible on the piped inputs we want
to support (e.g., piped gzip inputs from xopen). There are also no
internal use cases for Augur to read multiple sequences of different
formats, so I can't currently justify the complexity required to support
type inference. Instead, I opted for the same approach used by BioPython
where the calling code must know the type of input file being passed.
This isn't an unreasonable expectation for Augur's internal code. I also
considered inferring file type by filename extensions like xopen infers
compression modes. Filename extensions are less standardized across
bioinformatics than we would like for this type of inference to work
robustly.
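A minimal FASTA-only reader illustrates why explicit formats sidestep the peeking problem: the parser below consumes its input strictly forward, so it works on a piped, decompressed stream. This is a sketch only; Augur's functions delegate to Bio.SeqIO with an explicit format argument.

```python
def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle.

    Never peeks ahead or seeks, so it works on forward-only piped
    input; the caller must already know the input is FASTA, mirroring
    the explicit-format design described above.
    """
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```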

Tests ignore BioPython and pycov warnings to minimize warning fatigue
for issues we cannot address during test-driven development.

[1] #645
Adds support to augur index for compressed sequence inputs and index
outputs.
Adds tests for augur parse and mask and then refactors these modules to
use the new read/write interface.

For augur parse, the refactor moves from an original for loop into its
own `parse_sequence` function, adds tests for this new function, and
updates the body of the `run` function to use this function inside the
for loop. This commit also replaces the Bio.SeqIO read and write
functions with the new `read_sequences` and `write_sequences` functions.
These functions support compressed input and output files based on the
filename extensions.
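The shape of that refactor might look like the following. This `parse_sequence` is hypothetical and simplified, operating on plain strings; the real function works on Bio.SeqRecord objects and applies augur parse's actual field rules.

```python
def parse_sequence(name, sequence, fields, separator="|"):
    """Split a delimited record name into named metadata fields and
    return (strain, sequence, metadata) for the record.

    Illustrative sketch of pulling per-record logic out of the run()
    loop so it can be unit tested in isolation.
    """
    values = name.split(separator)
    metadata = dict(zip(fields, values))
    strain = metadata.get("strain", values[0])
    return strain, sequence, metadata
```

Extracting the per-record body into a pure function like this is what lets the tests exercise parsing without touching files at all.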

For augur mask, the refactor moves logic for masking individual
sequences into its own function and replaces Bio.SeqIO calls with new
`read_sequences` and `write_sequences` functions. The refactoring of the
`mask_sequence` function allows us to easily define a generator for the
output sequences to write and make a single call to `write_sequences`.
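The generator-based pattern might look roughly like this; the names and signatures are illustrative, not the actual augur mask code.

```python
def mask_sequence(sequence, mask_sites, mask_char="N"):
    """Return a copy of the sequence with the given 0-based sites masked."""
    chars = list(sequence)
    for site in mask_sites:
        if 0 <= site < len(chars):
            chars[site] = mask_char
    return "".join(chars)

def masked(records, mask_sites):
    """Lazily mask (name, sequence) records, so a single write call can
    consume the generator without holding all sequences in memory."""
    for name, sequence in records:
        yield name, mask_sequence(sequence, mask_sites)
```

A single downstream call can then consume `masked(...)` directly, which is the "define a generator, write once" shape described above.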
Documents which steps of a standard build support compressed
inputs/outputs by adding a copy of the Zika build test and corresponding
expected compressed inputs/outputs.
Adds support for compressed inputs (reference files and alignment
sequences) in augur align by refactoring existing code to use Augur's
`io` module.

This is a work in progress and still requires focused work to add
support for compressed output files.
@huddlej huddlej self-assigned this Mar 19, 2021
@tsibley (Member) left a comment
Read the code, left a few notes that might (or might not) be helpful things to think more about.

Rebasing this branch onto the latest state of master would reduce the PR to just the head commit, which is all I looked at since the others are merged already. GitHub's review interface definitely doesn't make this clear, though.

@@ -58,11 +63,12 @@ def prepare(sequences, existing_aln_fname, output, ref_name, ref_seq_fname):
seqs = read_sequences(*sequences)
seqs_to_align_fname = output + ".to_align.fasta"

existing_aln = None
existing_aln_sequence_names = set()

if existing_aln_fname:
existing_aln = read_alignment(existing_aln_fname)
seqs = prune_seqs_matching_alignment(seqs, existing_aln)
Member:

Should existing_aln_sequence_names be set here as well based on alignment content? It seems like it's only set below when writing out the existing alignment + new reference? But the interaction here is a bit unclear to me, so not sure I've understood correctly.

@@ -178,19 +188,34 @@ def postprocess(output_file, ref_name, keep_reference, fill_gaps):

def read_sequences(*fnames):
"""return list of sequences from all fnames"""
Member:

list → iterable? Or → generator?

Comment on lines 435 to 436
Return a set of seqs excluding those already in the alignment & print a warning
message for each sequence which is excluded.
Member:

set → iterable? → generator?

for record in io_read_sequences(*fnames):
# Hash each sequence and check whether another sequence with the
# same name already exists and if the hash is different.
sequence_hash = hashlib.sha256(str(record.seq).encode("utf-8")).hexdigest()
Member:

In order to avoid spurious errors about mismatched duplicate sequences, should we make a best effort attempt at normalizing record.seq before hashing it?
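One cheap best-effort normalization is case-folding before hashing. This is only a sketch of the reviewer's suggestion; whether to also strip gap characters like "-" before hashing is a separate design decision.

```python
import hashlib

def sequence_hash(sequence, normalize=True):
    """Hash a sequence string, optionally uppercasing first so that
    'acgt' and 'ACGT' do not register as mismatched duplicates."""
    if normalize:
        sequence = sequence.upper()
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()
```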

@huddlej (Contributor, Author) commented May 24, 2021

Thank you, @tsibley! A rebase would be really helpful here. Your comments highlight parts of the align module that are unclear to me, too. We may need to revisit the general approach of this module to make it easier to follow and also to better support new features like additional alignment backends.

@genehack (Contributor) commented:
@huddlej do you think this is something that could be picked back up, or has the codebase moved on past this and the best course of action is closing this out?

@huddlej (Contributor, Author) commented Jan 14, 2025

This is still an important feature, but I don't know how important compared to other priorities. I'd either leave this open as a reflection of the issue or convert it to an issue before closing.

3 participants