Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce number of iterations over aligned sequences #465

Closed
tolot27 opened this issue Mar 24, 2020 · 10 comments
Closed

reduce number of iterations over aligned sequences #465

tolot27 opened this issue Mar 24, 2020 · 10 comments
Assignees
Labels
enhancement New feature or request moderate problem Requires an average amount of work please take this issue Extra attention is needed priority: low To be resolved after high and moderate priority issues

Comments

@tolot27
Copy link
Contributor

tolot27 commented Mar 24, 2020

This issue is extracted from the original post in #462 (comment)_ and should address a slightly increasing performance degradation:

Iterating over all aligned sequences three times, modifying it (uppercasing, gap replacement) and writing it to disk is maybe slow for large alignments and wastes I/O. A fourth iteration was recently added with commit e3d1848 in line

return [seq for seq in aln if (keep_reference or seq.name != reference)]
.

@groutr
Copy link
Contributor

groutr commented Mar 25, 2020

I would like to work on this issue. Is there a input file that would be good to benchmark/profile things with?

EDIT: nvm. I guess I'll give feedback on the existing PR.

@huddlej
Copy link
Contributor

huddlej commented Jun 26, 2020

@tolot27 @groutr If either of you are still interested in working on this issue, the most relevant data to test with might be recent SARS-CoV-2 sequences described in the ncov repo docs.

If you are interested, let me know if there is anyway I can help.

@huddlej huddlej added priority: low To be resolved after high and moderate priority issues moderate problem Requires an average amount of work please take this issue Extra attention is needed labels Jun 26, 2020
@groutr
Copy link
Contributor

groutr commented Jul 7, 2020

@huddlej I'll take a look at this.

@huddlej
Copy link
Contributor

huddlej commented Jul 7, 2020

Thank you, @groutr!

@groutr
Copy link
Contributor

groutr commented Jul 19, 2020

@huddlej Hi. I've been looking at this in my free time over the past week or so and simply wanted to sketch out what improvements I felt were needed to make sure they are aligned with the goals of the project before I invest more time in them.

  1. Indexing: There seem to be a few places where we essentially build and index of the sequences just to perform a single operation (see https://github.com/nextstrain/augur/blob/master/augur/align.py#L423, https://github.com/nextstrain/augur/blob/master/augur/align.py#L390-L407, https://github.com/nextstrain/augur/blob/master/augur/align.py#L281, https://github.com/nextstrain/augur/blob/master/augur/align.py#L209). We essentially build and index in read_sequences, but then we throw it away. I suggest keeping this index so we are working with a dictionary of SeqRecords instead of a list of SeqRecords. For existing alignments, I suggest attaching an index to the MultipleSeqAlign object (as an attribute). There are many operations that happen here that can be trivialized with an index. If memory usage is a concern, then BioPython provides a way to index a fasta file without loading it into memory, but we would be working with the id field instead of the name field and those might not always be equivalent.
  2. Reference Sequence: I suggest storing the reference sequence separate from the rest of the sequences. However, this becomes a moot point if we decide to do an index.
  3. Composing Functions: Refactor the function in align to work on a single SeqRecord instead of explicitly looping over a collection of SeqRecords. These new functions would accept a SeqRecord as input and return a new SeqRecord, instead of modifying the SeqRecord inplace. Refactoring this way would make it easier to reason about composing the functions. Then we can simply map the functions over inputs and write the result to a file.
    For example:
seqs = read_alignment(output_file)
prettify_alignment(seqs)
if ref_name:
    seqs = strip_non_references(seqs, ref_name)
if fill_gaps:
    make_gaps_ambiguous(seqs)
write_seqs(seqs, output_file)

becomes:

seqs = read_alignment(output_file)
processed = map(prettify_alignment, iter(seqs))
if ref_name:
    processed = map(strip_non_references, processed)
if fill_gaps:
    processed = map(make_gaps_ambiguous, processed)
write_seqs(processed, output_file)  # sequences are processed lazily and written to output_file
  1. Return new SeqRecords: Some functions operate inplace and other functions return new objects. I suggest making all functions return new objects.
  2. Regex instead of NumPy: The functions strip_non_references and analyze_insertions are suboptimal implementations. I tested the following rough implementations on my machine on a test alignment of 418 sequences of length 30006.
def numpy_based(aln, ref):
    ref_array = np.array(aln[ref])
    if "-" not in ref_array: 
        pass
    ungapped = ref_array != "-" 
    ref_aln_array = np.array(aln)[:,ungapped] 
    
    if False in ungapped: 
        pass
    out_seqs = []
    for seq, seq_array in zip(aln, ref_aln_array):
        seq.seq = Seq.Seq(''.join(seq_array))
        out_seqs.append(seq)
        
    if '-' in ref_array: 
        pass
    return out_seqs
    
def re_based(aln, ref):
    g = re.compile(r'[^\-]+')
    ref_seq = str(aln[ref].seq)
    matches = list(g.finditer(ref_seq))
    if not matches:
        slices = [slice(0, len(ref_seq))]
    else:
        slices = [slice(*x.span()) for x in matches]
    
    out_seqs = []
    ungapped_getter = operator.itemgetter(*slices)
    if len(slices) > 1:
        for record in aln:
            record.seq = Seq.Seq(''.join(ungapped_getter(str(record.seq))))
            out_seqs.append(record)
    else:
        for record in aln:
            record.seq = Seq.Seq(ungapped_getter(str(record.seq)))
            out_seqs.append(record)
            
    return out_seqs
%timeit numpy_based(f, 100)
17.5 s ± 181 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit re_based(f, 100)
1.4 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@huddlej
Copy link
Contributor

huddlej commented Jul 22, 2020

Thank you for this summary, @groutr! I'll review this and post some comments by the end of the week.

@groutr
Copy link
Contributor

groutr commented Aug 18, 2020

@huddlej any update on this?

@groutr
Copy link
Contributor

groutr commented Apr 21, 2021

ping @huddlej

@groutr
Copy link
Contributor

groutr commented Jun 7, 2022

@huddlej any update on this?

@huddlej
Copy link
Contributor

huddlej commented Jun 8, 2022

Hi @groutr! Sorry about the radio silence on this. In the time since we originally discussed this issue, we've replaced augur align with nextalign for analyses with larger datasets (e.g., SARS-CoV-2 and now influenza). We do hope to eventually support running nextalign from within augur align. When we eventually get to this, we'll likely refer back to your comments in this issue as an example of how to better compose the logic in the align module.

@huddlej huddlej closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2022
Repository owner moved this from Backlog to Done in Nextstrain planning (archived) Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request moderate problem Requires an average amount of work please take this issue Extra attention is needed priority: low To be resolved after high and moderate priority issues
Projects
No open projects
Development

No branches or pull requests

3 participants