reduce number of iterations over aligned sequences #465
Comments
EDIT: nvm. I guess I'll give feedback on the existing PR.
@tolot27 @groutr If either of you is still interested in working on this issue, the most relevant data to test with might be the recent SARS-CoV-2 sequences described in the ncov repo docs. If you are interested, let me know if there is any way I can help.
@huddlej I'll take a look at this.
Thank you, @groutr!
@huddlej Hi. I've been looking at this in my free time over the past week or so and simply wanted to sketch out what improvements I felt were needed to make sure they are aligned with the goals of the project before I invest more time in them.
    seqs = read_alignment(output_file)
    prettify_alignment(seqs)
    if ref_name:
        seqs = strip_non_references(seqs, ref_name)
    if fill_gaps:
        make_gaps_ambiguous(seqs)
    write_seqs(seqs, output_file)

becomes:

    seqs = read_alignment(output_file)
    processed = map(prettify_alignment, iter(seqs))
    if ref_name:
        processed = map(strip_non_references, processed)
    if fill_gaps:
        processed = map(make_gaps_ambiguous, processed)
    write_seqs(processed, output_file)  # sequences are processed lazily and written to output_file
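To make the lazy version above concrete, here is a minimal, self-contained sketch of the same idea. It is not augur's code: it uses plain `(name, sequence)` tuples instead of Biopython records, and the helpers `prettify`, `strip_to_reference`, `fill_gaps_ambiguous`, and `process` are hypothetical per-record stand-ins. It also assumes that gaps are filled with `N`, and it binds the reference's gap positions with `functools.partial`, since a per-record strip step needs that extra argument.

```python
from functools import partial


def prettify(record):
    """Uppercase a single (name, sequence) record."""
    name, seq = record
    return name, seq.upper()


def strip_to_reference(record, keep):
    """Keep only the columns where the reference has no gap."""
    name, seq = record
    return name, "".join(base for base, k in zip(seq, keep) if k)


def fill_gaps_ambiguous(record):
    """Replace gap characters with N (assumed ambiguity code)."""
    name, seq = record
    return name, seq.replace("-", "N")


def process(records, ref_seq=None, fill_gaps=False):
    """Chain the per-record steps lazily; nothing runs until consumed."""
    processed = map(prettify, records)
    if ref_seq is not None:
        # Compute the reference's non-gap columns once, reuse for every record.
        keep = [base != "-" for base in ref_seq]
        processed = map(partial(strip_to_reference, keep=keep), processed)
    if fill_gaps:
        processed = map(fill_gaps_ambiguous, processed)
    return processed


records = [("ref", "ac-gt"), ("s1", "acag-"), ("s2", "a--gt")]
# Each record passes through all steps exactly once when the iterator is consumed.
result = list(process(iter(records), ref_seq="ac-gt", fill_gaps=True))
print(result)
```

In a real pipeline the final `list(...)` would be replaced by a writer that consumes the iterator, so each sequence is read, transformed, and written in a single pass.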
    import operator
    import re

    import numpy as np
    from Bio import Seq


    def numpy_based(aln, ref):
        # Build a boolean mask of the reference's non-gap columns and
        # apply it to the whole alignment as a character array.
        ref_array = np.array(aln[ref])
        if "-" not in ref_array:
            pass
        ungapped = ref_array != "-"
        ref_aln_array = np.array(aln)[:, ungapped]
        if False in ungapped:
            pass
        out_seqs = []
        for seq, seq_array in zip(aln, ref_aln_array):
            seq.seq = Seq.Seq(''.join(seq_array))
            out_seqs.append(seq)
        if '-' in ref_array:
            pass
        return out_seqs


    def re_based(aln, ref):
        # Find runs of non-gap characters in the reference and slice the
        # same spans out of every sequence.
        g = re.compile(r'[^\-]+')
        ref_seq = str(aln[ref].seq)
        matches = list(g.finditer(ref_seq))
        if not matches:
            slices = [slice(0, len(ref_seq))]
        else:
            slices = [slice(*x.span()) for x in matches]
        out_seqs = []
        ungapped_getter = operator.itemgetter(*slices)
        if len(slices) > 1:
            # itemgetter returns a tuple of substrings for multiple slices
            for record in aln:
                record.seq = Seq.Seq(''.join(ungapped_getter(str(record.seq))))
                out_seqs.append(record)
        else:
            # itemgetter returns a single substring for one slice
            for record in aln:
                record.seq = Seq.Seq(ungapped_getter(str(record.seq)))
                out_seqs.append(record)
        return out_seqs
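As a rough illustration of the slice-based idea in `re_based`, the sketch below applies it to plain strings, without Biopython or NumPy, and checks it against a simple column-by-column filter. The function names (`ungapped_slices`, `strip_with_slices`, `strip_with_mask`) and the toy sequences are made up for this example. Unlike `re_based`, this sketch returns an empty string when the reference consists only of gaps.

```python
import re


def ungapped_slices(ref_seq):
    """Return slices covering each run of non-gap characters in ref_seq."""
    return [slice(*m.span()) for m in re.finditer(r"[^-]+", ref_seq)]


def strip_with_slices(seq, slices):
    """Join the pieces of seq selected by the precomputed slices."""
    return "".join(seq[s] for s in slices)


def strip_with_mask(seq, ref_seq):
    """Column-by-column filter, used here only for comparison."""
    return "".join(b for b, r in zip(seq, ref_seq) if r != "-")


ref = "AC--GT-A"
seqs = ["ACTTGTCA", "A-TTG--A"]

# The slices are computed once from the reference and reused for every sequence.
slices = ungapped_slices(ref)
stripped = [strip_with_slices(s, slices) for s in seqs]
masked = [strip_with_mask(s, ref) for s in seqs]
print(stripped)
```

The slice approach copies whole runs of characters at a time, which is likely why it can be cheaper than per-column filtering when the reference has few gaps; that is an expectation, not a measured result.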
Thank you for this summary, @groutr! I'll review this and post some comments by the end of the week.
@huddlej any update on this?
ping @huddlej
@huddlej any update on this?
Hi @groutr! Sorry about the radio silence on this. In the time since we originally discussed this issue, we've replaced augur align with nextalign for analyses with larger datasets (e.g., SARS-CoV-2 and now influenza). We do hope to eventually support running nextalign from within augur align. When we eventually get to this, we'll likely refer back to your comments in this issue as an example of how to better compose the logic in the align module.
This issue is extracted from the original post in #462 (comment) and should address a gradually increasing performance degradation:
Iterating over all aligned sequences three times, modifying them (uppercasing, gap replacement), and writing them to disk may be slow for large alignments and wastes I/O. A fourth iteration was recently added in commit e3d1848, at line 242 of augur/augur/align.py.