augur align: Ensure reference sequence is removed when no gaps are present #456

danielsoneg · 2020-03-16T23:42:32Z

While investigating #454, I spotted a separate bug - in alignment files in which the reference sequence has no gaps, the strip_non_reference code bails out early and does not remove the reference sequence from the alignment file.

I adjusted the mafft output from the zika test data to remove the gaps in alignment from the reference sequence. Upon running the modified alignment data through the code path, I saw this error:

In [107]: alseq = augur.align.strip_non_reference("./aligned.fasta", reference, keep_reference=False)                                                                                                                        
No gaps in alignment to trim (with respect to the reference, KX369547.1)

In [108]: reference in {a.name:a for a in alseq}                                                                                                                                                                       
Out[108]: True

Patch attached corrects that bug:

In [123]: importlib.reload(align)                                                                                                                                                                                      
Out[123]: <module 'augur.align' from '/Users/eric/.venv/__Users__eric__Projects__nextstrain-b3a8/nextstrain/lib/python3.7/site-packages/augur/align.py'>

In [124]: alseq = align.strip_non_reference("./aligned.fasta", reference, keep_reference=False)                                                                                                                        
No gaps in alignment to trim (with respect to the reference, KX369547.1)

In [125]: reference in {a.name:a for a in alseq}                                                                                                                                                                       
Out[125]: False

and still keeps the reference if that's preferred:

In [126]: alseq = align.strip_non_reference("./aligned.fasta", reference, keep_reference=True)                                                                                                                         
No gaps in alignment to trim (with respect to the reference, KX369547.1)

In [127]: reference in {a.name:a for a in alseq}                                                                                                                                                                       
Out[127]: True

(Also: as mentioned in the last patch, I'm trying to find useful ways to contribute while now sheltering in place, but if having some rando on the internet airdropping patches on you is more distracting than helpful, just let me know and I can find someone else to bother.

I can also start actually commenting when I start looking at these, but I'm new enough to the codebase that I don't want to claim I can fix anything before digging into it.)

danielsoneg · 2020-03-17T00:45:21Z

Travis issue is snakemake throwing an error:

Running ./add_to_alignment/Snakefile (quietly)

Error: you need to specify the maximum number of CPU cores to be used at the same time with --cores.

Test Script Failed at Line 22

The command "(bash tests/builds/runner.sh)" exited with 2.

I can reproduce that against master - looks like as of SnakeMake 5.11.0 "--cores" is mandatory:
[https://snakemake.readthedocs.io/en/stable/project_info/history.html#id2]

tsibley · 2020-03-17T15:59:21Z

Re: the test failures: I think the Snakemake change in 5.11 is poorly considered and have asked for it to be reverted.

Version 5.11 made the --cores option mandatory, which breaks most existing invocations of snakemake in our workflows (and the workflow patterns we recommend). This is causing Travis CI failures, as noted in issue #456. The previous commit makes the same version change for setup.py for when augur is installed with pip. If Snakemake reverts this breaking change¹, then we can revert this change as well, while declaring 5.11.0 and 5.11.1 to be incompatible. ¹ snakemake/snakemake#284

huddlej · 2020-03-18T16:45:36Z

@danielsoneg Thank you so much for digging into this issue! I haven't had time to confirm that this patch fixes Trevor's specific bug, but I will try it out today.

Also, we are thrilled to have contributions on these issues from volunteers like yourself. Every effort you make on an issue helps us in some way and we really appreciate it. If you are interested in finding other open issues that we've curated for volunteer contributions, check out the open project boards for Nextstrain.

When you find an issue you'd like to help out with, go ahead and comment on the issue so we know someone is looking into it. Even if you end up not working on that issue, it will help prevent overlapping work from multiple people on the same issue.

danielsoneg · 2020-03-19T17:36:08Z

Sounds good, thanks!

huddlej · 2020-03-20T00:28:44Z

After digging into this more, I think the bug @danielsoneg identified and corrected is a separate issue that we’ve been blind to. The issue with the Zika and ncov builds appears to be a mismatch in how sequences are identified by name or id depending on their source from FASTA or GenBank files, respectively.

Here is the breakdown for Zika:

BioPython’s sequence record assigns the GenBank record accession as the record id and assigns the LOCUS as the name.
The reference in augur align is tracked by its record id and not the record name. All other sequences in augur align are tracked by their record name.
The Zika reference name is “PF13/251013_18” but its id is “KX369547.1”.
A redundant copy of the reference sequence is included in the downloaded sequences from fauna and this sequence is tracked by its name as “PF13/251013_18”.
Since augur align never knows this redundant sequence has the same name (because name != id), that sequence is never removed.
The GenBank sequence with the accession name is properly removed.

The best solution to the Zika problem might be to filter the redundant reference sequence out prior to alignment by adding it to the dropped strains file.

The same problem might occur with the ncov dataset. The original reference GenBank file had an id of “MN908947.3” and a name of “MN908947”. If the reference appeared in the input sequences, too, as “MN908947”, it would never be removed.

However, @trvrb recently updated the GenBank file for the reference such that the id and name would be identical. This change to the GenBank file should result in redundant sequences being detected prior to alignment or being removed at the end of the alignment.

It might be helpful in the long term to have augur align always use a record’s name instead of its id, to avoid these kinds of issues. It would also be helpful to have unit tests for basic alignment functionality like this.

@trvrb, is this issue still open for you now that you’ve changed the ncov reference?

I think we should still merge this PR, as it fixes a real (if infrequent) bug.

huddlej · 2020-03-20T18:59:10Z

@danielsoneg Before I merge this, can you remove the reference to fixing #454, since the issue you found is different from Trevor's original problem?

Can you also add a doctest to the strip_non_reference function to test the functionality you've fixed? You should be able to add your example FASTA file with no gaps to tests/data/ and add an example call to the function with those data to the docstring. See this example from the distance module as a starting place.

You can check your test worked by running the following command from the augur top-level directory:

pytest -c pytest.python3.ini

Thank you!

danielsoneg · 2020-03-22T01:25:13Z

Thanks @huddlej - I updated the description and added some doctests and a ludicrously limited test alignment file.

huddlej · 2020-03-24T19:30:24Z

@danielsoneg Awesome! This works great for me and is a timely fix, as we've just encountered this bug in the ncov builds.

I've just merged this after removing the "tests/data/align/.test_aligned_sequences.fasta.swp" file and squashing the separate commits into a single commit. Thank you again for your help!

Eric Danielson added 2 commits March 16, 2020 16:28

Align: Ensure reference sequence is removed

fcd03ea

Use better variable name

f461134

trvrb mentioned this pull request Mar 17, 2020

align: Run mafft with --nomemsave option #458

Merged

huddlej mentioned this pull request Mar 20, 2020

Reference sequence not removed when calling augur align #454

Closed

Add tests for strip_non_reference

fb49e39

huddlej closed this Mar 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

augur align: Ensure reference sequence is removed when no gaps are present #456

augur align: Ensure reference sequence is removed when no gaps are present #456

danielsoneg commented Mar 16, 2020 •

edited

Loading

danielsoneg commented Mar 17, 2020

tsibley commented Mar 17, 2020

huddlej commented Mar 18, 2020

danielsoneg commented Mar 19, 2020

huddlej commented Mar 20, 2020

huddlej commented Mar 20, 2020

danielsoneg commented Mar 22, 2020

huddlej commented Mar 24, 2020

augur align: Ensure reference sequence is removed when no gaps are present #456

augur align: Ensure reference sequence is removed when no gaps are present #456

Conversation

danielsoneg commented Mar 16, 2020 • edited Loading

danielsoneg commented Mar 17, 2020

tsibley commented Mar 17, 2020

huddlej commented Mar 18, 2020

danielsoneg commented Mar 19, 2020

huddlej commented Mar 20, 2020

huddlej commented Mar 20, 2020

danielsoneg commented Mar 22, 2020

huddlej commented Mar 24, 2020

danielsoneg commented Mar 16, 2020 •

edited

Loading