-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
augur align: Ensure reference sequence is removed when no gaps are present #456
augur align: Ensure reference sequence is removed when no gaps are present #456
Conversation
Travis issue is snakemake throwing an error:
I can reproduce that against master - looks like as of SnakeMake 5.11.0 "--cores" is mandatory: |
Re: the test failures: I think the Snakemake change in 5.11 is poorly considered and have asked for it to be reverted. |
Version 5.11 made the --cores option mandatory, which breaks most existing invocations of snakemake in our workflows (and the workflow patterns we recommend). This is causing Travis CI failures, as noted in issue #456. The previous commit makes the same version change for setup.py for when augur is installed with pip. If Snakemake reverts this breaking change¹, then we can revert this change as well, while declaring 5.11.0 and 5.11.1 to be incompatible. ¹ snakemake/snakemake#284
@danielsoneg Thank you so much for digging into this issue! I haven't had time to confirm that this patch fixes Trevor's specific bug, but I will try it out today. Also, we are thrilled to have contributions on these issues from volunteers like yourself. Every effort you make on an issue helps us in some way and we really appreciate it. If you are interested in finding other open issues that we've curated for volunteer contributions, check out the open project boards for Nextstrain. When you find an issue you'd like to help out with, go ahead and comment on the issue so we know someone is looking into it. Even if you end up not working on that issue, it will help prevent overlapping work from multiple people on the same issue. |
Sounds good, thanks! |
After digging into this more, I think the bug @danielsoneg identified and corrected is a separate issue that we’ve been blind to. The issue with the Zika and ncov builds appears to be a mismatch in how sequences are identified by name or id depending on their source from FASTA or GenBank files, respectively. Here is the breakdown for Zika:
The best solution to the Zika problem might be to filter the redundant reference sequence out prior to alignment by adding it to the dropped strains file. The same problem might occur with the ncov dataset. The original reference GenBank file had an However, @trvrb recently updated the GenBank file for the reference such that the id and name would be identical. This change to the GenBank file should result in redundant sequences being detected prior to alignment or being removed at the end of the alignment. It might be helpful in the long term to have augur align always use a record’s @trvrb, is this issue still open for you now that you’ve changed the ncov reference? I think we should still merge this PR, as it fixes a real (if infrequent) bug. |
@danielsoneg Before I merge this, can you remove the reference to fixing #454, since the issue you found is different from Trevor's original problem? Can you also add a doctest to the You can check your test worked by running the following command from the augur top-level directory:
Thank you! |
Thanks @huddlej - I updated the description and added some doctests and a ludicrously limited test alignment file. |
@danielsoneg Awesome! This works great for me and is a timely fix, as we've just encountered this bug in the ncov builds. I've just merged this after removing the "tests/data/align/.test_aligned_sequences.fasta.swp" file and squashing the separate commits into a single commit. Thank you again for your help! |
While investigating #454, I spotted a separate bug - in alignment files in which the reference sequence has no gaps, the
strip_non_reference
code bails out early and does not remove the reference sequence from the alignment file.I adjusted the mafft output from the zika test data to remove the gaps in alignment from the reference sequence. Upon running the modified alignment data through the code path, I saw this error:
Patch attached corrects that bug:
and still keeps the reference if that's preferred:
(Also: as mentioned in the last patch, I'm trying to find useful ways to contribute while now sheltering in place, but if having some rando on the internet airdropping patches on you is more distracting than helpful, just let me know and I can find someone else to bother.
I can also start actually commenting when I start looking at these, but I'm new enough to the codebase that I don't want to claim I can fix anything before digging into it.)