-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[align] add new sequences to existing alignment #422
Conversation
If a reference sequence file is provided, we create a temporary file containing the input sequences + the reference. This commit removes this file upon completion. This will become important as we extend the functionality to add sequences to an additional alignment to prevent temporary files in the "data" directory in a typical nextstrain directory layout.
It's preferable to exit early if the augur command contains errors (such as a sequence name which isn't in the sample set) rather than printing a warning which is often ignored in a Snakemake-style workflow
Refactors alignment code into a series of functions, each of which can cause the command to fail by throwing a custom exception. This is necessary to allow future implementation of add-to-align functionality without causing the code flow to become too complex. Also improves some help messages & adds a few more error checks.
The help message incorrectly stated that a VCF file could be supplied to `augur align`, but the code could not handle this. The two test builds using VCF files do not use `augur align`.
Having a hand-curated alignment which is undesirable to change is common for certain pathogens. This commit allows `augur align` to add new data to an existing alignment. The trimming of gaps w.r.t. reference is not done if there are no gaps introduced in the reference & the message shown here indicates when this is so. Test build added to test adding sequences to an existing alignment. See https://mafft.cbrc.jp/alignment/software/addsequences.html for more information about the alignment algorithm.
A common starting point for analyses is to align multiple different files. This is essentially just `cat`ing them together, but with more error checking.
Thanks @jameshadfield, this is a really useful addition, and one we've had questions about in the past! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much cleaner than before -- thank you very much. I spotted one small error and made a few other suggestions... other wise good.
See review at #422
This is a change in behavior from current augur releases, but is sensible and was requested in #422
We now only produce extra files (e.g. pre- and post-aligner files, which can help with debugging poor alignments) if the `--debug` flag is set. This was requested in #422
These are often covered up by Snakemake which tends to make the directories as needed for output files.
parser.add_argument('--sequences', '-s', required=True, help="sequences in fasta or VCF format") | ||
parser.add_argument('--output', '-o', help="output file") | ||
parser.add_argument('--sequences', '-s', required=True, nargs="+", metavar="FASTA", help="sequences to align") | ||
parser.add_argument('--output', '-o', default="alignment.fasta", help="output file (default: %(default)s)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bad idea to have defaults for --output
, but strange that augur align
is the only command to get this. Would be the plan be to add defaults to all the other --output
arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality wasn't actually added in this PR (it was set in an if/else clause at lines 57-60), this syntax is simply a nicer way to express this. Happy for the default to be removed via this PR if desired?
Looks generally great @jameshadfield in terms of implementation (I just found the one small typo in testing). However, something to note. I was testing this with the https://github.com/inrb-drc/ebola-nord-kivu alignment issue and it's not a magic bullet. I can construct a good alignment for the majority of sequences and then when adding the few problem sequences, the resulting alignment still has the same issues for these problem sequences. However, this should still be a nice feature for people with curated alignments so that MAFFT doesn't break the curation that has occurred, even if more will be necessary. I might give some thought as to recommended snakemake workflows when using One major workflow issue: If I make a call to
|
A common workflow may include sequences "twice" - in multiple files provided as sequences or in the alignment & a sequence file. Here we prune out such duplicates and print notifications to the screen.
Thanks @trvrb
This functionality has now been added by ef029f9, which will also allow multiple sequence FASTAs to contain duplicates.
Pretty much. I was planning to do this with the ebola dataset, and think that I still will, but it seems that this PR won't solve all the problems we encounter. |
@trvrb any objections to merging this? |
No objections. Thanks for attending to my original concerns. |
Having a hand-curated alignment which is undesirable to change is common for certain pathogens and research groups. This commit allows
augur align
to add new data to an existing alignment via the--existing-alignment
argument. Internally it uses mafft's--add
functionality.We also allow multiple FASTAs to be supplied as input.
This allows the common workflow of "hand-curated reference set" + "new sequences" + "more new sequences". A test build has been added which does just this.
Smaller changes
augur align
.)Improved help message to indicate that_UPDATE:--fill-gaps
only works when specifying a reference. Unclear whether this is the desired intention but it reflects what the code does.--fill-gaps
now works without needing a reference sequence.--debug