Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[align] add new sequences to existing alignment #422

Merged
merged 12 commits into from
Feb 26, 2020
Merged

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Dec 12, 2019

Having a hand-curated alignment which is undesirable to change is common for certain pathogens and research groups. This commit allows augur align to add new data to an existing alignment via the --existing-alignment argument. Internally it uses mafft's --add functionality.

We also allow multiple FASTAs to be supplied as input.

This allows the common workflow of "hand-curated reference set" + "new sequences" + "more new sequences". A test build has been added which does just this.

Smaller changes

  • The trimming of gaps w.r.t. reference is not done if there are no gaps introduced in the reference & the message shown here indicates when this is so.
  • More error checking
  • No longer mention "VCF" in the help -- the code could not handle this and it doesn't really make sense. (The two test builds using VCF files do not use augur align.)
  • Improved help message to indicate that --fill-gaps only works when specifying a reference. Unclear whether this is the desired intention but it reflects what the code does. _UPDATE: --fill-gaps now works without needing a reference sequence.
  • Code refactoring into functions
  • Exit early if errors detected (e.g. you specified a reference name but it wasn't in the sequences). This follows other work in augur.
  • UPDATE: debugging files (e.g. pre- and post-aligner fastas) are only produced if asked for via --debug

If a reference sequence file is provided, we create a temporary file containing the input sequences + the reference. This commit removes this file upon completion. This will become important as we extend the functionality to add sequences to an additional alignment to prevent temporary files in the "data" directory in a typical nextstrain directory layout.
It's preferable to exit early if the augur command contains errors (such as a sequence name which isn't in the sample set) rather than printing a warning which is often ignored in a Snakemake-style workflow
Refactors alignment code into a series of functions, each of which can cause the command to fail by throwing a custom exception. This is necessary to allow future implementation of add-to-align functionality without causing the code flow to become too complex. Also improves some help messages & adds a few more error checks.
The help message incorrectly stated that a VCF file could be supplied to `augur align`, but the code could not handle this. The two test builds using VCF files do not use `augur align`.
Having a hand-curated alignment which is undesirable to change is common for certain pathogens. This commit allows `augur align` to add new data to an existing alignment. 

The trimming of gaps w.r.t. reference is not done if there are no gaps introduced in the reference & the message shown here indicates when this is so.

Test build added to test adding sequences to an existing alignment.

See https://mafft.cbrc.jp/alignment/software/addsequences.html for more information about the alignment algorithm.
A common starting point for analyses is to align multiple different files. This is essentially just `cat`ing them together, but with more error checking.
@emmahodcroft
Copy link
Member

Thanks @jameshadfield, this is a really useful addition, and one we've had questions about in the past!

Copy link
Member

@rneher rneher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much cleaner than before -- thank you very much. I spotted one small error and made a few other suggestions... other wise good.

augur/align.py Outdated Show resolved Hide resolved
augur/align.py Show resolved Hide resolved
augur/align.py Outdated Show resolved Hide resolved
augur/align.py Outdated Show resolved Hide resolved
augur/align.py Outdated Show resolved Hide resolved
This is a change in behavior from current augur releases, but is sensible and was requested in #422
We now only produce extra files (e.g. pre- and post-aligner files, which can help with debugging poor alignments) if the `--debug` flag is set. This was requested in #422
These are often covered up by Snakemake which tends to make the directories as needed for output files.
parser.add_argument('--sequences', '-s', required=True, help="sequences in fasta or VCF format")
parser.add_argument('--output', '-o', help="output file")
parser.add_argument('--sequences', '-s', required=True, nargs="+", metavar="FASTA", help="sequences to align")
parser.add_argument('--output', '-o', default="alignment.fasta", help="output file (default: %(default)s)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bad idea to have defaults for --output, but strange that augur align is the only command to get this. Would be the plan be to add defaults to all the other --output arguments?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality wasn't actually added in this PR (it was set in an if/else clause at lines 57-60), this syntax is simply a nicer way to express this. Happy for the default to be removed via this PR if desired?

augur/align.py Outdated Show resolved Hide resolved
@trvrb
Copy link
Member

trvrb commented Dec 31, 2019

Looks generally great @jameshadfield in terms of implementation (I just found the one small typo in testing). However, something to note. I was testing this with the https://github.com/inrb-drc/ebola-nord-kivu alignment issue and it's not a magic bullet. I can construct a good alignment for the majority of sequences and then when adding the few problem sequences, the resulting alignment still has the same issues for these problem sequences. However, this should still be a nice feature for people with curated alignments so that MAFFT doesn't break the curation that has occurred, even if more will be necessary.

I might give some thought as to recommended snakemake workflows when using --existing-alignment. For starters, I assume one would move aligned.fasta from results/ to data/ and rename it something like data/curated-alignment.fasta for clarity.

One major workflow issue:

If I make a call to --existing-alignment curated-alignment.fasta with say 398 of my 400 strains. And then try to run a build with all 400 entering as --sequences then I get the error:

Duplicate strains of "outgroup" detected

augur align should allow the same strains to appear both in --existing-alignment and in --sequences. If they are in --existing-alignment then remove from --sequences before handing to MAFFT.

A common workflow may include sequences "twice" - in  multiple files provided as sequences or in the alignment & a sequence file. Here we prune out such duplicates and print notifications to the screen.
@jameshadfield
Copy link
Member Author

Thanks @trvrb

If I make a call to --existing-alignment curated-alignment.fasta with say 398 of my 400 strains...

This functionality has now been added by ef029f9, which will also allow multiple sequence FASTAs to contain duplicates.

I might give some thought as to recommended snakemake workflows when using --existing-alignment. For starters, I assume one would move aligned.fasta from results/ to data/ and rename it something like data/curated-alignment.fasta for clarity.

Pretty much. I was planning to do this with the ebola dataset, and think that I still will, but it seems that this PR won't solve all the problems we encounter.

@jameshadfield jameshadfield requested a review from trvrb January 7, 2020 21:06
@jameshadfield
Copy link
Member Author

@trvrb any objections to merging this?

@trvrb
Copy link
Member

trvrb commented Feb 10, 2020

No objections. Thanks for attending to my original concerns.

@jameshadfield jameshadfield merged commit 207f9a3 into master Feb 26, 2020
@jameshadfield jameshadfield deleted the add-to-alignment branch February 26, 2020 04:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants