[align] add new sequences to existing alignment #422

jameshadfield · 2019-12-12T00:24:33Z

Certain pathogens and research groups maintain a hand-curated alignment that they do not want changed. This commit allows augur align to add new data to an existing alignment via the --existing-alignment argument. Internally it uses mafft's --add functionality.

We also allow multiple FASTAs to be supplied as input.

This allows the common workflow of "hand-curated reference set" + "new sequences" + "more new sequences". A test build has been added which does just this.
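
As a rough illustration of the "existing alignment + new sequences" step described above, the sketch below builds a MAFFT command line that uses `--add`. It is not the code in `augur/align.py`: the function name `build_mafft_add_command` and the file names are hypothetical, and only the `--add` and `--thread` options of MAFFT are assumed.

```python
import shlex


def build_mafft_add_command(existing_alignment, new_sequences, output, threads=1):
    """Return a shell command that adds new sequences to an existing alignment.

    Hypothetical helper for illustration only; augur's real command may pass
    additional MAFFT options.
    """
    args = [
        "mafft",
        "--add", new_sequences,      # unaligned sequences to be added
        "--thread", str(threads),
        existing_alignment,          # the curated alignment to add to
    ]
    # Quote every argument so that file names containing spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args) + " > " + shlex.quote(output)


if __name__ == "__main__":
    print(build_mafft_add_command("data/curated-alignment.fasta", "data/new.fasta", "results/aligned.fasta"))
```

The command is returned as a string rather than executed, so the sketch can be inspected without MAFFT installed.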

Smaller changes

Gaps are no longer trimmed with respect to the reference when no gaps were introduced in the reference, and the message shown indicates when this is the case.
More error checking
No longer mention "VCF" in the help -- the code could not handle this and it doesn't really make sense. (The two test builds using VCF files do not use augur align.)
~~Improved help message to indicate that --fill-gaps only works when specifying a reference. Unclear whether this is the desired intention but it reflects what the code does.~~ _UPDATE: --fill-gaps now works without needing a reference sequence._
Code refactoring into functions
Exit early if errors detected (e.g. you specified a reference name but it wasn't in the sequences). This follows other work in augur.
UPDATE: debugging files (e.g. pre- and post-aligner fastas) are only produced if asked for via --debug
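
The gap-trimming item in the list above can be pictured with a small sketch. It assumes the alignment is held as a plain dict of name to aligned string, which is a simplification and not how augur stores it; the function name `strip_reference_insertions` and the printed message are hypothetical.

```python
def strip_reference_insertions(alignment, reference_name):
    """Drop alignment columns where the reference has a gap.

    `alignment` maps sequence name -> aligned sequence (all the same length).
    If the reference contains no gaps, nothing is trimmed and the input is
    returned unchanged.
    """
    ref = alignment[reference_name]
    keep = [i for i, base in enumerate(ref) if base != "-"]
    if len(keep) == len(ref):
        print("No gaps to trim with respect to the reference (%s)" % reference_name)
        return alignment
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


if __name__ == "__main__":
    example = {"ref": "AC-GT", "s1": "ACTGT", "s2": "A--GT"}
    print(strip_reference_insertions(example, "ref"))
```

Returning the input untouched in the no-gap case mirrors the behaviour described in the list: trimming is skipped and the user is told why.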

If a reference sequence file is provided, we create a temporary file containing the input sequences + the reference. This commit removes that file upon completion. This will become important as we extend the functionality to add sequences to an existing alignment, because it prevents temporary files from accumulating in the "data" directory of a typical nextstrain directory layout.

Refactors alignment code into a series of functions, each of which can cause the command to fail by throwing a custom exception. This is necessary to allow future implementation of add-to-align functionality without causing the code flow to become too complex. Also improves some help messages & adds a few more error checks.
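
A minimal sketch of the pattern this commit message describes, where each step may raise a custom exception and the command exits early with a non-zero status. The names `AlignmentError`, `check_reference_present` and `run` are illustrative assumptions and need not match the identifiers in `augur/align.py`.

```python
import sys


class AlignmentError(Exception):
    """Raised by any step whose failure should abort the command."""


def check_reference_present(sequence_names, reference_name):
    """Fail if the requested reference is not among the input sequences."""
    if reference_name not in sequence_names:
        raise AlignmentError(
            "ERROR: specified reference name %r is not in the input sequences" % reference_name
        )


def run(sequence_names, reference_name):
    """Return 0 on success and 1 on the first detected error."""
    try:
        check_reference_present(sequence_names, reference_name)
        # further steps (reading, aligning, trimming) would be called here
    except AlignmentError as error:
        print(error, file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(run({"strainA", "strainB"}, "strainA"))
```

Funnelling every failure through one exception type keeps the control flow flat: a workflow manager sees a failed exit status instead of a warning that scrolls past.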

Having a hand-curated alignment which is undesirable to change is common for certain pathogens. This commit allows `augur align` to add new data to an existing alignment. The trimming of gaps w.r.t. reference is not done if there are no gaps introduced in the reference & the message shown here indicates when this is so. Test build added to test adding sequences to an existing alignment. See https://mafft.cbrc.jp/alignment/software/addsequences.html for more information about the alignment algorithm.

emmahodcroft · 2019-12-12T10:44:09Z

Thanks @jameshadfield, this is a really useful addition, and one we've had questions about in the past!

rneher

This is much cleaner than before -- thank you very much. I spotted one small error and made a few other suggestions... otherwise good.

augur/align.py

trvrb · 2019-12-31T00:44:14Z

augur/align.py

-    parser.add_argument('--sequences', '-s', required=True, help="sequences in fasta or VCF format")
-    parser.add_argument('--output', '-o', help="output file")
+    parser.add_argument('--sequences', '-s', required=True, nargs="+", metavar="FASTA", help="sequences to align")
+    parser.add_argument('--output', '-o', default="alignment.fasta", help="output file (default: %(default)s)")


Not a bad idea to have defaults for --output, but strange that augur align is the only command to get this. Would the plan be to add defaults to all the other --output arguments?

This functionality wasn't actually added in this PR (it was set in an if/else clause at lines 57-60); this syntax is simply a nicer way to express it. Happy for the default to be removed via this PR if desired?
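
For readers unfamiliar with the argparse features used in the diff above, this standalone sketch shows what `nargs="+"` and `default=` with `%(default)s` do. The two `add_argument` calls are copied from the diff; the parser name `align-sketch` and the input file names are made up.

```python
import argparse


def build_parser():
    """Parser with only the two arguments touched by the diff."""
    parser = argparse.ArgumentParser(prog="align-sketch")
    parser.add_argument('--sequences', '-s', required=True, nargs="+", metavar="FASTA", help="sequences to align")
    parser.add_argument('--output', '-o', default="alignment.fasta", help="output file (default: %(default)s)")
    return parser


# nargs="+" collects one or more values into a list;
# --output falls back to its declared default when omitted.
args = build_parser().parse_args(["--sequences", "a.fasta", "b.fasta"])
print(args.sequences, args.output)
```

Declaring the default on the argument also lets argparse interpolate it into the help text, which an if/else after parsing cannot do.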

trvrb · 2019-12-31T01:40:52Z

Looks generally great @jameshadfield in terms of implementation (I just found the one small typo in testing). However, something to note: I was testing this with the https://github.com/inrb-drc/ebola-nord-kivu alignment issue and it's not a magic bullet. I can construct a good alignment for the majority of sequences, but when I then add the few problem sequences, the resulting alignment still has the same issues for those sequences. However, this should still be a nice feature for people with curated alignments so that MAFFT doesn't break the curation that has occurred, even if more work will be necessary.

I might give some thought as to recommended snakemake workflows when using --existing-alignment. For starters, I assume one would move aligned.fasta from results/ to data/ and rename it something like data/curated-alignment.fasta for clarity.

One major workflow issue:

If I make a call to --existing-alignment curated-alignment.fasta with say 398 of my 400 strains, and then try to run a build with all 400 entering as --sequences, I get the error:

Duplicate strains of "outgroup" detected

augur align should allow the same strains to appear both in --existing-alignment and in --sequences. If they are in --existing-alignment then remove from --sequences before handing to MAFFT.
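
A minimal sketch of the requested behaviour, under the assumption that sequences are handled as `(name, sequence)` tuples and that the first occurrence of a name wins. The function name `prune_duplicates` is hypothetical and this is not the implementation in `augur/align.py`.

```python
def prune_duplicates(existing_names, new_records):
    """Drop records already in the existing alignment or seen earlier.

    `existing_names` is an iterable of strain names in the existing alignment.
    `new_records` is a list of (name, sequence) tuples from the --sequences files.
    """
    seen = set(existing_names)
    kept = []
    for name, sequence in new_records:
        if name in seen:
            print("Skipping %r: already present" % name)
            continue
        seen.add(name)
        kept.append((name, sequence))
    return kept


if __name__ == "__main__":
    records = [("outgroup", "ACGT"), ("strainB", "ACGA"), ("strainB", "ACGA")]
    print(prune_duplicates({"outgroup"}, records))
```

Seeding `seen` with the names from the existing alignment handles both cases with one pass: overlap with the curated alignment and repeats across several input files.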

jameshadfield · 2020-01-01T23:23:50Z

Thanks @trvrb

> If I make a call to --existing-alignment curated-alignment.fasta with say 398 of my 400 strains...

This functionality has now been added by ef029f9, which will also allow multiple sequence FASTAs to contain duplicates.

> I might give some thought as to recommended snakemake workflows when using --existing-alignment. For starters, I assume one would move aligned.fasta from results/ to data/ and rename it something like data/curated-alignment.fasta for clarity.

Pretty much. I was planning to do this with the ebola dataset, and think that I still will, but it seems that this PR won't solve all the problems we encounter.

jameshadfield · 2020-02-10T22:51:14Z

@trvrb any objections to merging this?

trvrb · 2020-02-10T23:31:22Z

No objections. Thanks for attending to my original concerns.

jameshadfield added 6 commits December 11, 2019 17:25

[align] exit if arguments are invalid

c7a9013

It's preferable to exit early if the augur command contains errors (such as a sequence name which isn't in the sample set) rather than printing a warning which is often ignored in a Snakemake-style workflow

[align] remove help message indicating VCF support

45cd18e

The help message incorrectly stated that a VCF file could be supplied to `augur align`, but the code could not handle this. The two test builds using VCF files do not use `augur align`.

[align] allow multiple input sequences

7e3c312

A common starting point for analyses is to align multiple different files. This is essentially just `cat`ing them together, but with more error checking.
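
The "`cat` with more error checking" idea can be sketched with a small dependency-free FASTA reader. This is an illustration only: `read_fasta` and `read_many` are hypothetical names, the checks shown are examples of what such a step might verify, and the real code may rely on a FASTA library.

```python
def read_fasta(path):
    """Return a list of (name, sequence) tuples from one FASTA file."""
    records = []
    name, chunks = None, []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                header = line[1:].split()
                if not header:
                    raise ValueError("%s: empty FASTA header" % path)
                name, chunks = header[0], []
            elif name is None:
                raise ValueError("%s: sequence data before first header" % path)
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def read_many(paths):
    """Concatenate the records of several FASTA files, rejecting empty inputs."""
    combined = []
    for path in paths:
        records = read_fasta(path)
        if not records:
            raise ValueError("%s contains no sequences" % path)
        combined.extend(records)
    return combined
```

Calling `read_many(["reference-set.fasta", "new.fasta"])` (file names invented here) would return one combined list in input order.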

jameshadfield requested review from trvrb and rneher December 12, 2019 00:24

rneher requested changes Dec 20, 2019

jameshadfield added 4 commits December 23, 2019 09:59

[align] 2 minor fixes

ca71c4f

See review at #422

[align] allow gaps to be filled when reference not provided

47fde1a

This is a change in behavior from current augur releases, but is sensible and was requested in #422

[align] require --debug flag for extra output files

05864fa

We now only produce extra files (e.g. pre- and post-aligner files, which can help with debugging poor alignments) if the `--debug` flag is set. This was requested in #422

[align] catch write errors to a non-existing directory

e864051

These are often covered up by Snakemake which tends to make the directories as needed for output files.
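
One way such a check could look, as a sketch only: verify up front that the directory for the output file exists, instead of failing after the aligner has run. The function name `check_output_directory` and the error text are assumptions, not taken from `augur/align.py`.

```python
import os


def check_output_directory(output_path):
    """Raise early if the directory that should hold `output_path` is missing."""
    directory = os.path.dirname(output_path) or "."
    if not os.path.isdir(directory):
        raise FileNotFoundError("ERROR: output directory %r does not exist" % directory)
    return directory
```

Outside a workflow manager nothing creates `results/` automatically, so a check like this surfaces the problem before any alignment work is done.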

trvrb reviewed Dec 31, 2019

trvrb requested changes Dec 31, 2019

jameshadfield added 2 commits January 2, 2020 10:26

[align] fix typo

b9f7300

[align] handle duplicate sequences

ef029f9

A common workflow may include sequences "twice" - in multiple files provided as sequences or in the alignment & a sequence file. Here we prune out such duplicates and print notifications to the screen.

jameshadfield requested a review from trvrb January 7, 2020 21:06

jameshadfield merged commit 207f9a3 into master Feb 26, 2020

jameshadfield deleted the add-to-alignment branch February 26, 2020 04:54

CameronDevine mentioned this pull request Mar 7, 2020

Alignment Name with Spaces Breaks mafft Call #388

Closed
