Reconciling contigs

Ryan Wick edited this page Jun 19, 2020 · 48 revisions

Requirements

Before running this step, you'll need to have completed the previous one (clustering contigs). I.e. you should have a Trycycler output directory (which I'll assume is called trycycler) with subdirectories for each of your good clusters, each of which contains a 1_contigs subdirectory. For convenience, you should probably also have deleted any subdirectories for bad clusters.

You'll also need the long-read set you used in the previous step, which I'll assume is in reads.fastq.

Concept

Now that you have contig clusters to work on, Trycycler needs to reconcile the cluster contigs. This step is done per-cluster, so if your assemblies yielded three good contig clusters (e.g. one chromosome and two plasmids) then you will carry out this step on each of them.

Trycycler reconcile will:

  • Perform an initial check to make sure the contigs look sufficiently similar to each other:
    • relative lengths must be fairly close (e.g. one contig can't be twice as long as another)
    • Mash distances must be quite small
  • Ensure that all contig sequences are on the same strand
  • If the replicon is circular:
    • Fix any circularisation issues (i.e. add/remove sequence at each contig's start/end as necessary)
    • Rotate the contigs to a common starting sequence
  • Perform a final check to make sure the normalised/circularised contigs are sufficiently similar to each other for the next step (multiple sequence alignment)

If all goes well, this step will run to completion on its own. If it does not, you'll need to intervene manually, for example by deleting a contig sequence that is causing problems.

Running Trycycler reconcile

The Trycycler reconcile command must be run separately for each of your good clusters. Assuming your good clusters are numbers 1, 2 and 3, these are the commands you would run:

trycycler reconcile --reads reads.fastq --cluster_dir trycycler/cluster_001
trycycler reconcile --reads reads.fastq --cluster_dir trycycler/cluster_002
trycycler reconcile --reads reads.fastq --cluster_dir trycycler/cluster_003
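
If you have already deleted the subdirectories for bad clusters, the per-cluster commands can be collapsed into a loop. The following is a minimal sketch rather than anything Trycycler provides: reconcile_all is a hypothetical helper, and it assumes every remaining cluster_* directory under the output directory is a good cluster. It carries on after a failure so you can see which clusters need manual intervention.

```shell
# Hypothetical helper (not part of Trycycler): run reconcile on every
# cluster directory that still exists under the Trycycler output directory.
reconcile_all() {
    reads=$1
    out_dir=$2
    for cluster_dir in "$out_dir"/cluster_*; do
        [ -d "$cluster_dir" ] || continue   # the glob matched nothing
        if ! trycycler reconcile --reads "$reads" --cluster_dir "$cluster_dir"; then
            echo "reconcile failed for $cluster_dir" >&2
        fi
    done
}

# Assumes the read file and output directory names used on this page.
reconcile_all reads.fastq trycycler
```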

Trycycler reconcile should take a few minutes to complete, less for small contigs. Longer sequences and larger numbers of sequences will be slower, so a large bacterial chromosome with a lot of input contigs may take considerably longer.

Settings

General settings:

  • --linear: use this option if your input contigs are not circular. It will disable the circularisation-correction steps in Trycycler reconcile.
  • --threads: this is how many threads Trycycler will use for read alignment. It only affects speed, so you'll probably want to use as many threads as you have available.
  • --verbose: use this flag to display extra output. Only really there for debugging purposes.

Initial check:

  • --max_mash_dist: if any of the sequences have a pairwise Mash distance of more than this (default = 0.02), then the contigs will fail the initial check.
  • --max_length_diff: if any of the sequences have a pairwise relative length factor of more than this, then the contigs will fail the initial check. For example, if set to 1.1 (the default), then no contig can be more than 10% longer than any other.
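
To get a feel for the length criterion before running anything, you can do the arithmetic yourself. The sketch below reflects my reading of the description above (the worst pairwise factor is simply longest length divided by shortest length); length_check is a hypothetical helper, and Trycycler's own computation may differ in detail.

```shell
# Hypothetical helper: given a maximum length factor and a list of contig
# lengths (bp), print "pass" if longest/shortest is within the factor.
length_check() {
    max_factor=$1
    shift
    printf '%s\n' "$@" | awk -v f="$max_factor" '
        NR == 1 || $1 < min { min = $1 }
        NR == 1 || $1 > max { max = $1 }
        END { if (max / min <= f) print "pass"; else print "fail" }'
}

length_check 1.1 5000000 5020000 5100000   # factor 1.02, prints: pass
length_check 1.1 50000 56000               # factor 1.12, prints: fail
```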

Circularisation:

  • --max_add_seq and --max_add_seq_percent: these control how much sequence Trycycler is willing to add to a contig to circularise it. For example, if they are set to 1000 and 5 (the defaults), then Trycycler will be willing to add up to 1000 bp or 5% of a contig's length (whichever is smaller) to circularise it. Any contig which requires more than 1000 bp or 5% of its length added to circularise will cause Trycycler reconcile to fail.
  • --max_trim_seq and --max_trim_seq_percent: these control how much sequence Trycycler is willing to remove from a contig to circularise it. For example, if they are set to 50000 and 10 (the defaults), then Trycycler will be willing to remove up to 50000 bp or 10% of a contig's length (whichever is smaller) to circularise it. Any contig which requires more than 50000 bp or 10% of its length removed to circularise will cause Trycycler reconcile to fail.
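
Because each limit is the smaller of an absolute value and a percentage, the effective cap depends on contig size: the absolute value dominates for chromosomes and the percentage dominates for small plasmids. A small worked sketch of that "whichever is smaller" rule follows; effective_limit is a hypothetical helper, and the contig lengths are made-up examples.

```shell
# Hypothetical helper: effective cap (bp) = smaller of the absolute limit
# and the percentage limit applied to the contig length.
# Arguments: contig length (bp), absolute limit (bp), percentage limit.
effective_limit() {
    awk -v len="$1" -v cap="$2" -v pct="$3" 'BEGIN {
        by_pct = len * pct / 100
        printf "%d\n", (by_pct < cap ? by_pct : cap)
    }'
}

effective_limit 5000000 1000 5     # add limit, 5 Mbp chromosome: prints 1000
effective_limit 10000 1000 5       # add limit, 10 kbp plasmid:   prints 500
effective_limit 10000 50000 10     # trim limit, same plasmid:    prints 1000
```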

Final check:

  • --min_identity: if any of the sequences have a pairwise global alignment percent identity of less than this (default = 98), then the contigs will fail the final check.
  • --max_indel_size: if any of the sequences have a pairwise alignment indel size of more than this (default = 250), then the contigs will fail the final check.

Output

When finished, Trycycler reconcile will make 2_all_seqs.fasta in the cluster directory, a multi-FASTA file containing each of the contigs ready for multiple sequence alignment.
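
A quick way to confirm the output exists and to see how many contigs made it through is to count FASTA header lines. This is just a convenience sketch: count_seqs is a hypothetical helper, and the cluster path is the example one used on this page.

```shell
# Hypothetical helper: count the sequences in a FASTA file by counting
# header lines (those starting with ">").
count_seqs() {
    grep -c '^>' "$1"
}

fasta=trycycler/cluster_001/2_all_seqs.fasta
if [ -f "$fasta" ]; then
    echo "$fasta contains $(count_seqs "$fasta") contigs"
else
    echo "$fasta not found: reconcile has not completed for this cluster" >&2
fi
```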

Manual intervention

Trycycler reconcile may not complete successfully, in which case you will have to intervene and run it again. Usually this means simply excluding whichever contig is causing the problem (by deleting its FASTA file from the 1_contigs directory) and running trycycler reconcile again with the remaining contigs. E.g. if you have eight input contigs for a cluster and one is preventing trycycler reconcile from completing (for one of the reasons listed below) you can just delete it and run trycycler reconcile with the remaining seven contigs.
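
The delete-and-retry cycle can be wrapped up as below. This is a sketch under assumptions: exclude_and_rerun is a hypothetical helper, and the contig file name you pass is whatever the problem contig is called in your 1_contigs directory. Since rm is irreversible, you may want to copy the file elsewhere first.

```shell
# Hypothetical helper: delete one contig's FASTA file from 1_contigs and
# re-run reconcile on the remaining contigs.
# Arguments: read file, cluster directory, contig file name.
exclude_and_rerun() {
    reads=$1
    cluster_dir=$2
    contig_file=$3
    rm -- "$cluster_dir/1_contigs/$contig_file" || return 1
    trycycler reconcile --reads "$reads" --cluster_dir "$cluster_dir"
}

# Example call (the contig file name here is a placeholder):
# exclude_and_rerun reads.fastq trycycler/cluster_001 PROBLEM_CONTIG.fasta
```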

Don't let this worry you: throwing out troublesome contigs at this step is normal. One of the reasons you prepared so many redundant assemblies was so you could have enough contigs in your clusters that losing one or two is not a problem.

Trycycler reconcile can fail for a number of reasons, which are listed below along with possible actions to take:

Failed initial check

Ideally all the contig sequences will be approximately the same length. Some length variation is normal, due to the accumulation of small indels and/or circularisation problems (missing or duplicated sequence at the starts/ends). Trycycler reconcile will tolerate a bit of length difference, but if there is too much, it will refuse to continue. Similarly, if any of the pairwise Mash distances between contigs is too large, Trycycler reconcile will refuse to continue.

If this happens, the simplest option is to delete the offending contig(s) and run Trycycler reconcile again. This option is good if you have lots of input contigs so the loss of one or two won't be a problem. Alternatively, you can try to repair the offending contig(s). For example, if one contig is too long, you can try to fix it (remove the excess sequence) and try again. See manually fixing overlap for suggestions on how this might work.

Length problems are especially common with Canu assemblies and small plasmids. In small plasmids, a modest absolute amount of overlap can create a very large amount of relative overlap, and some assemblers will occasionally duplicate an entire small plasmid sequence in a single contig, making it twice as long as it should be.

Unable to circularise

Trycycler reconcile will try to circularise each contig using the other contigs as a reference. It does this by looking for the contig's start and end sequences in the other contigs and then deciding if it needs to add or remove sequence to make it circular. If Trycycler reconcile cannot find the start/end sequences in other contigs, or if it finds multiple instances of those sequences, it will not be able to continue.

Failed final check

Trycycler reconcile performs a global alignment between all pairs of final sequences at the end of its execution. If any pair has a particularly bad pairwise identity or large insertion/deletion, that indicates a significant problem (e.g. a misassembly) with one of the sequences.

If this happens, you have two options. If one sequence in particular is causing the problem, it is probably best to delete that sequence and run Trycycler reconcile again. This is especially true if you have a good number of input contigs (e.g. six or more). Alternatively, if a large number of sequences are failing and/or you don't have very many input contigs, you can decrease the --min_identity and increase the --max_indel_size parameters to make Trycycler reconcile's final check more tolerant.

Ideal number of contigs

You should aim to have around four to eight contigs left after running Trycycler reconcile. Fewer than that (two or three) will not provide as many variants for the next steps and may affect your consensus sequence quality. More than that (nine or more) is excessive and probably won't be of any benefit.

If you have too few contigs for your cluster, you might want to consider going back to the start and generating more input assemblies.

If you have too many contigs, you can delete some of the worst ones. Use Trycycler reconcile's final check to guide you: delete the contigs with the lowest identities and largest indels relative to the other contigs.

For example, consider a Trycycler reconcile run where the final results look like this:

Pairwise identities:
  A_tig00000001:  100.00%  100.00%   99.99%   99.93%   99.99%  100.00%   99.89%
  B_contig_1:     100.00%  100.00%   99.99%   99.92%   99.99%   99.99%   99.89%
  C_utg000001c:    99.99%   99.99%  100.00%   99.92%   99.99%   99.99%   99.88%
  D_bctg00000000:  99.93%   99.92%   99.92%  100.00%   99.92%   99.92%   99.82%
  E_Utg0:          99.99%   99.99%   99.99%   99.92%  100.00%   99.99%   99.88%
  F_ctg1:         100.00%   99.99%   99.99%   99.92%   99.99%  100.00%   99.89%
  G_0:             99.89%   99.89%   99.88%   99.82%   99.88%   99.89%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001:   0   1   4  21   9   3  40
  B_contig_1:      1   0   4  21   9   2  40
  C_utg000001c:    4   4   0  21   9   4  40
  D_bctg00000000: 21  21  21   0  21  21  40
  E_Utg0:          9   9   9  21   0   9  40
  F_ctg1:          3   2   4  21   9   0  40
  G_0:            40  40  40  40  40  40   0

Even though all sequences have technically passed (i.e. haven't failed the --min_identity and --max_indel_size settings), two of them (D_bctg00000000 and G_0) are clearly worse than the rest. I would therefore exclude those two and run Trycycler reconcile again with the remaining five contigs.
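
With more contigs the matrix gets harder to read by eye, so it can help to rank contigs by their mean identity to the others. The following sketch does that for the pairwise identity rows above. It is not a Trycycler feature: rank_by_identity is a hypothetical helper, and it assumes you paste in only the matrix rows (no header line, no blank lines) in the same order as the columns.

```shell
# Hypothetical helper: read pairwise-identity rows on stdin and print each
# contig with its mean identity to the other contigs, worst first.
rank_by_identity() {
    awk '{
        name = $1
        sub(/:$/, "", name)
        sum = 0
        for (i = 2; i <= NF; i++) {
            if (i == NR + 1) continue    # skip the self-comparison (diagonal)
            v = $i
            sub(/%$/, "", v)
            sum += v
        }
        printf "%s %.3f\n", name, sum / (NF - 2)
    }' | LC_ALL=C sort -k2,2n
}

# The two worst contigs from the example above (G_0, then D_bctg00000000):
rank_by_identity <<'EOF' | head -n 2
  A_tig00000001:  100.00%  100.00%   99.99%   99.93%   99.99%  100.00%   99.89%
  B_contig_1:     100.00%  100.00%   99.99%   99.92%   99.99%   99.99%   99.89%
  C_utg000001c:    99.99%   99.99%  100.00%   99.92%   99.99%   99.99%   99.88%
  D_bctg00000000:  99.93%   99.92%   99.92%  100.00%   99.92%   99.92%   99.82%
  E_Utg0:          99.99%   99.99%   99.99%   99.92%  100.00%   99.99%   99.88%
  F_ctg1:         100.00%   99.99%   99.99%   99.92%   99.99%  100.00%   99.89%
  G_0:             99.89%   99.89%   99.88%   99.82%   99.88%   99.89%  100.00%
EOF
```

Mean identity is only one possible summary; the maximum indel sizes should be checked alongside it before deciding what to exclude.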
