Reconciling contigs
Before running this step, you'll need to have completed the previous one (clustering contigs). That is, you should have a Trycycler output directory (which I'll assume is called trycycler) with subdirectories for each of your good clusters, each of which contains a 1_contigs subdirectory.
You'll also need the long-read set you used in the previous step, which I'll assume is in reads.fastq.gz.
Now that you have contig clusters to work on, Trycycler needs to reconcile the cluster contigs. This step is done per-cluster, so if your assemblies yielded three good contig clusters (e.g. one chromosome and two plasmids) then you will carry out this step on each of them.
Trycycler reconcile will:
- Perform an initial check to make sure the contigs look sufficiently similar to each other:
  - Relative lengths must be fairly close (e.g. one contig can't be twice as long as another)
  - Mash distances must be quite small
- Ensure that all contig sequences are on the same strand
- If the replicon is circular:
  - Fix any circularisation issues (i.e. add/remove sequence at each contig's start/end as necessary)
  - Rotate the contigs to a common starting sequence
- Perform a final check to make sure the normalised/circularised contigs are sufficiently similar to each other for the next step (multiple sequence alignment)
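The rotation step in the list above can be pictured with a minimal Python sketch. This is not Trycycler's implementation: the function name `rotate_to_start` is hypothetical, the starting sequence is supplied by hand, and the sketch uses exact matching on one strand only. It simply shows what rotating circular contigs to a common start means.

```python
def rotate_to_start(seq: str, start: str) -> str:
    """Rotate a circular sequence so that it begins with `start`.

    Hypothetical helper for illustration only (exact match, one strand).
    """
    if not start or len(start) > len(seq):
        raise ValueError("start must be non-empty and no longer than seq")
    # Append the first len(start)-1 bases so that a match spanning the
    # end/start junction of the circle can still be found.
    circular = seq + seq[: len(start) - 1]
    idx = circular.find(start)
    if idx == -1:
        raise ValueError("start sequence not found in contig")
    return seq[idx:] + seq[:idx]


# Two toy representations of the same circular molecule with different start points.
contig_a = "GATTACAGGC"
contig_b = "CAGGCGATTA"
print(rotate_to_start(contig_a, "TTAC"))
print(rotate_to_start(contig_b, "TTAC"))
```

After rotation, both toy contigs are the same string, which is what makes a later multiple sequence alignment straightforward.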
If all goes well, this step will run to completion on its own. However, it may fail, in which case you'll need to intervene manually, for example by deleting a contig sequence that is causing problems.
The Trycycler reconcile command must be run separately for each of your good clusters. Assuming your good clusters are numbers 1, 7 and 8, these are the commands you would run:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_007
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_008
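If you have many clusters, the commands above can be generated rather than typed by hand. The following is a small Python sketch, assuming the directory and read file names used on this page; the helper name `reconcile_command` is hypothetical, and the only flags it uses are ones described on this page. It prints the commands instead of running them.

```python
import shlex


def reconcile_command(cluster_number, reads="reads.fastq.gz", out_dir="trycycler",
                      threads=None, linear=False):
    """Build the argument list for one `trycycler reconcile` run."""
    cluster_dir = f"{out_dir}/cluster_{cluster_number:03d}"
    cmd = ["trycycler", "reconcile", "--reads", reads, "--cluster_dir", cluster_dir]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if linear:
        cmd.append("--linear")
    return cmd


good_clusters = [1, 7, 8]
for number in good_clusters:
    # Printed only; to execute, pass the list to subprocess.run(cmd, check=True)
    # on a machine where Trycycler is installed.
    print(shlex.join(reconcile_command(number)))
```

Running each cluster as a separate command keeps a failure in one cluster from hiding the results of the others.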
Trycycler reconcile should take a few minutes to complete, less for small contigs. Longer sequences and larger numbers of sequences will be slower, so a large bacterial chromosome with a lot of input contigs might take
- --linear: use this option if your input contigs are not circular. It will disable the circularisation-correction steps in Trycycler reconcile.
- --threads: this is how many threads Trycycler will use for read alignment. It only affects speed, so you'll probably want to use as many threads as you have available.
- --verbose: use this flag to display extra output. It is only really there for debugging purposes.

- --max_mash_dist: if any of the sequences have a pairwise Mash distance of more than this (default = 0.02), then the contigs will fail the initial check.
- --max_length_diff: if any of the sequences have a pairwise relative length factor of more than this, then the contigs will fail the initial check. For example, if set to 1.1 (the default), then no contig can be more than 10% longer than any other.

- --max_add_seq and --max_add_seq_percent: these control how much sequence Trycycler is willing to add to a contig to circularise it. For example, if they are set to 1000 and 5 (the defaults), then Trycycler will be willing to add up to 1000 bp or 5% of a contig's length (whichever is smaller) to circularise it. Any contig which requires more than 1000 bp or 5% of its length added to circularise will cause Trycycler reconcile to fail.
- --max_trim_seq and --max_trim_seq_percent: these control how much sequence Trycycler is willing to remove from a contig to circularise it. For example, if they are set to 50000 and 10 (the defaults), then Trycycler will be willing to remove up to 50000 bp or 10% of a contig's length (whichever is smaller) to circularise it. Any contig which requires more than 50000 bp or 10% of its length removed to circularise will cause Trycycler reconcile to fail.

- --min_identity: if any of the sequences have a pairwise global alignment percent identity of less than this (default = 98), then the contigs will fail the final check.
- --max_indel_size: if any of the sequences have a pairwise alignment indel size of more than this (default = 250), then the contigs will fail the final check.
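To make the "whichever is smaller" rule for the add/trim options concrete, here is a short Python sketch of the arithmetic. It follows the wording above and is not taken from Trycycler's source; the function name `circularisation_limits` and the two contig lengths are assumptions for illustration, and Trycycler's own rounding may differ.

```python
def circularisation_limits(contig_length, max_add_seq=1000, max_add_seq_percent=5.0,
                           max_trim_seq=50000, max_trim_seq_percent=10.0):
    """Return (max bases that may be added, max bases that may be trimmed).

    Each limit is the smaller of the absolute value and the percentage of
    the contig's length.
    """
    add_limit = min(max_add_seq, contig_length * max_add_seq_percent / 100)
    trim_limit = min(max_trim_seq, contig_length * max_trim_seq_percent / 100)
    return add_limit, trim_limit


# Hypothetical contig lengths: a small plasmid and a bacterial chromosome.
for name, length in [("small plasmid", 10_000), ("chromosome", 5_000_000)]:
    add_limit, trim_limit = circularisation_limits(length)
    print(f"{name} ({length} bp): add <= {add_limit:.0f} bp, trim <= {trim_limit:.0f} bp")
```

For small plasmids the percentage is the binding limit, while for chromosomes the absolute base-pair value is.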
When finished, Trycycler reconcile will make 2_all_seqs.fasta in the cluster directory, a multi-FASTA file containing each of the contigs ready for multiple sequence alignment.
Trycycler reconcile may not complete successfully, in which case you will have to intervene and run it again. Usually this means simply excluding whichever contig is causing the problem (by deleting its FASTA file from the 1_contigs directory) and running trycycler reconcile again with the remaining contigs. For example, if you have eight input contigs for a cluster and one is preventing trycycler reconcile from completing (for one of the reasons listed below), you can just delete it and run trycycler reconcile with the remaining seven contigs.
Don't let this worry you: throwing out troublesome contigs at this step is normal. One of the reasons you prepared so many redundant assemblies was so you could have enough contigs in your clusters that losing one or two is not a problem.
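If you would rather keep a copy of an excluded contig than delete it outright, you can move it out of the 1_contigs directory instead. The Python sketch below is one way to do that; the function name `set_aside_contig`, the destination directory and the file name in the usage comment are all hypothetical, and the only thing it relies on from this page is that contigs live as FASTA files in 1_contigs.

```python
import shutil
from pathlib import Path


def set_aside_contig(cluster_dir, contig_filename, excluded_dir):
    """Move one contig FASTA out of 1_contigs so it is no longer used."""
    source = Path(cluster_dir) / "1_contigs" / contig_filename
    if not source.is_file():
        raise FileNotFoundError(source)
    excluded_dir = Path(excluded_dir)
    excluded_dir.mkdir(parents=True, exist_ok=True)
    destination = excluded_dir / contig_filename
    shutil.move(str(source), str(destination))
    return destination


# Example call (hypothetical file and directory names):
# set_aside_contig("trycycler/cluster_001", "D_contig_2.fasta", "excluded/cluster_001")
```

After moving the file, run trycycler reconcile again on the cluster as usual.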
The reasons Trycycler reconcile might fail are listed below, along with possible actions to take:
Ideally all the contig sequences will be approximately the same length. Some length variation is normal, due to the accumulation of small indels and/or circularisation problems (missing or duplicated sequence at the starts/ends). Trycycler reconcile will tolerate a bit of length difference, but if there is too much, it will refuse to continue. Similarly, if any of the pairwise Mash distances between contigs is too large, Trycycler reconcile will refuse to continue.
If this happens, the simplest option is to delete the offending contig(s) and run Trycycler reconcile again. This option is good if you have lots of input contigs, so the loss of one or two won't be a problem. Alternatively, you can try to repair the offending contig(s). For example, if one contig is too long, you can try to fix it (remove the excess sequence) and try again. See manually fixing overlap for suggestions on how this might work.
Length problems are especially common with Canu assemblies and small plasmids. In small plasmids, a modest absolute amount of overlap can create a very large amount of relative overlap, and some assemblers will occasionally duplicate an entire small plasmid sequence in a single contig, making it twice as long as it should be.
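To see which contigs are likely to trip the length part of the initial check, you can compare lengths pairwise before running reconcile. The Python sketch below applies the 1.1 length factor described for --max_length_diff; the function name `length_outliers` and the contig lengths are made up for illustration, and the sketch does not compute Mash distances.

```python
from itertools import combinations


def length_outliers(lengths, max_length_diff=1.1):
    """Return contig pairs whose longer/shorter length ratio exceeds the limit."""
    problems = []
    for (name_a, len_a), (name_b, len_b) in combinations(lengths.items(), 2):
        ratio = max(len_a, len_b) / min(len_a, len_b)
        if ratio > max_length_diff:
            problems.append((name_a, name_b, round(ratio, 3)))
    return problems


# Hypothetical lengths for a small plasmid cluster; contig C contains the
# plasmid sequence roughly twice over.
plasmid_lengths = {"A": 5000, "B": 5040, "C": 10020}
for name_a, name_b, ratio in length_outliers(plasmid_lengths):
    print(f"{name_a} vs {name_b}: length ratio {ratio}")
```

A contig that appears in every flagged pair, like C here, is the obvious candidate to delete or repair.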
Trycycler reconcile will try to circularise each contig using the other contigs as a reference. It does this by looking for the contig's start and end sequences in the other contigs and then deciding if it needs to add or remove sequence to make it circular. If Trycycler reconcile cannot find the start/end sequences in other contigs, or if it finds multiple instances of those sequences, it will not be able to continue.
Trycycler reconcile performs a global alignment between all pairs of final sequences at the end of its execution. If any pair has a particularly bad pairwise identity or large insertion/deletion, that indicates a significant problem (e.g. a misassembly) with one of the sequences.
If this happens, you have two options. If one sequence in particular is causing the problem, it is probably best to delete that sequence and run Trycycler reconcile again. This is especially true if you have a good number of input contigs (e.g. six or more). Alternatively, if a large number of sequences are failing and/or you don't have very many input contigs, you can decrease the --min_identity and increase the --max_indel_size parameters to make Trycycler reconcile's final check more tolerant.
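The effect of loosening the final check can be sketched in a few lines of Python. This only mirrors the pass/fail rule described for --min_identity and --max_indel_size and is not Trycycler's code; the function name `failing_pairs`, the pairwise values and the relaxed thresholds (97 and 500) are hypothetical.

```python
def failing_pairs(pairwise, min_identity=98.0, max_indel_size=250):
    """Return the contig pairs that would fail the given thresholds.

    `pairwise` maps (contig_a, contig_b) to (percent identity, largest indel).
    """
    return sorted(
        pair
        for pair, (identity, indel) in pairwise.items()
        if identity < min_identity or indel > max_indel_size
    )


# Hypothetical results for a three-contig cluster.
results = {
    ("A", "B"): (99.95, 12),
    ("A", "C"): (97.60, 310),
    ("B", "C"): (97.55, 310),
}
print(failing_pairs(results))  # with the default thresholds
print(failing_pairs(results, min_identity=97.0, max_indel_size=500))  # relaxed
```

In this toy case every failing pair involves contig C, so deleting C would be the first thing to try; relaxing the thresholds is the fallback when no single contig stands out.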
You should aim to have around four to six contigs left after running Trycycler reconcile. Fewer than that (two or three) will not provide as many variants for the next steps and may affect your consensus sequence quality. More than that (nine or more) is excessive and probably won't be of any benefit.
If you have too few contigs for your cluster, you might want to consider going back to the start and generating more input assemblies.
If you have too many contigs, you can delete some of the worst ones. Use Trycycler reconcile's final check to guide you: delete the contigs with the lowest identities and largest indels relative to the other contigs.
For example, consider a Trycycler reconcile run where the final results look like this:
Pairwise identities:
A_tig00000001: 100.00% 100.00% 99.99% 99.93% 99.99% 100.00% 99.89%
B_contig_1: 100.00% 100.00% 99.99% 99.92% 99.99% 99.99% 99.89%
C_utg000001c: 99.99% 99.99% 100.00% 99.92% 99.99% 99.99% 99.88%
D_bctg00000000: 99.93% 99.92% 99.92% 100.00% 99.92% 99.92% 99.82%
E_Utg0: 99.99% 99.99% 99.99% 99.92% 100.00% 99.99% 99.88%
F_ctg1: 100.00% 99.99% 99.99% 99.92% 99.99% 100.00% 99.89%
G_0: 99.89% 99.89% 99.88% 99.82% 99.88% 99.89% 100.00%
Maximum insertion/deletion sizes:
A_tig00000001: 0 1 4 21 9 3 40
B_contig_1: 1 0 4 21 9 2 40
C_utg000001c: 4 4 0 21 9 4 40
D_bctg00000000: 21 21 21 0 21 21 40
E_Utg0: 9 9 9 21 0 9 40
F_ctg1: 3 2 4 21 9 0 40
G_0: 40 40 40 40 40 40 0
Even though all sequences have technically passed (i.e. haven't failed the --min_identity and --max_indel_size settings), two of them (D_bctg00000000 and G_0) are clearly worse than the rest. I would therefore exclude those two and run Trycycler reconcile again with the remaining five contigs.
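With more contigs, picking out the weakest ones by eye gets harder, so it can help to rank them. The Python sketch below parses the pairwise identity block from the example above and sorts contigs by their mean identity to the others. The ranking rule and the function name `mean_identity_to_others` are my own illustration, not something Trycycler reports.

```python
identity_report = """\
A_tig00000001: 100.00% 100.00% 99.99% 99.93% 99.99% 100.00% 99.89%
B_contig_1: 100.00% 100.00% 99.99% 99.92% 99.99% 99.99% 99.89%
C_utg000001c: 99.99% 99.99% 100.00% 99.92% 99.99% 99.99% 99.88%
D_bctg00000000: 99.93% 99.92% 99.92% 100.00% 99.92% 99.92% 99.82%
E_Utg0: 99.99% 99.99% 99.99% 99.92% 100.00% 99.99% 99.88%
F_ctg1: 100.00% 99.99% 99.99% 99.92% 99.99% 100.00% 99.89%
G_0: 99.89% 99.89% 99.88% 99.82% 99.88% 99.89% 100.00%
"""


def mean_identity_to_others(report):
    """Map each contig name to its mean percent identity against the others."""
    rows = []
    for line in report.strip().splitlines():
        name, values = line.split(":")
        rows.append((name.strip(), [float(v.rstrip("%")) for v in values.split()]))
    means = {}
    for i, (name, values) in enumerate(rows):
        others = values[:i] + values[i + 1:]  # drop the self-comparison
        means[name] = sum(others) / len(others)
    return means


means = mean_identity_to_others(identity_report)
# Lowest mean identity first, i.e. the most likely candidates for exclusion.
for name in sorted(means, key=means.get):
    print(f"{name}: {means[name]:.3f}%")
```

The same approach works for the insertion/deletion block, sorting from largest to smallest instead.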