Susann Vorberg edited this page Jun 20, 2018 · 6 revisions

With CCMgen it is possible to generate a synthetic multiple sequence alignment from a Markov Random Field probability model specified by coupling potentials and a user-specified phylogenetic tree.

ccmgen.py [options] rawfile outalnfile

rawfile should be a MessagePack-formatted raw coupling potential file as generated by the -b option in CCMpredPy.

outalnfile is the filename where the sampled alignment should be written, in the format specified by --aln-format (default: FASTA).

Options

  • --alnfile <aln_file>: A reference alignment file used to determine the diversity (effective number of sequences, Neff) and the number of sequences of the synthetic MSA (overrides the --num-sequences and --mutation-rate-neff settings).
  • --num-sequences <nseq>: Specify the number of sequences in the synthetic MSA [default: 1024]
  • --max-gap-pos <max_gap_pos>: Ignore alignment positions with >max_gap_pos percent gaps in aln_file. [default: 100 (no removal of gaps)]
  • --max-gap-seq <max_gap_seq>: Remove sequences with >max_gap_seq percent gaps in aln_file. [default: 100 (no removal of sequences)]
  • --aln-format <format>: Parse and write all subsequent alignment files specified on the command line in another format. Supports all BioPython Bio.SeqIO file formats plus psicov. [default: fasta]
  • --num-threads <num_threads>: Specify the number of threads. [default: 1]
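The two gap thresholds can be read as plain percentage filters on the reference alignment. The sketch below is an assumption about their semantics based on the option descriptions above, not CCMgen's actual code; the helper names `gap_percent` and `filter_gaps`, and the choice to filter sequences before positions, are hypothetical.

```python
def gap_percent(chars):
    """Percentage of gap characters ('-') in a string or list of characters."""
    return 100.0 * sum(c == "-" for c in chars) / len(chars)


def filter_gaps(seqs, max_gap_pos=100, max_gap_seq=100):
    """Drop sequences, then columns, whose gap percentage exceeds the thresholds.

    Assumption: sequences are filtered before columns; the real tool may differ.
    """
    kept = [s for s in seqs if gap_percent(s) <= max_gap_seq]
    if not kept:
        return []
    ncol = len(kept[0])
    keep_cols = [
        j for j in range(ncol)
        if gap_percent([s[j] for s in kept]) <= max_gap_pos
    ]
    return ["".join(s[j] for j in keep_cols) for s in kept]


aln = ["AC-D", "A--D", "----", "ACGD"]
print(filter_gaps(aln))                                   # defaults: nothing is removed
print(filter_gaps(aln, max_gap_seq=50))                   # drops the all-gap sequence
print(filter_gaps(aln, max_gap_pos=50, max_gap_seq=50))   # also drops the gappy third column
```

With the default value of 100 for both thresholds, no sequence or position can exceed the limit, which matches the "no removal" defaults stated above.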

Initial Sequence Options

  • MRF-generated (--seq0-mrf <nmut>): Start out with an all-alanine sequence and use the MRF model to evolve the sequence for nmut Gibbs steps [default: nmut=500].
  • User-specified (--seq0-file <seq_file>): Provide the initial sequence from a file (useful if you have e.g. an ancestral sequence reconstruction). Make sure that the sequence identifier matches the name of the root node in the sampling phylogeny.

The options --seq0-file and --seq0-mrf are mutually exclusive.
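To make the --seq0-mrf idea concrete, here is a toy sketch of Gibbs sampling from a pairwise model, starting at an all-alanine sequence. Everything in it is an assumption for illustration: the single potentials `v` and couplings `w` are random toy values over a 4-letter alphabet rather than the contents of a real rawfile, one "Gibbs step" is taken to mean resampling one randomly chosen position (CCMgen may define a step differently), and `gibbs_evolve` is a hypothetical name.

```python
import math
import random

ALPHABET = "ACDE"  # toy alphabet; index 0 plays the role of alanine


def gibbs_evolve(v, w, nmut, rng):
    """Evolve an all-'A' sequence for nmut single-position Gibbs updates.

    v[i][a]       : single potential for state a at position i
    w[i][j][a][b] : coupling potential for states a, b at positions i, j
    """
    L, q = len(v), len(ALPHABET)
    x = [0] * L  # index 0 == 'A' at every position
    for _ in range(nmut):
        i = rng.randrange(L)
        # Unnormalized conditional log-probabilities of each state at
        # position i, given the current states at all other positions.
        logits = [
            v[i][a] + sum(w[i][j][a][x[j]] for j in range(L) if j != i)
            for a in range(q)
        ]
        m = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - m) for l in logits]
        x[i] = rng.choices(range(q), weights=weights)[0]
    return "".join(ALPHABET[a] for a in x)


# Build small random toy potentials with symmetric couplings.
rng = random.Random(0)
L, q = 6, len(ALPHABET)
v = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(L)]
w = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
for i in range(L):
    for j in range(i + 1, L):
        for a in range(q):
            for b in range(q):
                w[i][j][a][b] = w[j][i][b][a] = rng.gauss(0, 0.5)

seq0 = gibbs_evolve(v, w, nmut=500, rng=rng)
print(seq0)
```

The more steps are taken, the less the result depends on the all-alanine starting point.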

Mutation Rate Options

  • User-specified (--mutation-rate <rate>): Give a user-specified mutation rate, measured in number of substitutions per unit of evolutionary distance on the phylogenetic tree.
  • Target Neff (--mutation-rate-neff [<neff>]): Set the mutation rate to approximately hit a target number of effective sequences (Neff, calculated as in the HHsuite package). Without specifying <neff>, the value for Neff will be determined from <aln_file> (requires --alnfile option).

The options --mutation-rate and --mutation-rate-neff are mutually exclusive, but exactly one of them must be given.

Phylogenetic Tree Options

  • User-specified (--tree-newick <tree_file>): Evolve the sequences according to an evolutionary tree, e.g. from a phylogenetic reconstruction program.
  • Binary tree (--tree-binary): A binary tree with equally distributed branch lengths.
  • 'Star-shaped' tree (--tree-star): A tree where all leaf nodes are direct descendants of the root node.
  • MCMC sample (--mcmc-sampling): Generate sequences as a simple MCMC sample without following a tree topology.

By default, both the binary and the star-shaped tree are generated with a total evolutionary depth of 1. You can tune the diversity of the sampled alignment through the mutation rate options.
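For intuition, the two generated topologies can be written out as Newick strings. The sketch below is an assumption about what such trees look like, not CCMgen's tree builder: the leaf names `s0, s1, ...`, the `root` label, and the reading of "equally distributed branch lengths" as depth divided by the number of levels are all hypothetical.

```python
def star_newick(n_leaves, depth=1.0):
    """All leaves hang directly off the root, each at distance `depth`."""
    leaves = ",".join(f"s{i}:{depth}" for i in range(n_leaves))
    return f"({leaves})root;"


def binary_newick(n_levels, depth=1.0):
    """Balanced binary tree with 2**n_levels leaves and equal branch lengths.

    Each branch has length depth / n_levels, so every root-to-leaf path
    sums to `depth`.
    """
    branch = depth / n_levels
    counter = iter(range(2 ** n_levels))

    def build(level):
        if level == 0:
            return f"s{next(counter)}"
        left, right = build(level - 1), build(level - 1)
        return f"({left}:{branch},{right}:{branch})"

    return build(n_levels) + "root;"


print(star_newick(3))
print(binary_newick(2))
```

In both cases every leaf sits at the same distance from the root, so scaling the mutation rate changes the diversity of all sampled sequences uniformly.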

MCMC Sampling Options

  • --mcmc-sample-random-gapped: Sample sequences starting from random sequences. Gap structure of randomly selected input sequences will be copied. Gap positions are not sampled. (requires --alnfile option) [default]
  • --mcmc-sample-random: Sample sequences starting from random sequences.
  • --mcmc-sample-aln: Sample sequences starting from sequences in aln_file. Gap positions are not sampled. (requires --alnfile option)
  • --mcmc-burn-in <mcmc_burn_in>: Number of Gibbs sampling steps to evolve the Markov chain before a sample is obtained.

The options --mcmc-sample-random, --mcmc-sample-random-gapped and --mcmc-sample-aln are mutually exclusive.

Examples

Simple example

Generate sequences using potentials in example/1atzA.braw.gz and write results to data/sampled.fasta. Set the number of sequences in the synthetic alignment to match the number of sequences in example/1atzA.fas and automatically determine a mutation rate so that the diversity of the generated alignment approximately matches that of example/1atzA.fas:

ccmgen --alnfile example/1atzA.fas \
       --tree-star \
       --seq0-mrf 500 \
       --mutation-rate-neff \
       example/1atzA.braw.gz \
       data/sampled.fasta
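After a run like the one above, a quick sanity check is to count the sampled sequences and confirm they all share one alignment length. The minimal FASTA reader below is a sketch that assumes the default FASTA output format; `read_fasta` is a hypothetical helper, and the in-memory demo data stands in for a real output file.

```python
import io


def read_fasta(handle):
    """Return a list of (name, sequence) tuples from a FASTA file handle."""
    records, name, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Demo input; sequences may be wrapped over several lines.
demo = io.StringIO(">seq0\nACD-\nEF\n>seq1\nACDAEF\n")
records = read_fasta(demo)
lengths = {len(seq) for _, seq in records}
print(len(records), "sequences; alignment lengths:", lengths)
```

For a real run, replace the demo handle with `open("data/sampled.fasta")`; a single value in `lengths` indicates a rectangular alignment.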