Copyright (C) 2020-2024 Timothy James Becker
(python 3.10+ is now the preferred platform as of release 0.1.2)
python 3.10+, cython 0.29+, pysam 0.15+, numpy 1.18+ or
python 2.7.11+, cython 0.29+, pysam 0.15+, numpy 1.18+ (python2 is now untested)
python3 -m pip install https://github.com/timothyjamesbecker/somacx/releases/download/0.1.3/somacx-0.1.3.tar.gz
This repo has a Dockerfile that will build somacx for you or you can utilze the pre-built image for testing or cloud use:
docker pull timothyjamesbecker/somacx
python3 generator.py -r ref_fasta -j hg38 -C 11,22,X -o out_dir --model 0.25,0.75 --cov 2,6
-r the reference fasta file such as GRCh38/Hg38, hg19/human_g1k_v37decoy
-j the full JSON control file which includes SV type and sizes rates (including separate linked versus unlinked SNV/MNVs), A gene map such as refGene.txt.gz, encoded weighted gene list or mask files which we absract to the WCU data structure that governs the gain and loss probabilities of genomic regions and finally an optional user specified clonal tree topology. Use presets values: hg38 for Hg38/GRCh38 coordinates (no alternate contigs) or hv37d for the 1000 Genomes Phase 3 reference fasta used in the FusorSV paper. Full json presents can be generated using the included tools. -C the chromosomes from the given reference fasta that will be output (good for testing or producing a small dataset) -o the output directory where the germline genome file (~2X the size of the reference fasta) as well as the somatic genome file (2/cov) will be stored as well as the matching truth VCF files for each. The somatic VCF file includes the clonal tree topology encoded using the PEDIGREE entries in the header.
all options can now be given uniform random ranges (these values will override JSON presets)
--model a parametric control that controls the tree topology 0.0 will simulate Cancer Stem Cell (CSC) and 1.0 will yeild a balanced subclonal topology
--branch the chance of each node to branch every cycle
--decay the chance of each node to die off every cycle
--cov the desired sequencing coverage so that the user can control the approximate noise level threshold for having alleles of the lowest frequency present
The full distribution for a somatic genome generation will contain germline distribution parameters. Together the default
human JSON file called full.hg38.json.gz contains the following data slots (keys in JSON nomenclature):
gene_map - a JSON object used for gene symbol coordinates (has seq, gene and wcu keys for query)
g_var_map- a JSON object used for linked and unlinked germline multi-nucleotide and structural variation (type/size/frequency)
s_var_map - a JSON object used for linke and unlinked somatic germlin multi-nucleotide and structural variation (type/size/frequency)
clone_tree - a JSON object that defines the structure of the simulated clone tree in the somaic genome generation phase
somatic_aneuploidy - the multinomial distribution for each chromosome to generate aneuploidy (abnormal number of chromosomes)
wcus (weighted class units below) - JSON objects with chromosomes as keys with arrays for each chromosome containing: [start, end, length, {class_id:genomic_id}]. class_id are the g1kp3, onco, svmask, mitcp, mmej, nhej, apot identifiers and genomic_id are a gene SYMBOL or other unique genomic region id
germline_[genelist]loss_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for germline loss generation (for types like DEL, INV, TRA, etc)
germline[genelist]gain_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for germline gain generation (for types like DUP, etc)
somatic[genelist]loss_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for somatic loss generation (for types like DEL, INV, TRA, etc)
somatic[genelist]_gain_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for somatic gain generation (for types like DUP, etc)
More details on the individual JSON objects:
allows query by sequence name or by gene symbol. seq: will be each chromosome followed by an array of all genes and start,end coordinates so that all genes for a given chromosome can easily be integrated into gain or loss distribtions. gene: are the offciale gene symbols such as DDX11L1, TP53, etc and will resolve to one or more genomic coordinates.
keys denote the layer of the variation where 1 will be a MNV before layer 2 (SV layer) these SNV will be linked to the SVs that are intersecting, (such as DUP that would also duplicate the SNV) wheras layer 3 MNVs are applied after the SV events which will make them unlinked (such as a DUP with one copy containing a MNV but the other doesn't). In layer 2 (SV layer) there are parameters that govern the distributions of each type detailed further:
DUP l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome). CN - number of copies allowed (2,..,8) default. CNP - probabilty of each copy number. TYPE - DUP subtypes: TANDEM, DISPERSED, INVERTED. TP - probabilty of each type (type distribution)
TRA l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome). TYPE - subtype for traslocations: DEL or INS which are balanced or unbalanced. TP - is the type distribution for TRA
INV l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome). TYPE - is the subtype: PERFECT or COMPLEX. PERFECT have very simple signatures (perfect DNA repair) and reverse complement. COMPLES on the other hand have DEL and DUP posisble on the break ends as well as nove INS inside the INV SV call region. s:p - sets the complex INV subtype DEL/DUP/INS length distribution (50bp,100bp,etc)
DEL l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome).
INS l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome).
clone tree keeps the same of the somatic evolution in tree form but is used practically to spool the FASTA files and generate each clone. The root key keeps track of the genome name (sample name) while the tree key keeps the nested JSON tree structure for each clone which will be nodes. Alive is an array that determines if a specific clone will be generated which is useful for simulation of ancesters that dropped out from selective pressure. freq is an array that keeps track of the number of ancestors which is used to allocate variation (used for allele frequency generation)
Visual web broswer JSON editor is in development
gene_effects, read_utils, variant_utils and sim python modules offer much greater flexibility in simulating mammalian genomes