Skip to content

A pathway based somatic genome generation framework

Notifications You must be signed in to change notification settings

timothyjamesbecker/somacx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitHub All Releases DOI License: GPL v3 Alt text

A complex generative genome modeling framework

Copyright (C) 2020-2024 Timothy James Becker Alt text

Requirements

(python 3.10+ is now the preferred platform as of release 0.1.2)
python 3.10+, cython 0.29+, pysam 0.15+, numpy 1.18+ or
python 2.7.11+, cython 0.29+, pysam 0.15+, numpy 1.18+ (python2 is now untested)

PIP Installation

python3 -m pip install https://github.com/timothyjamesbecker/somacx/releases/download/0.1.3/somacx-0.1.3.tar.gz

Docker Installation

This repo has a Dockerfile that will build somacx for you or you can utilze the pre-built image for testing or cloud use:

docker pull timothyjamesbecker/somacx

Basic Usage

python3 generator.py -r ref_fasta -j hg38 -C 11,22,X -o out_dir --model 0.25,0.75 --cov 2,6

-r the reference fasta file such as GRCh38/Hg38, hg19/human_g1k_v37decoy

-j the full JSON control file which includes SV type and sizes rates (including separate linked versus unlinked SNV/MNVs), A gene map such as refGene.txt.gz, encoded weighted gene list or mask files which we absract to the WCU data structure that governs the gain and loss probabilities of genomic regions and finally an optional user specified clonal tree topology. Use presets values: hg38 for Hg38/GRCh38 coordinates (no alternate contigs) or hv37d for the 1000 Genomes Phase 3 reference fasta used in the FusorSV paper. Full json presents can be generated using the included tools. -C the chromosomes from the given reference fasta that will be output (good for testing or producing a small dataset) -o the output directory where the germline genome file (~2X the size of the reference fasta) as well as the somatic genome file (2/cov) will be stored as well as the matching truth VCF files for each. The somatic VCF file includes the clonal tree topology encoded using the PEDIGREE entries in the header.

options

all options can now be given uniform random ranges (these values will override JSON presets)

--model a parametric control that controls the tree topology 0.0 will simulate Cancer Stem Cell (CSC) and 1.0 will yeild a balanced subclonal topology

--branch the chance of each node to branch every cycle

--decay the chance of each node to die off every cycle

--cov the desired sequencing coverage so that the user can control the approximate noise level threshold for having alleles of the lowest frequency present

Advanced JSON Usage

The full distribution for a somatic genome generation will contain germline distribution parameters. Together the default human JSON file called full.hg38.json.gz contains the following data slots (keys in JSON nomenclature):

full.hg38.json.gz (all distributions packed up into one file)

gene_map - a JSON object used for gene symbol coordinates (has seq, gene and wcu keys for query)

g_var_map- a JSON object used for linked and unlinked germline multi-nucleotide and structural variation (type/size/frequency)

s_var_map - a JSON object used for linke and unlinked somatic germlin multi-nucleotide and structural variation (type/size/frequency)

clone_tree - a JSON object that defines the structure of the simulated clone tree in the somaic genome generation phase

somatic_aneuploidy - the multinomial distribution for each chromosome to generate aneuploidy (abnormal number of chromosomes)

wcus (weighted class units below) - JSON objects with chromosomes as keys with arrays for each chromosome containing: [start, end, length, {class_id:genomic_id}]. class_id are the g1kp3, onco, svmask, mitcp, mmej, nhej, apot identifiers and genomic_id are a gene SYMBOL or other unique genomic region id

germline_[genelist]loss_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for germline loss generation (for types like DEL, INV, TRA, etc)

germline
[genelist]gain_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for germline gain generation (for types like DUP, etc)

somatic
[genelist]loss_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for somatic loss generation (for types like DEL, INV, TRA, etc)

somatic
[genelist]_gain_wcu(s)- g1kp3, onco, svmask, mitcp, mmej, nhej, apot gene list based ranges that will become the positional distributions for somatic gain generation (for types like DUP, etc)

More details on the individual JSON objects:

gene_map

allows query by sequence name or by gene symbol. seq: will be each chromosome followed by an array of all genes and start,end coordinates so that all genes for a given chromosome can easily be integrated into gain or loss distribtions. gene: are the offciale gene symbols such as DDX11L1, TP53, etc and will resolve to one or more genomic coordinates.

g_var_map, s_var_map

keys denote the layer of the variation where 1 will be a MNV before layer 2 (SV layer) these SNV will be linked to the SVs that are intersecting, (such as DUP that would also duplicate the SNV) wheras layer 3 MNVs are applied after the SV events which will make them unlinked (such as a DUP with one copy containing a MNV but the other doesn't). In layer 2 (SV layer) there are parameters that govern the distributions of each type detailed further:

DUP l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome). CN - number of copies allowed (2,..,8) default. CNP - probabilty of each copy number. TYPE - DUP subtypes: TANDEM, DISPERSED, INVERTED. TP - probabilty of each type (type distribution)
TRA l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome). TYPE - subtype for traslocations: DEL or INS which are balanced or unbalanced. TP - is the type distribution for TRA
INV l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome). TYPE - is the subtype: PERFECT or COMPLEX. PERFECT have very simple signatures (perfect DNA repair) and reverse complement. COMPLES on the other hand have DEL and DUP posisble on the break ends as well as nove INS inside the INV SV call region. s:p - sets the complex INV subtype DEL/DUP/INS length distribution (50bp,100bp,etc)
DEL l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome).
INS l:n - distribution of the lengths (chance of a SV of lenght l per bp of genome).

clone_tree

clone tree keeps the same of the somatic evolution in tree form but is used practically to spool the FASTA files and generate each clone. The root key keeps track of the genome name (sample name) while the tree key keeps the nested JSON tree structure for each clone which will be nodes. Alive is an array that determines if a specific clone will be generated which is useful for simulation of ancesters that dropped out from selective pressure. freq is an array that keeps track of the number of ancestors which is used to allocate variation (used for allele frequency generation)

Web Browser JSON Editor

Visual web broswer JSON editor is in development

Python API

gene_effects, read_utils, variant_utils and sim python modules offer much greater flexibility in simulating mammalian genomes