See Changelog
SSP (strand-shift profile) is a tool for quality assessment of ChIP-seq data without peak calling.
SSP provides metrics to:
- quantify the S/N for both point- and broad-source factors (NSC),
- estimate peak reliability based on the mapped-read distribution throughout a genome (Bu),
- and estimate peak intensity and peak mode (point- or broad-source, FCS).
The outputs of SSP are displayed in PDF format and also written to text files.
SSP is written in C++11 and requires the following programs and libraries. The version numbers listed have been tested successfully.
- Boost C++ library (1.53.0, 1.58.0)
- GNU Scientific Library (1.15, 2.1)
- zlib (1.2.7, 1.2.8)
- CMake (>2.8)
- HTSlib (1.10.2) (for SAM/BAM/CRAM formatted input)
On Ubuntu:
sudo apt install git build-essential libboost-all-dev libcurl4-gnutls-dev \
liblzma-dev libz-dev libbz2-dev cmake
On CentOS and Red Hat:
sudo yum -y install git gcc-c++ clang boost-devel zlib-devel bzip2-devel cmake
On Mac:
brew install curl xz zlib boost cmake
git clone https://github.com/rnakato/SSP.git
cd SSP
make
For example, if you downloaded SSP into the $HOME/my_chipseq_exp directory, type:
export PATH=$PATH:$HOME/my_chipseq_exp/SSP/bin
SSP and DROMPA are also provisionally available on Docker Hub.
To obtain a docker image for SSP and DROMPA, type:
docker pull rnakato/ssp_drompa
docker run -it --rm rnakato/ssp_drompa ssp
For Singularity:
singularity build ssp_drompa.img docker://rnakato/ssp_drompa
singularity exec ssp_drompa.img ssp
Usage: ssp [option] -i <inputfile> -o <output> --gt <genome_table>
Options:
Input/Output:
-i [ --input ] arg Mapping file. Multiple files are allowed (separated by ',')
-o [ --output ] arg Prefix of output files
--odir arg (=sspout) output directory name
-f [ --ftype ] arg {SAM|BAM|CRAM|BOWTIE|TAGALIGN}: format of input file
TAGALIGN can be gzip'ed (extension: tagAlign.gz)
For paired-end:
--pair add when the input file is paired-end
--maxins arg (=500) maximum fragment length
Genome:
--gt arg Genome table (tab-delimited file describing the name and length of
each chromosome)
--mptable arg Genome table of mappable regions
--include_allchr Include all chromosomes for calculation (default: autosomes only,
i.e., 'chrN', where N is a numeric number)
Fragment:
--nomodel omit fraglent length estimation (default: estimated by strand-shift profile)
--flen arg (=150) predefined fragment length (with --nomodel option)
Strand shift profile:
--num4ssp arg (=10000000) Read number for calculating backgroud uniformity (per 100 Mbp)
--ng_from arg (=500000) start shift of background
--ng_to arg (=1000000) end shift of background
--ng_step arg (=5000) step shift on of background
--ssp_cc make ssp based on cross correlation
--ssp_hd make ssp based on hamming distance
--ssp_exjac make ssp based on extended Jaccard index
--eachchr make chromosome-sparated ssp files
Fragment cluster score:
--ng_from_fcs arg (=100000) fcs start of background
--ng_to_fcs arg (=1000000) fcs end of background
--ng_step_fcs arg (=100000) fcs step on of background
Library complexity:
--thre_pb arg (=0) PCRbias threshold (default: more than max(1 read, 10 times greater
than genome average))
--ncmp arg (=10000000) read number for calculating library complexity
--nofilter do not filter PCR bias
Others:
-p [ --threads ] arg (=1) number of threads to launch
-v [ --version ] print version
-h [ --help ] show help message
The simplest command is:
ssp -i ChIP.sam -o ChIP --gt genometable.txt
The output files (prefix: "ChIP") are then generated in the directory "sspout" (the default). The input file format is detected automatically from the file extension (.sam/.bam/.cram/.bowtie/.tagalign(.gz)). If the detection fails, supply the -f option (e.g., "-f BAM").
The genome table file (genometable.txt) is a tab-delimited file describing the name and length of each chromosome (see 4.1.). The chromosome names in the mapping file and the genome table file must be the same.
To supply the mappable genome table and use multiple CPUs:
ssp -i ChIP.bam -o ChIP --gt genometable.txt --mptable mptable.txt -p 4
"-p 4" specifies the number of CPUs used. The mappable genome table file is necessary for accurate estimation of background uniformity.
SSP accepts multiple input files (separated by ","):
ssp -i ChIP1.bam,ChIP2.bam,ChIP3.bam -o ChIP --gt genometable.txt
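When there are many replicates, the comma-separated list can be built from a file pattern instead of being typed by hand. The following is a minimal sketch, not part of SSP: the temporary directory and the empty placeholder files are hypothetical stand-ins for real BAM files, and the final line only prints the command rather than running it.

```shell
# Hypothetical placeholder files standing in for real BAM files.
workdir=$(mktemp -d)
touch "$workdir/ChIP1.bam" "$workdir/ChIP2.bam" "$workdir/ChIP3.bam"

# Expand the pattern (one path per line), then join the lines with commas
# to form the value passed to -i.
inputs=$(printf '%s\n' "$workdir"/ChIP*.bam | paste -sd, -)

# Print the resulting command for inspection.
echo "ssp -i $inputs -o ChIP --gt genometable.txt"
```

This approach assumes that the file names contain no commas or newlines.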
Note that each chromosome should be sufficiently longer than the background length specified. For small genomes (e.g., yeast), the background length should be shortened:
ssp -i ChIP1.bam -o ChIP --gt genometable.txt --ng_from 10000 --ng_to 50000 --ng_step 500
With these parameters, the background is taken as the average over strand shifts from 10 kbp to 50 kbp in steps of 500 bp.
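Before choosing --ng_to, it can help to check which chromosomes in the genome table are short relative to the background range. The sketch below is not part of SSP: the genome table contents are illustrative, and the variable `factor` is an assumed safety margin chosen for this example, not an SSP parameter or a documented requirement.

```shell
# Hypothetical genome table (name<TAB>length); the values are illustrative.
workdir=$(mktemp -d)
printf 'chrI\t230218\nchrII\t813184\nchrM\t85779\n' > "$workdir/genometable.txt"

ng_to=50000   # same value as --ng_to in the command above
factor=2      # assumed safety margin, not an SSP parameter

# Collect the names of chromosomes shorter than factor * ng_to.
short=$(awk -F'\t' -v lim="$((ng_to * factor))" '$2 < lim { print $1 }' \
  "$workdir/genometable.txt")

echo "chromosomes shorter than $((ng_to * factor)) bp: ${short:-none}"
```

Any chromosome listed here may be too short for the chosen background range, so either a smaller --ng_to or a different margin may be worth considering.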
By default, FCS is calculated for 10M nonredundant reads. If the number of nonredundant reads in the input data is smaller than 10M, specify a smaller number for fair comparison among samples, as follows:
ssp -i ChIP1.bam -o ChIP --gt genometable.txt --num4ssp 5000000
When a smaller read number is specified for --num4ssp, the FCS score becomes smaller, but the relative order of the samples remains consistent.
By default, SSP uses only autosomes, i.e., ‘chrN’, where N is a number, in the genome_table file. However, in some cases the chromosome names do not start with “chr”. In such a case, add the --include_allchr option to include all chromosomes in the genome_table file.
ssp -i ChIP1.bam -o ChIP --gt genometable.txt --include_allchr
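To judge whether --include_allchr is needed, the names in a genome table can be previewed first. The sketch below is not part of SSP and assumes that a ‘chrN’ name means the prefix "chr" followed only by digits; that pattern is an interpretation of the description above, not SSP's verified matching rule. The genome table contents are hypothetical.

```shell
# Hypothetical genome table (name<TAB>length) mixing several naming styles.
workdir=$(mktemp -d)
printf 'chr1\t1000\nchr2\t900\nchrX\t800\nchrM\t16\n2L\t700\n' \
  > "$workdir/genometable.txt"

# Names of the form chr<digits> (assumed to be what 'chrN' refers to).
numeric=$(awk -F'\t' '$1 ~ /^chr[0-9]+$/ { print $1 }' \
  "$workdir/genometable.txt" | paste -sd' ' -)

# All remaining names.
other=$(awk -F'\t' '$1 !~ /^chr[0-9]+$/ { print $1 }' \
  "$workdir/genometable.txt" | paste -sd' ' -)

echo "chrN-style names: $numeric"
echo "other names:      $other"
```

If the chromosomes of interest appear only in the second list, adding --include_allchr is probably appropriate.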
- ChIP.stats.txt: Stats of the sample (read number, read length, estimated fragment length, NSC, RLSC, RSC, background uniformity, FCS)
- ChIP.jaccard.csv: Jaccard score for each strand shift d
- ChIP.jaccard.pdf: Strand-shift profiles (-500 < d < 1500 and 0 < d < 1M)
- ChIP.jaccard.R: R script to make ChIP.jaccard.pdf
- ChIP.jaccard.R.log: Log file of ChIP.jaccard.R
- ChIP.fcs.csv: FCS for each strand shift d
- ChIP.pnf.csv: PNF (the proportion of neighboring fragments) for each s
- ChIP.FCS.pdf: Profiles of PNF, cPNF (the cumulative proportion of neighboring fragments) and FCS
- ChIP.FCS.R, ChIP.FCS.R.log: R script and log file to make ChIP.FCS.pdf
The genome table file is a tab-delimited file describing the name and length of each chromosome. To make it, use makegenometable.pl in the scripts directory as follows:
scripts/makegenometable.pl genome.fa > genometable.txt
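To illustrate the expected format, the sketch below derives a name/length table from a toy FASTA file using awk. It is not the implementation of makegenometable.pl, and it has not been checked that the two produce identical output (for example, in how header lines with descriptions are handled). The FASTA contents and the temporary directory are hypothetical.

```shell
# Hypothetical toy FASTA with two short sequences (for illustration only).
workdir=$(mktemp -d)
cat > "$workdir/genome.fa" <<'EOF'
>chr1 example description
ACGTACGTAC
ACGTA
>chr2
GGGCCC
EOF

# Print "name<TAB>length" for each sequence. The name is the first word of
# the header line; the length is the total over all of its sequence lines.
awk '
  /^>/ { if (name != "") printf "%s\t%d\n", name, len
         name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") printf "%s\t%d\n", name, len }
' "$workdir/genome.fa" > "$workdir/genometable.txt"

cat "$workdir/genometable.txt"
```

Each output line holds a chromosome name and its length separated by a tab, which is the layout described above.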
The mappability table file is a tab-delimited file describing the name and 'mappable' length of each chromosome. Mappability tables generated for several species (36-mer and 50-mer) are provided in the mptable directory; they are based on the code from PeakSeq. See the DROMPA3 manual for details.
Nakato R., Shirahige K. Sensitive and robust assessment of ChIP-seq read distribution using a strand-shift profile, Bioinformatics, 2018, https://doi.org/10.1093/bioinformatics/bty137