A set of scripts for 3' RNA-seq quantification
To install run python setup.py install
.
To install locally, without root privelegies, run python setup.py install --user
. Don't forget to add .local/bin
to your PATH
. You can do it by adding line export PATH="$HOME/.local/bin:$PATH"
to the ~/.profile
file.
You can quickly run analysis for the example data.
quant3p -n example -g example/mm10.slice.gtf example/bam/*.bam
Parameters are:
-n NAME
sets the name of experiment to beNAME
,-g GTF
tells to useGTF
as the annotation,example/bam/*.bam
is a list of bam-files to process.
This command produces one file example.cnt
that contatins gene counts, that looks like this:
2h_rep1 2h_rep2 4h_rep1 4h_rep2
100039246 0 0 0 0
11744 112 118 72 86
14673 105 114 52 67
211623 0 0 0 0
231842 47 39 16 17
232157 1799 1947 2060 2075
66039 1 2 0 0
78653 147 148 90 131
NM_001270503_dup1 0 0 0 0
NM_001270503_dup2 0 0 0 0
NM_207229_dup1 0 0 0 0
NM_207229_dup2 0 0 0 0
__no_feature 463 489 389 433
__ambiguous 0 0 0 0
__too_low_aQual 0 0 0 0
__not_aligned 0 0 0 0
__alignment_not_unique 289 190 231 194
If you want to look at intermediate files, you can use parameter --keep-temp
.
quant3p
consists of four steps that can be useful by itself:
- Finding potential exons by calling peaks with MACS2,
- Fixing annotation by associating the potential exons with annotated genes,
- Fixing multimapper tags in the alignment files by selecting only alignment that maps only to annotated exons,
- Counting reads with HTSeq.
To find potential exons we call peaks in combined RNA-seq data. We do it separetely for positive and negative strands
to use strand-specificity of our data. It's done by calling macs2-stranded
.
macs2-stranded -n example example/bam/*.bam
Here:
-n example
sets the name of experiment to beexample
,example/bam/*.bam
is a list of bam-files to process.
For the example it should produce 15 peaks (file example_peaks.narrowPeak
):
chr14 25846959 25847377 example.pos_peak_1 372 + 13.33333 42.28070 37.23716 163
chr14 25871156 25871413 example.pos_peak_2 114 + 9.52381 16.27811 11.43326 80
chr14 25886465 25886948 example.pos_peak_3 1319 + 25.60976 138.30992 131.94069 273
chr14 26026235 26026718 example.pos_peak_4 1324 + 25.81967 138.85394 132.45966 273
chr14 26165849 26166332 example.pos_peak_5 1324 + 25.81967 138.85394 132.45966 273
chr5 140752996 140753373 example.pos_peak_6 718 + 22.40260 77.17718 71.88202 214
chr6 83341577 83341950 example.pos_peak_7 724 + 6.68317 77.78582 72.48792 197
chr6 83342407 83342877 example.pos_peak_8 1071 + 8.43220 112.89129 107.17482 249
chr6 83343311 83343810 example.pos_peak_9 1078 + 8.51155 113.64171 107.87556 272
chr6 83351316 83351531 example.pos_peak_10 173 + 9.23567 22.26187 17.36432 79
chr6 83353068 83353297 example.pos_peak_11 397 + 14.96815 44.79282 39.73446 80
chr6 83358160 83358528 example.pos_peak_12 1767 + 34.81308 185.13048 176.73187 183
chr5 140758332 140758782 example.neg_peak_1 1403 - 28.53982 148.70955 140.31094 182
chr5 140806999 140807270 example.neg_peak_2 73 - 7.81250 13.03361 7.39012 75
chr6 83346833 83347040 example.neg_peak_3 200 - 13.04348 25.95999 20.05134 37
Next we fix annotation by adding RNA-seq peaks as exons to the transcripts if the peak overlaps with a region of 5Kbp downstream of the transcript's 3' end.
gtf-extend -g example/mm10.slice.gtf -p example_peaks.narrowPeak -o mm10.slice.fixed.gtf --extns-out mm10.slice.extensions.gtf
Here:
-g example/mm10.slice.gtf
tells to useexample/mm10.slice.gtf
as the annotation,-p example_peaks.narrowPeak
tells to useexample_peaks.narrowPeak
as peaks,-o mm10.slice.fixed.gtf
tells to put fixed annotation intomm10.slice.fixed.gtf
file,--extns-out mm10.slice.extensions.gtf
tells to put newly added exons intomm10.slice.extensions.gtf
files.
In the example we added 4 new exons (file mm10.slice.extensions.gtf
):
chr6 extension exon 83341578 83341950 724 + . transcript_id "NM_145571"; gene_id "232157"
chr6 extension exon 83342408 83342877 1071 + . transcript_id "NM_145571"; gene_id "232157"
chr6 extension exon 83343312 83343810 1078 + . transcript_id "NM_145571"; gene_id "232157"
chr5 extension exon 140758333 140758782 1403 - . transcript_id "NM_010302"; gene_id "14673"
Main idea here is that we expect RNA-seq reads to be aligned to the genes.
So, we discard alignments of multimappers to the unannotated regions and
fix multimapping annotation (NH
SAM field) for such reads. Some of them
can be uniquely mapped to one gene.
fix-mm -g example/mm10.slice.gtf example/bam/2h_rep1.bam -o 2h_rep1.fixed.bam
Here:
-g example/mm10.slice.gtf
tells to useexample/mm10.slice.gtf
as the annotation,*example/bam/2h_rep1.bam
tells to fixexample/bam/2h_rep1.bam
file,-o 2h_rep1.fixed.bam
tells to put fixed bam-file into2h_rep1.fixed.bam
,
In the 2h_rep1.bam
file all the 301 multimappers can be uniquely mapped to a single position covered by an exon.
Counting with HTSeq is rather straightworward:
samtools view 2h_rep1.fixed.bam | htseq-count -s yes -t exon - mm10.slice.fixed.gtf
Here we use the fixed bam file (2h_rep1.fixed.bam
) and annotaion (mm10.slice.fixed.gtf
).
Output should be like this:
100039246 0
11744 112
14673 105
211623 0
231842 47
232157 1799
66039 1
78653 147
NM_001270503_dup1 0
NM_001270503_dup2 0
NM_207229_dup1 0
NM_207229_dup2 0
__no_feature 463
__ambiguous 0
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 289
If we run htseq-count
for the original data:
samtools view -h example/bam/2h_rep1.bam | htseq-count -s yes -t exon - example/mm10.slice.gtf
A lot of reads will not be counted:
100039246 0
11744 0
14673 0
211623 0
231842 44
232157 17
66039 0
78653 56
NM_001270503_dup1 0
NM_001270503_dup2 0
NM_207229_dup1 0
NM_207229_dup2 0
__no_feature 2020
__ambiguous 0
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 826
macs2-stranded
depends on samtools
and macs2
executables. macs2
version should be >= 2.0.10
gtf-extend
and fix-mm
needs pybedtools
and pysam
python packages and bedtools
with samtools
installed
All the dependencies can be installed using conda: conda install -c bioconda macs2 pybedtools pysam HTseq