Paired Replicate Analysis of Allelic Differential Splicing Events
PAIRADISE (PAIred Replicate analysis of Allelic DIfferential Splicing Events) is a method for detecting allele-specific alternative splicing (ASAS) from RNA-seq data. Unlike conventional approaches that detect ASAS events one sample at a time, PAIRADISE aggregates ASAS signals across multiple individuals in a population. By treating the two alleles of an individual as paired, and multiple individuals sharing a heterozygous SNP as replicates, PAIRADISE formulates ASAS detection as a statistical problem for identifying differential alternative splicing from RNA-seq data with paired replicates.
PAIRADISE requires the following dependencies:
- Python >= (2.7.5)
- R >= (3.2.1)
- STAR >= (2.6.0a) (can be downloaded here)
The folowing Python dependencies are required:
- pysam >= (0.15.2)
- pybedtools >= (0.8.0)
- more-itertools >= (5.0.0)
- numpy >= (1.16.2)
In addition, the following R packages must be installed before installing PAIRADISE:
- nloptr >= (1.2.1)
- doParallel >= (1.0.14)
- foreach >= (1.4.4)
- binom >= (1.1.1)
- ggplot2 >= (3.1.1)
Download PAIRADISE from github:
git clone github.com/Xinglab/PAIRADISE/pairadise
Change your working directory to "pairadise" and run the following commands to configure and install PAIRADISE:
./configure
make
make install
To test if PAIRADISE has been successfully installed, type in the command pairadise_personalize -h
. You should see
pairadise_personalize -h
usage: pairadise_personalize [-h] [--gz] [-e E] [--rnaedit] [-v V] [-o O]
[-r R]
[command [command ...]]
pairadise2
positional arguments:
command
optional arguments:
-h, --help show this help message and exit
--gz flag denoting gzipped reads
-e E file containing RNA editing positions, downloaded from RADAR
--rnaedit flag to check for RNA editing events, must also provide an RNA
editing file usng -e parameter
-v V VCF genotype directory
-o O output directory
-r R reference fasta file
You may need permission if want to install PAIRADISE to a root PATH. You can bypass this issue by specifying a user R library directory:
R CMD INSTALL -l /user/R/lib src/pairadise_model/
PAIRADISE requires the following subdirectories and input data to be in the directory where you will be performing your analysis (we'll refer to this as the 'data directory'):
-
genome: Contains the reference genome gtf file, fasta files, and (optionally) RNA editing information. We used the following files:
- hg19.fa
- Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
- Homo_sapiens.Ensembl.GRCh37.75.gtf
- Human_AG_all_hg19_v2.txt
-
genotype: Contains the genotype data in the format of vcf files, one vcf file per chromosome. The genotype directory contains one subdirectory for each sample/replicate, each containing that sample's vcf files. The subdirectory names should match the sample names. If data are biological replicates corresponding to one common sample, you only need one subdirectory.
-
input: Contains the fastq files for each sample/replicate. The fastq files should have the .gz extension.
-
scripts: Contains the scripts used to run the PAIRADISE pipeline and statistical model.
The following .txt file should also be in the data directory:
‘sample_name.txt’: Contains the sample ID’s and RNA-seq read lengths. This file contains 3 columns: the first two columns contain the sample names and the last column contains the read length. When the data are biological replicates, column 1 contains the sample IDs of the biological replicates, and column 2 contains the names of the sample.
Run the following command to create multiple .qsub files, each of which performs different stages of the PAIRADISE analysis:
python scripts/run_preprocess.py sample_name.txt population_name path/to/data/directory/
For example, if we are analyzing the Geuvadis CEU population and we are in the data directory, the above command becomes:
python scripts/run_preprocess.py CEU.txt CEU ./
We will continue using CEU as our running example.
Submit each of the qsub files corresponding to Step 1 of the PAIRADISE pipeline (personalization, mapping, and assignment):
qsub -cwd -l h_vmem=90G qsub_files/pairadise_step1_NA12843.qsub
The above command performs Step 1 for one sample: NA12843. You should run an analogous qsub command for each sample.
Note: Step 1 can take a long time to run. For the CEU dataset, the typical run-time for one sample ranged from 2-5 hours.
Once Step 1 has been completed for every sample, submit the qsub file corresponding to Step 2 (joint annotation):
qsub -cwd -l h_vmem=10G qsub_files/pairadise_step2_annotation.qsub
Once Step 2 is finished, submit each of the qsub files corresponding to Step 3 of the PAIRADISE pipeline (counting):
qsub -cwd -l h_vmem=15G qsub_files/pairadise_step3_NA12843.qsub
The above command performs Step 3 for one sample: NA12843. You should run an analogous qsub command for each sample.
Once Step 3 has been completed for every sample, submit the qsub file corresponding to Step 4 (merging the counts):
qsub -cwd -l h_vmem=10G qsub_files/pairadise_step4_merge.qsub
Finally, we run the PAIRADISE statistical model and create plots of the significant events.
qsub -cwd -l h_vmem=10G qsub_files/pairadise_step5_stat.qsub
PAIRADISE outputs a table of significant ASAS events (default significance is set to FDR <= 10%). Each row of the output table corresponds to a significant ASAS event and contains the following columns:
- ExonID: Gene name, chromosome number and strand, SNP name, genomic location, reference and alternative alleles.
- IJC_REF: Exon inclusion counts for the reference allele.
- SJC_REF: Exon skipping counts for the reference allele.
- IJC_ALT: Exon inclusion counts for the alternative allele.
- SJC_ALT: Exon skipping counts for the alternative allele.
- incLen: Effective length of the exon inclusion isoform.
- skpLen: Effective length of the exon skipping isoform.
- pval: Raw (unadjusted) PAIRADISE p-value.
- qval: PAIRADISE p-value FDR adjusted using the Benjamini-Hochberg method.
- IncLevel1: Naive psi values for reference allele samples.
- IncLevel2: Naive psi values for alternative allele samples.
- AvgIncLevel1: Average psi value for reference allele samples.
- AvgIncLevel2: Average psi value for alternative allele samples.
- IncLevelDifference: Difference in average psi values between reference and alternative allele samples.
- AvgTotalCount1: Average total number of read counts for reference allele samples.
- AvgTotalCount2: Average total number of read counts for alternative allele samples.
- SampleName: Sample names.
- RefAltAllele: Reference allele and alternative allele (in the format Ref|Alt).
- AF: Allele frequencies.
For visualizing the PAIRADISE results, we recommend uploading the PAIRADISE output table directly into the accompanying Shiny app. Here's an example of a significant ASAS event visualized using our Shiny app:
We have developed a data visualization tool in R Shiny for visualizing the results of PAIRADISE. The app can be accessed here. Have fun!
Levon Demirdjian [email protected]
Yi Xing [email protected]