This small RNA sequencing pipeline processes RNA sequencing data for downstream analyses. It is built with Snakemake and can be run on different platforms, including high-performance computing (HPC) systems. It ships with a Conda environment file (environment.yml) to ensure reproducibility.
To install this workflow, clone the repo:
git clone https://github.com/sinanugur/sncRNA-workflow.git
cd sncRNA-workflow
If you have Anaconda, you can create a new environment:
conda env create -n smrnaworkflow --file environment.yml
conda activate smrnaworkflow
If that fails, you can update an existing environment instead:
conda env update -f environment.yml --prune
Installing mamba, a faster drop-in replacement for conda, can also help (https://mamba.readthedocs.io/en/latest/installation.html):
conda install mamba -n base -c conda-forge
mamba env create -n smrnaworkflow --file environment.yml
conda activate smrnaworkflow
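Once the environment is activated, it can be useful to confirm that the expected tools are actually on the PATH. The following is a minimal sketch, not part of the workflow; the function name require_cmds is hypothetical.

```shell
# Report whether each named command is available on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
# require_cmds is a hypothetical helper, not shipped with the workflow.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, require_cmds conda snakemake reports both tools as found when the environment is set up correctly.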
Make sure the human genome file is in the databases folder. You can download it by typing:
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Then put this file under the databases directory. Do not unzip the file; keeping it compressed also saves space.
Alternatively, run the download_databases.sh script, which downloads the genome file automatically.
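Whichever way the file was obtained, an interrupted download can leave a truncated archive behind. The sketch below checks that the file exists and is still an intact gzip archive. The function name check_genome is hypothetical, and the default path assumes the databases/hg38.fa.gz location described above.

```shell
# Check that the genome archive exists and is an intact gzip file.
# $1: path to the archive (defaults to databases/hg38.fa.gz, an assumed location).
# check_genome is a hypothetical helper, not shipped with the workflow.
check_genome() {
    genome="${1:-databases/hg38.fa.gz}"
    if [ ! -s "$genome" ]; then
        echo "missing or empty: $genome" >&2
        return 1
    fi
    # gzip -t tests archive integrity without writing the decompressed data.
    if ! gzip -t "$genome" 2>/dev/null; then
        echo "not a valid gzip file: $genome" >&2
        return 1
    fi
    echo "ok: $genome"
}
```

Running check_genome from the repository root reads the whole archive, so it may take a while on the full genome file.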
Create a new directory called data and place your FASTQ files, or symbolic links to them, in it. You need an active Conda installation with Snakemake, or an environment created from the YAML file; no other requirements have to be installed.
If the environment is set up with all the packages, the following command triggers a workflow run immediately using up to 15 threads:
snakemake -j 15
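Before launching the run, the data directory described above can be populated with symbolic links rather than copies. The sketch below is one way to do that. The function name link_fastq is hypothetical, and the *.fastq.gz pattern is an assumption about how the input files are named; adjust it to match your files.

```shell
# Symlink every *.fastq.gz file from a source directory into a target directory.
# $1: directory holding the original FASTQ files (supply your own path)
# $2: target directory (defaults to data)
# link_fastq and the *.fastq.gz pattern are assumptions, not part of the workflow.
link_fastq() {
    src="$1"
    dest="${2:-data}"
    mkdir -p "$dest"
    count=0
    for f in "$src"/*.fastq.gz; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        # Use an absolute path so the link stays valid from any working directory.
        abs="$(cd "$(dirname "$f")" && pwd)/$(basename "$f")"
        ln -sf "$abs" "$dest/"
        count=$((count + 1))
    done
    echo "linked $count file(s) into $dest"
}
```

For example, link_fastq /path/to/raw_reads links the files into data/. Afterwards, snakemake -n -j 15 performs a dry run that lists the planned jobs without executing them.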
GENCODE (v38)
miRBase (v22)
piRBase (v1.0)
The versions listed above are the ones used at release. You may update the databases.
This workflow generates a results/ directory, which contains count tables and sample statistics.
Full GENCODE count table: gencode.tsv
Long non-coding RNA (GENCODE): lincRNA.tsv
Miscellaneous RNA (GENCODE): misc_RNA.tsv
mRNA (GENCODE): protein_coding.tsv
Small nucleolar RNA (GENCODE): snoRNA.tsv
Small nuclear RNA (GENCODE): snRNA.tsv
Small Cajal body-specific RNA (GENCODE): scaRNA.tsv
tRNA (GENCODE): tRNA.tsv
miRBase count tables:
miRBase miRNA: miRNA.tsv
miRBase miRNA precursor: miRNA_precursor.tsv
Piwi-interacting RNAs (piRBase): piRNA.tsv
tRNA-derived fragments: tRF.tsv (generated by MINTmap, see https://cm.jefferson.edu/mintmap/)
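A quick way to sanity-check any of the count tables listed above is to look at its dimensions. The sketch below assumes each table is tab-separated with a single header line, which the .tsv extension suggests but which is not confirmed here. The function name table_shape is hypothetical.

```shell
# Print the number of columns and data rows in a tab-separated table.
# Assumes the first line is a header; this layout is an assumption.
# table_shape is a hypothetical helper, not shipped with the workflow.
table_shape() {
    awk -F '\t' '
        NR == 1 { cols = NF }
        END     { printf "%d columns, %d data rows\n", cols, (NR > 0 ? NR - 1 : 0) }
    ' "$1"
}
```

For example, table_shape results/miRNA.tsv, assuming the tables sit directly under results/.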
This workflow was adapted from our small RNA analysis study. Please cite the study if you find the workflow useful: https://doi.org/10.1080/15476286.2017.1403003
The study was funded by the European Union’s Horizon 2020 research and innovation program (grant 825741) and the Research Council of Norway under the Program Human Biobanks and Health Data (grant numbers 229621/H10 and 248791/H10).