miRPursuit

Check out our Read the Docs page for a more structured overview of this project:

Read the Docs - DOCUMENTATION

miRPursuit – a pipeline for automated analyses of small RNAs in non-model plants

These authors contributed equally to this work.

Abstract

miRPursuit is a pipeline developed for running end-to-end analyses of high-throughput small RNA (sRNA) sequence data in model and non-model plants, from raw data to identified and annotated conserved and novel sequences. It consists of a series of UNIX shell scripts that connect the inputs and outputs of several established, open-source sRNA analysis tools. In this way, high customizability and full transparency of the analyses and their parameters are combined with convenient workflow management, including for users without advanced computational skills. A considerable advantage is that several sRNA libraries can be processed in parallel.

Small non-coding RNAs (sRNAs) are pivotal in the regulation of gene expression during plant growth and development, and in response to abiotic and biotic stresses. The affordable, high-throughput sequencing provided by NGS platforms is an attractive approach to discovering the small RNAs involved in the regulation of important biological processes in plants. However, the amount of data generated by such studies can be staggering and requires efficient tools for quick analysis.

This pipeline has been built around a publicly available software package, the University of East Anglia (UEA) sRNA workbench [9], which includes tools for identifying sRNA classes, such as microRNAs (miRNAs) and trans-acting siRNAs (tasiRNAs), both conserved and novel, and for predicting their precursor RNAs using a user-specified reference genome. Moreover, target genes can be predicted and validated using degradome fragment sequences and a reference transcriptome.

By setting up a workflow, a predefined sequence of tools can be run autonomously. The NGS raw data obtained from various libraries can be supplied as input files, allowing the user to process multiple libraries in a single command-line call. The degree of customization in this pipeline makes it possible to fine-tune the workflow and to use user-supplied omics data.

Thus, the main advantage of using this system over the workbench's individual tools is that it minimizes the need to perform repetitive manual tasks. The pipeline automatically connects each step by handling the data flow between tools. The workflow is implemented in bash, which makes it well suited to Unix servers and allows uninterrupted runs on high-capacity clusters, enabling the processing of multiple large-scale datasets. The end result is the identification and annotation of conserved and novel miRNAs and tasiRNAs, along with the expression matrix of the libraries from the input dataset, which can easily be imported into Excel or R to perform differential expression analyses.

Future development of the pipeline will include a database of the generated annotations and a user-friendly graphical interface.

This pipeline was built to simplify the handling of NGS data. Starting from FASTQ files, it provides seamless classification of sRNAs and prediction of tasiRNAs and sRNA targets.

How to start:

UEA sRNA Workbench, optimized for the Linux version (~3.2)

Perl (version 5.8)

Java, optimized for version ~1.7

On Ubuntu it can be found in this package: libc6-i386

sudo apt-get update

sudo apt-get install libc6-i386

Set up the variables in the config dir.

Patman

Tar

Fastx Toolkit

run miRPursuit.sh
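The prerequisites listed above can be checked before the first run. The following is a minimal sketch and not part of miRPursuit: the function name `check_tools` is hypothetical, and the executable names in the example call (such as `patman` and `fastx_clipper`) are assumptions about how the tools are installed on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with miRPursuit): report which of the
# given executables can or cannot be found on the PATH.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  # Non-zero exit status when at least one tool is missing.
  return "$missing"
}

# Example call; the executable names are assumptions, adjust as needed:
# check_tools perl java tar patman fastx_clipper
```

The function only inspects the PATH, so tools that are configured through config/software_dirs.cfg instead would need to be checked against the paths set there.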

Installation

From GitHub

$ cd /toDesiredLocation/
$ git clone https://github.com/forestbiotech-lab/miRPursuit.git
$ cd miRPursuit

From tar

# Download the archive from GitHub
$ cd /toDesiredLocation/
$ unzip miRPursuit-master.zip

Dependencies

To install the necessary dependencies you can run install.sh in the main folder

$ cd /pathtoMiRPursuit/
$ ./install.sh

Custom Installation

Set software dir in config file

$ cd /pathtoMiRPursuit/
$ vim config/software_dirs.cfg

Running test dataset

Edit config/workdirs.cfg

Set INSERTS_DIR=pathToMiRPursuit/testDataset (example for the test dataset)

Use a simple plant genome as the reference genome. (The sRNAs in the dataset were detected using the C. canephora genome.)

Example command to analyse the test dataset (make sure all the variables mentioned above are already set):

$ bash pathToMirPursuit/miRPursuit.sh -f 1 -l 2 --fasta test_dataset-
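Editing the configuration by hand works, but the same change can be scripted. The sketch below is an illustration only: `set_cfg_var` is a hypothetical helper, and it assumes that config/workdirs.cfg holds one `NAME=value` assignment per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set NAME=VALUE in a config file that uses one
# NAME=value assignment per line (an assumption about workdirs.cfg).
set_cfg_var() {
  local file=$1 name=$2 value=$3 tmp
  tmp="$(mktemp)" || return 1
  if grep -q "^${name}=" "$file"; then
    # Replace the existing assignment, leave every other line untouched.
    awk -v n="$name" -v v="$value" \
      'index($0, n "=") == 1 { print n "=" v; next } { print }' \
      "$file" > "$tmp"
  else
    # Variable not present yet: copy the file and append the assignment.
    cat "$file" > "$tmp"
    echo "${name}=${value}" >> "$tmp"
  fi
  # Write back in place so the original file keeps its permissions.
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (paths are placeholders):
# set_cfg_var config/workdirs.cfg INSERTS_DIR pathToMiRPursuit/testDataset
```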

Analysing sRNA

Works for fastq and fasta input formats.

config - Directory that has all the variables for the workflow.

workdirs.cfg

workdir - Path to the working directory (created if it doesn't exist)

genomes - Path to the genomes

GENOME_MIRCAT - Path to the genome to be used by miRCat. Set to ${GENOME} if you don't need to run the genome in several parts (which may be necessary if you have a limited amount of RAM).

FILTER_SUF - Filter suffix used to choose the predefined filter settings.

MEMORY - Amount of memory to be used by Java when running memory-intensive scripts, e.g. 10g, 2000m.

THREADS - Number of cores to be used during execution

INSERTS_DIR - Path to the inserts directory

MIRBASE - Path to the miRBase database
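Before launching a long run it can help to confirm that the variables listed above are actually set. The following sketch is not part of miRPursuit: `check_cfg_vars` is a hypothetical helper, and it assumes the config file can be sourced as shell code (plain `NAME=value` lines).

```shell
#!/usr/bin/env bash
# Hypothetical check: source a config file in a subshell and report any of
# the named variables that are unset or empty.
check_cfg_vars() {
  local file=$1
  shift
  (
    # The subshell keeps the sourced variables out of the caller's shell.
    . "$file" || exit 2
    status=0
    for name in "$@"; do
      # Look up the variable whose name is stored in $name.
      eval "value=\${$name:-}"
      if [ -z "$value" ]; then
        echo "not set: $name"
        status=1
      fi
    done
    exit "$status"
  )
}

# Example (variable names taken from the list above):
# check_cfg_vars config/workdirs.cfg workdir genomes GENOME_MIRCAT \
#   FILTER_SUF MEMORY THREADS INSERTS_DIR MIRBASE
```

The function returns a non-zero status when any variable is missing, so it can be used as a guard in front of the pipeline call.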

software_dirs.cfg

patman_genome.cfg

wbench_mircat.cfg

wbench_tasi.cfg

Programs

sRNAworkFlow.sh

-f|--lib-first

-l|--lib-last

-h|--help

-s|--step

Step 1: Wbench Filter

Step 2: Filter Genome & miRBase

Step 3: Tasi

Step 4: Mircat

Step 5: PareSnip

--fasta

--fastq

--trim

mirbase hits

predicted targets

predicted mRNA

[workdir]/logs

[workdir]/counts

Figure 1

In the file structure, each rectangle represents a folder, dotted lines indicate relative paths, while solid lines indicate direct relation (folder is child of arrow origin).

/miRPursuit is located in the path where it was installed. /[workdir_name] has the path that was set in workdir in workdirs.cfg. /config has all the configuration files specified in supp file 1. /count has all the count files generated. /data stores all generated files, along with any intermediary files generated by the processes in the pipeline. /log stores all the log files related to the pipeline execution.

------

predict-target.sh

-f|--lib-first "First library to be processed"

-l|--lib-last "Last library to be processed"

-d|--degradome "Degradome location"

-h|--help "Display help"

targets

For detailed file names check the corresponding pipeline. This program executes the following programs in that order. Stats on the number of reads are stored in the count directory. The count file is not really a TSV: its values are separated by spaces rather than tabs, which was considered close enough to keep the .tsv extension. The format used for count file names is %y%m%d:%h%m%s-type-lib[lib_first]-[lib_last].tsv
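Because the count files are space-separated despite their extension, some tools may need them converted first. The sketch below is illustrative and not part of miRPursuit: both function names are hypothetical, the conversion assumes that no field contains a space, and reading the timestamp pattern as year, month, day, hour, minute, second is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite a space-separated count file as true
# tab-separated values on standard output.
# Assumes no field contains a space.
counts_to_tsv() {
  # Assigning $1 to itself makes awk rebuild the record using OFS.
  awk 'BEGIN { OFS = "\t" } { $1 = $1; print }' "$1"
}

# Hypothetical helper: print a timestamp shaped like the file-name prefix,
# assuming it means yymmdd:HHMMSS.
count_stamp() {
  date +%y%m%d:%H%M%S
}

# Example (file names are placeholders):
# counts_to_tsv counts/some-count-file.tsv > counts_tabbed.tsv
```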

The log directory has a lot of information about what happened during the execution of the scripts. It uses a file naming scheme similar to that of the count files: %y%m%d:%h%m%s-type.log, or *.log.ok if the run reached the end.
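The naming scheme makes it possible to spot runs that did not finish. This is a minimal sketch under that reading of the scheme, not a tool shipped with miRPursuit; `unfinished_logs` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the log files in a directory that lack the
# ".ok" suffix, i.e. runs that, per the naming scheme, did not reach the end.
unfinished_logs() {
  local dir=$1 f
  for f in "$dir"/*.log; do
    # When nothing matches, the glob stays literal; skip that case.
    [ -e "$f" ] || continue
    echo "$f"
  done
}

# Example (path is a placeholder for the log directory of your workdir):
# unfinished_logs /path/to/workdir/logs
```

The pattern `*.log` does not match names ending in `.log.ok`, so completed runs are left out of the listing.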

References:

1 - Borges F & Martienssen RA (2015) The expanding world of small RNAs in plants. Nat Rev Mol Cell Biol 16, 727–741.

2 - Sunkar R (2010) MicroRNAs with macro-effects on plant stress responses. Semin Cell Dev Biol 21, 805–811.

3 - Liu J & Vance CP (2010) Crucial roles of sucrose and miRNA399 in systemic signaling of P deficiency - A tale of two team players? Plant Signaling and Behaviour 5, 1–5.

4 - Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281–297.

5 - Allen E, Xie Z, Gustafson AM & Carrington JC (2005) microRNA-directed phasing during trans-acting siRNA biogenesis in plants. Cell 121, 207–221.

6 - Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X & Mortazavi A (2016) A survey of best practices for RNA-seq data analysis. Genome Biol 17, 13.

7 - Kozomara A & Griffiths-Jones S (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39, D152–7.

8 - Chaves I, Lin Y-C, Pinto-Ricardo C, Van de Peer Y & Miguel C (2014) miRNA profiling in leaf and cork tissues of Quercus suber reveals novel miRNAs and tissue-specific expression patterns. Tree Genet. Genomes 10, 721–737.

9 - Stocks MB, Moxon S, Mapleson D, Woolfenden HC, Mohorianu I, Folkes L, Schwach F, Dalmay T & Moulton V (2012) The UEA sRNA workbench: a suite of tools for analysing and visualizing next generation sequencing microRNA and small RNA datasets. Bioinformatics 28, 2059–2061.

10 - BabrahamBioinformatics (2016) A quality control tool for high throughput sequence data http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

11 - HannonLab (2010) FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/index.html.

12 - Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J & Finn RD (2015) Rfam 12.0: updates to the RNA families database. Nucleic Acids Res 43, D130–7.

13 - Prüfer K, Stenzel U, Dannemann M, Green RE, Lachmann M & Kelso J (2008) PatMaN: rapid alignment of short sequences to large databases. Bioinformatics 24, 1530–1531.

14 - Chen H-M, Li Y-H & Wu S-H (2007) Bioinformatic prediction and experimental validation of a microRNA-directed tandem trans-acting siRNA cascade in Arabidopsis. Proc Natl Acad Sci U S A 104, 3318–3323.

15 - Griffiths-Jones S (2006) miRBase: the microRNA sequence database. Methods Mol Biol 342, 129–138.

16 - Griffiths-Jones S, Saini HK, van Dongen S & Enright AJ (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res 36, D154–8.

17 - Taylor RS, Tarver JE, Foroozani A & Donoghue PCJ (2017) MicroRNA annotation of plant genomes - Do it right or not at all. Bioessays 39.

18 - Meyers BC, Axtell MJ, Bartel B, Bartel DP, Baulcombe D, Bowman JL, Cao X, Carrington JC, Chen X, Green PJ, Griffiths-Jones S, Jacobsen SE, Mallory AC, Martienssen RA, Poethig RS, Qi Y, Vaucheret H, Voinnet O, Watanabe Y, Weigel D & Zhu J-K (2008) Criteria for annotation of plant MicroRNAs. Plant Cell 20, 3186–3190.

19 - Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N & Rokhsar DS (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 40, D1178–86.

20 - Kersey PJ, Allen JE, Armean I, Boddu S, Bolt BJ, Carvalho-Silva D, Christensen M, Davis P, Falin LJ, Grabmueller C, Humphrey J, Kerhornou A, Khobova J, Aranganathan NK, Langridge N, Lowy E, McDowall MD, Maheswari U, Nuhn M, Ong CK & Staines DM (2016) Ensembl Genomes 2016: more genomes, more complexity. Nucleic Acids Res 44, D574–80.
