-
Notifications
You must be signed in to change notification settings - Fork 13
/
README
525 lines (401 loc) · 23 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
LICENSE
CleaveLand4.pl
Copyright (c) 2013-2018 Michael J. Axtell
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License along
with this program. If not, see <http://www.gnu.org/licenses/>.
SYNOPSIS
CleaveLand4 : Finding evidence of sliced targets of small RNAs from
degradome data
AUTHOR
Michael J. Axtell, Penn State University, [email protected]
VERSION
4.5 : July 11, 2018
INSTALL
Dependencies - Required Perl Modules
Getopt::Std
Math::CDF
CleaveLand4.pl is a perl program, so it needs perl installed on your
system. It was developed on perl version 5.10.0, and hasn't been tested
on other versions (but there is no reason to suspect problems with other
perl 5.x versions). CleaveLand4.pl will not compile unless the above two
Perl modules are also installed in your perl's @INC. Getopt::Std is
pre-loaded into most (all?) Perl distros. But you may need to install
Math::CDF from CPAN. Only one Math::CDF function is used by CleaveLand4
('pbinom').
Dependencies - PATH executables
bowtie (version 0.12.x or 1.x)
bowtie-build
RNAplex (from Vienna RNA package)
GSTAr.pl (Version 1.0 or higher -- distrubuted with CleaveLand4)
R
samtools
All of the above must be executable from your PATH. Depending on the
mode of the CleaveLand4 run (see below), only a subset of these programs
may be required for a given run.
Installation
Except for the above dependencies, there is no "real" installation. If
the script is in your working directory, you can call it with
./CleaveLand4.pl
For convenience, you can add it to your PATH. e.g.
sudo mv CleaveLand4.pl /usr/bin/
GSTAr.pl expects to find perl in /usr/bin/perl .. if not, edit line 1
(the hashbang) accordingly.
USAGE
CleaveLand4.pl [options] > [out.txt]
Log and progress information goes to STDERR, and can be suppressed with
option -q (quiet mode).
Options
-h Print help message and quit
-v Print version and quit
-q Quiet mode .. no log/progress information to STDERR
-a Sort small RNA / transcript alignments by Allen et al. score* instead
of default MFEratio -- for GSTAr
-t Output in tabular format instead of the default verbose format
-r [float >0..1] Minimum Free Energy Ratio cutoff. Default: 0.65 -- for
GSTAr
-o [string] : Produce T-plots in the directory indicated by the string.
If the dir does not exist, it will be created
-d [string] : Path to degradome density file.
-e [string] : Path to FASTA-formatted degradome reads.
-g [string] : Path to GSTAr-created tabular formatted query-transcript
alignments.
-u [string] : Path to FASTA-formatted small RNA queries
-n [string] : Path to FASTA-formatted transcriptome
-p [float >0..1] : p-value for reporting. Default is 1 (no p-value
filtering).
-c [integer 0..4] : Maximum category for reporting. Default is 4 (all
categories reported).
*Allen et al. score: This is a score based on the position-specific
penalties described by Allen et al. (2005) Cell, 121:207-221 [PMID:
15851028]. Specifically, mismatched query bases or target-bulged bases,
are penalized 1. G-U wobbles are penalized 0.5. These penalties are
double within positions 2-13 of the query.
Modes
CleaveLand4 runs in one of four different modes. Each mode has a
required set of options and a disallowed set of options, as described
below:
Mode 1: Align degradome data and create degradome density file, perform
new small RNA query/transcriptome alignment with GSTAr, and analyze.
Required options: -e, -u, -n. Disallowed options: -d, -g.
Mode 2: Use existing degradome density file, perform new small RNA
query/transcriptome alignment with GSTAr, and analyze. Required options:
-d, -u, -n. Disallowed options: -e, -g.
Mode 3: Align degradome data and create degradome density file, use
existing GSTAr alignments, and analyze. Required options: -e, -n, -g.
Disallowed options: -d, -u.
Mode 4: Use existing degradome density file and existing GSTAr
alignments, and analyze. Required options: -d, -g. Disallowed options:
-e, -u.
METHODS
Degradome data --> transcriptome alignments --> degradome density file creation (modes 1 and 3)
Degradome data is aligned to the reference transcriptome using bowtie.
If needed, the bowtie indices for the transcript are built with
bowtie-build using default parameters. This results in the creation of
six files, each including "ebwt" in their suffix. Alignment parameters
allow zero or one mismatch, and are only allowed to the forward strand
of the transcriptome. In the case of multiple valid alignments only one
is randomly selected and reported. (The specific bowtie command used is
"bowtie -f -v 1 --best -k 1 --norc -S"). The alignment process uses
samtools to generate a sorted BAM alignment file from the bowtie output
stream.
After creation of the sorted BAM alignment file, the alignments are
parsed to quantify the density of observed 5' ends at each nt of the
transcriptome. The results are written to a 'degradome density' file in
the working directory. The BAM alignment file is deleted at the
completion of this process. The degradome density files contain the
position of the transcript, the number of 5' ends at that position, and
the degradome peak 'category'. Categories are determined as follows:
Category 4: Just one read at that position
Category 3: >1 read, but below or equal to the average* depth of
coverage on the transcript
Category 2: >1 read, above the average* depth, but not the maximum on
the transcript
Cateogry 1: >1 read, equal to the maximum on the transcript, when there
is >1 position at maximum value
Cateogry 0: >1 read, equal to the maximum on the transcript, when there
is just 1 position at the the maximum value
* Note that the average does not include all of the 'zeroes' for
non-occupied positions within a transcript. Instead, it is the average
of all positions that have at least one read.
Small RNA --> transcriptome alignments with GSTAr (modes 1 and 2)
Potential target sites are generated with GSTAr.pl, which ships with
CleaveLand4. Options -r and -a are passed to GSTAr. By default,
potential target sites are sorted in descending order by MFE ratio. If
the -a switch is present, this is changed to ascending order based on
Allen et al. score. Note that GSTAr can be expected to take 90-120
seconds per query when analyzing a typically sized transcritpome.
GSTAr.pl is always called to output in tabular format by CleaveLand4.pl.
Only alignments to the reverse-comp strand of the transcriptome are
considered. Upon completion, a GSTAr alignment file is written to the
working directory. See the GSTAr documentation for more details on this
program.
Analysis (all modes)
After loading valid degradome density and GSTAr alignment files,
CleaveLand4 first checks to ensure that the transcriptome (as noted in
the headers) is the same. If so, analysis progresses. For each alignment
in the GSTAr alignment file, CleaveLand4 searches the degradome density
file to see if there are any degradome reads at the predicted slicing
site. If there are, a p-value is calculated. The p-value takes into
account both the noise in the degradome density file and the quality of
the small RNA-transcriptome alignment. First, the chances of observing a
deagrdome 'peak' of the given category by random chance is calculated.
The chance is the total number of peaks of the given category divided by
the effective transcriptome size*. Then, the quality of the alignment is
simply the rank of the alignment in the GSTAr alignment file (which is
either sorted by MFE ratio [default] or by Allen et al. score). The
p-value is calculated as the binomial probability of observing one or
more 'hits' in 'x' trials given probability 'c', where 'x' is the rank
of the alignment, and 'c' is the chance described above.
* The effective transcriptome size is the total number of bases in the
transcriptome - (n * mean_read_size), where 'n' is the number of
transcripts. This adjustment accounts for the fact that the very ends of
the transcripts could not possibly have any mapped 5' ends.
Any hits with a p-value <= the cutoff specified by option -p AND a
category <= the cutoff specified by option -c are output to STDOUT. By
default, all hits are reported (option p default is 1, option c default
is 4).
INPUT FILE FORMAT REQUIREMENTS
Newlines
All files are assumed to have "\n" as newline characters. Files with
MS-DOS text encoding, or others, that do not conform to this assumption
will cause unexpected behavior and likely meaningless results.
Transcriptome (option -n)
This must be a multiline FASTA file. The headers should be short and
simple and devoid of whitespace (e.g. ">AT1G12345" is good, ">AT1G12345
| this is my favorite gene | it is awesome" is not. The filename of the
transcriptome file should also be devoid of whitespace.
Degradome reads (option -e)
This must be a multiline FASTA file. The reads are assumed to have
already clipped to remove adapters. Furthermore, the reads must not have
been collapsed in any way. In other words, each read off the sequencer
should have an entry. Sequences that appear 50 times in the raw data
from 50 different reads should each have a line.
Finally, CleaveLand4 assumes that each degradome read represents the
5'-3' sequence of a transcript, and that the first nt of each read
represents the 5' end of an RNA.
Small RNA Queries (option -u)
This must be a multiline FASTA file with the full sequence of a given
small RNA on one line (e.g. each line is either a header beginning with
">" or the full-length sequence of the small RNA). The headers should be
short and simple and devoid of whitespace (e.g. ">ath-miR169a" is good,
">ath-miR169a MIMAT0000200 Arabidopsis thaliana miR169a" is not.
Sequences can have either T's or U's.
Note: miRBase often has several mature miRNAs with exactly the same
sequence, relfecting paralogous miRNA genes within a species. There is
no use querying the same sequence multiple times, so it is a good idea
to collapse the redundancy by query sequence when making a file of small
RNA queries.
Degradome density files (option -d)
Most of the time, these will be files created by previous runs of
CleaveLand4 that will have the suffix "_dd.txt". If you don't like the
alignment parameters that CleaveLand4 uses, you could create your own
degradome denstiy files. The format specification is:
Header region: Lines begin with "#". Here is an example:
[line1]# CleaveLand4 degradome density
[line2]# Fri Sep 13 09:21:38 EDT 2013
[line3]# Degradome Reads:GSM278335.fasta
[line4]# Transcriptome:TAIR10_cdna_20110103_rgmupdated_cleand.fasta
[line5]# TranscriptomeCharacters:51074197
[line6]# Mean Degradome Read Size:20
[line7]# Estimated effective Transcriptome Size:50402157
[line8]# Category 0:16430
[line9]# Category 1:3456
[line10]# Category 2:95747
[line11]# Category 3:207062
[line12]# Category 4:78279
CleaveLand4 demands that the first line of a valid degradome density
file is "# CleaveLand4 degradome density". It also requires all other
header lines, except the date on line 2, to be present. All of this
information is required for analysis.
Data Region: Each transcript begins with two lines as follows:
[line1]@ID:AT1G50920.1
[line2]@LN:2394
The @ID: gives the name of the transcript, while the @LN: gives the
length of the transcript. After this, each line gives a one-based
position, the number of 5' ends at that positions, and the degradome
category. The data lines are tab-delimited. Note that positons with zero
reads are NOT shown.
GSTAr query-transcriptome alignments (option -g)
These are files created by GSTAr. If they were created as part of a
CleaveLand4 run, they will have the suffix "_GSTAr.txt". They must be in
the 'tabular' format, and have a proper header as shown below:
[line1]# GSTAr version 1.0
[line2]# Thu Sep 12 13:56:58 EDT 2013
[line3]# Queries: test_mir.fasta
[line4]# Transcripts: TAIR10_cdna_20110103_rgmupdated_cleand.fasta
[line5]# Hit seed length required to initiate RNAplex analysis (option
-s): 7
[line6]# Minimum Free Energy Ratio cutoff (option -r): 0.65
[line7]# Sorted by: MFEratio
[line8]# Output Format: Tabular
Is is strongly recommended NOT to try to produce these files by means
not involving CleaveLand4/GSTAr.
OUTPUT
Pretty format
By default, CleaveLand prints hits that pass the p-value and category
filters to STDOUT in a human-readble, verbose format that is
self-explanatory. A header (lines beginning with "#") is printed giving
basic information on the analysis.
Tabular format
If option -t is specified, any hits passing the p-value filter are
printed in a tab-delimited format. First, a header (lines beginning with
"#") is printed giving basic information on the analysis. After that, a
line giving the names of the columns is printed. Each subsequent line
gives information on a single hit. The format is similar to that of the
GSTAr alignments. Column information is:
1: SiteID: A unique name (within the scope of the output of a particular
run) for the putative slicing site. In the form
"[transcript]:[slice_site]". The output is sorted by SiteID.
2: Query: Name of query
3: Transcript: Name of transcript
4: TStart: One-based start position of the alignment within the
transcript
5: TStop: One-based stop position of the alignment within the transcript
6: TSLice: One-based position of the alignment opposite position 10 of
the query
7: MFEperfect: Minimum free energy of a perfectly matched site
(approximate)
8: MFEsite: Minimum free energy of the alignment in question
9: MFEratio: MFEsite / MFEperfect
10: AllenScore: Penalty score calculated per Allen et al. (2005) Cell,
121:207-221 [PMID: 15851028].
11: Paired: String representing paired positions in the query and
transcript. The format is Query5'-Query3',Transcript3'-Transcript5'.
Positions are one-based. Discrete blocks of pairing are separated by ;
12: Unpaired: String representing unpaired positions in the query and
transcript. The format is
Query5'-Query3',Transcript3'-Transcript5'[code]. Possible codes are
"UP5" (Unpaired region at 5' end of query), "UP3" (Unpaired region at 3'
end of query), "SIL" (symmetric internal loop), "AILt" (asymmetric
internal loop with more unpaired nts on the transcript side), "AILq"
(asymmetric internal loop with more unpaired nts on the query side),
"BULt" (Bulged on the transcript side), or "BULq" (bulged on the query
side). Positions are one-based. Discrete blocks of pairing are separated
by ;
13: Structure: Aligned secondary structure. The region before the "&"
represents the transcript, 5'-3', while the region after the "&"
represents the query, 5'-3'. "(" represents a transcript base that is
paired, ")" represents a query based that is paired, "." represents an
unpaired base, and "-" represents a gap inserted to facilitate
alignment.
14: Sequence: Aligned sequence. The region before the "&" represents the
transcript, 5'-3', while the region after the "&" represents the query,
5'-3'.
15: DegradomeCategory:
Category 4: Just one read at that position
Category 3: >1 read, but below or equal to the average* depth of coverage on the transcript
Category 2: >1 read, above the average* depth, but not the maximum on the transcript
Cateogry 1: >1 read, equal to the maximum on the transcript, when there is >1 position at maximum value
Cateogry 0: >1 read, equal to the maximum on the transcript, when there is just 1 position at the the maximum value
16: DegradomePval: p-value for this degradome hit.
17: Tplot_file_path: File path for the T-plot of this hit.
* Note that the average does not include all of the 'zeroes' for
non-occupied positions within a transcript. Instead, it is the average
of all positions that have at least one read. =head2 T-plots
If the user requests them by including the -o option, T-plots for each
hit that passes the p-value cutoff are created and written to the
directory specificied by option -o. The black line on the plot shows all
of the degradome data, and the red dot shows the putative slicing site.
The title of each T-plot indicates the transcript ("T="), the query
("Q="), and the putative slicing site ("S="), as well as the category
and p-value.
Existing T-plot files in the -o directory with the same name will be
over-written without warning.
WARNINGS
Don't believe the hype - part 1
Under default settings, CleaveLand4 reports ALL putative slicing sites
with ANY degradome reads at all, regardless of the liklihood of a given
hit of being due to random chance. Without any filtering, most of your
hits probably ARE due to random chance. This means that there will be
many many hits of Category 4 (just one read) and/or at very high
p-values. Exercise skepticism when interpreting these results. More
confidence in the reality of a given slicing event can come from
restricting analysis to hits with low p-values and/or high categories,
and, even better, observing the slicing event in multiple degradome
libraries.
Don't believe the hype - part 2
The p-value calculation is built around the ASSUMPTION that the rank
order of alignments for a given query reflects their liklihood of being
functional. Under default settings, GSTAr will sort the alignments for
each query based on descending MFE ratio. Alternatively, when option a
is specified for the GSTAr run, they will be sorted in ascending order
according the Allen et al. score. The extent to which the p-values are
trustworthy is dictated directly by the extent to which you believe that
MFEratio or Allen et al. scores are predictive of function. If you don't
believe that, you should treat the p-values with due skepticism, and
focus instead on high category hits and reproducibility between
libraries.
Not for whole genomes
Degradome alignment by CleaveLand4 only searches the top strand of the
transcriptome. Also, GSTAr holds the entire contents of the
transcripts.fasta file in memory to speed the isolation of
sub-sequences, and CleaveLand4 holds the entire contents of the
degradome density file in memory. This will be impractical in terms of
memory usage if a user attempts to load a whole genome. Similarly, GSTAr
will only search for pairing between the top strand of the
transcripts.fasta file, making it also impractical for a genome
analysis, where sites might be on either strand.
Temp files
CleaveLand4 writes several temp files during the course of a run. So,
don't mess with them during a run. In addition, it is a very bad idea to
have two CleaveLand4 runs operating concurrently from the same working
directory. CleaveLand4 will clean up all temp files at the conclusion of
a run.
Not too fast in modes 1 and 2 (and maybe 3).
GSTAr is a very fast intermolecular RNA-RNA hybridization calculator.
But when applied to whole transcriptomes, it is still very
time-consuming. When running in modes 1 or 2, plan on about 90-120
seconds per query during the GSTAr phase. Additionally, bowtie
alignments and index bulding (modes 1 or 3) can also be time-consuming.
Finally, requesting T-plots can also slow things down, especially if a
great number of hits are being returned.
No ambiguity codes
Query sequences with characters other than A, T, U, C, or G
(case-insensitive) will not be analyzed, and a warning will be sent to
the user. Transcript sub-sequences for potential query alignments will
be *silently* ignored if they contain any characters other than A, T, U,
C,or G (case-insensitive).
Small queries
GSTAR demands that query sequences must be small (15-26 nts). Queries
that don't meet these size requirements will not be analyzed and a
warning sent to the user.
No redundancy
For a GIVEN QUERY, GSTAr alignments are non-redundant in terms of the
slicing site of the alignment. However, a single query can have multiple
overlapping alignment patterns that have differing predicted slicing
sites. In addition, if multiple queries are similar in sequence, the
same alignment (in terms of putative slicing site) can be present
multiple times for different queries. If CleaveLand4 identifies more
than one alignment at the same putative slicing site, it will only
report the one with the best (lowest) p-value (subject to the maximum
allowed p-value, option -p). Therefore there should be no redundancy in
the putative slice sites returned by a given CleaveLand4 run.
No reverse-compatibility
Degradome density files created by versions of CleaveLand prior to 4.0
are NOT compatible with CleaveLand 4. Sorry.
Change in category definitions
The categories used by CleaveLand4 differ slightly from those used in
CleaveLand3 and earlier. Specifically, categories 3 and 2 now rely upon
calculating the mean, not the median, level of coverage in the
transcript. In addition, transcript positions with zero coverage are no
longer included in the calculation. The effect of this is to make
category 2 hits much more rare, and category 3 hits much more common.
Slicing at position 10
CleaveLand4 only looks for evidence of slicing at position 10 relative
to the aligned small RNA. There is no ambiguity -- data at position 11
or 9 is not relevant to CleaveLand4. This is because, as far as I know,
there is no direct evidence showing Argonaute proteins cut anywhere
besides position 10. However, there IS clear evidence for isomirs:
lower-abundance variants of miRNAs with alternative 5' or 3' ends.
Isomirs with alternative 5' ends could certainly cause offset slicing.
If you wish to search for slicing at slightly 'off' locations with
CleaveLand4, you will need to explictly query with the isomirs of
interest.