Positional annotation of lncRNAs in GTF file
Annotator
By Rory Johnson, University of Bern: www.gold-lab.org
#VERSION 2: changed the intergenic lncRNA-protcod distance estimation. The reported distance is now the one given by BEDTOOLS, i.e. the nearest-to-nearest distance (between the closest ends of the two features), NOT the promoter-promoter distance as previously.
#VERSION 3: the commands that create the lncRNA and pc BED files (genes, transcripts, exons) now use Perl pattern matching, so they are not sensitive to the order in which fields appear in the description column of the GTF, which can sometimes change.
#VERSION 5: follows from V3 (except for a fragment from V4, where indicated). Removes the extra V4 functionality while fixing the problem with unstranded transcripts.
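
The difference between the two distance definitions mentioned under VERSION 2 can be illustrated with a minimal Python sketch. This is not the script's own code: the `Gene` tuple, the function names and the coordinates are hypothetical, and BEDTOOLS' exact conventions (for example, how touching features are counted) may differ from the simple gap computed here.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    start: int   # 0-based, inclusive (BED-style); hypothetical example type
    end: int     # 0-based, exclusive (BED-style)
    strand: str  # "+" or "-"

def edge_distance(a: Gene, b: Gene) -> int:
    """Gap in bp between the nearest ends; 0 if the intervals overlap or touch."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))

def tss(g: Gene) -> int:
    """Transcription start site: the 5' end of the gene on its own strand."""
    return g.start if g.strand == "+" else g.end - 1

def promoter_distance(a: Gene, b: Gene) -> int:
    """Distance between the two transcription start sites."""
    return abs(tss(a) - tss(b))

# Made-up coordinates for illustration only
lnc = Gene(1000, 2000, "+")
pc = Gene(5000, 9000, "-")
print(edge_distance(lnc, pc))      # 3000
print(promoter_distance(lnc, pc))  # 7999
```

For a long protein-coding gene on the opposite strand, the two measures can differ by nearly the full gene length, as in this example.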
Input files: Two separate GTF format files - (1) lncRNA annotation and (2) full annotation, both available from gencodegenes.org.
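
Order-independent extraction of fields from the GTF description column (see VERSION 3) might look like the sketch below. It is written in Python rather than the Perl used by the script, and the helper name `get_attr` and the IDs are hypothetical. It assumes the usual GTF attribute syntax of `key "value";` pairs.

```python
import re

def get_attr(attributes: str, key: str):
    """Return the value of `key` from a GTF attribute column, or None if absent.

    The key is matched wherever it occurs, so the order of the
    attributes within the column does not matter.
    """
    pattern = r'(?:^|;)\s*' + re.escape(key) + r'\s+"([^"]*)"'
    m = re.search(pattern, attributes)
    return m.group(1) if m else None

# Same attributes in two different orders (made-up IDs)
a = 'gene_id "ENSG1"; transcript_id "ENST1"; gene_type "lncRNA";'
b = 'transcript_id "ENST1"; gene_type "lncRNA"; gene_id "ENSG1";'
print(get_attr(a, "gene_id"), get_attr(b, "gene_id"))  # ENSG1 ENSG1
```

Anchoring the key at the start of the column or after a semicolon avoids matching a longer key that merely ends in the same text.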
Output format: LncRNA_Transcript_ID, Nearest_protein_coding_gene_ID, Positional_annotation_class, Distance (bp)
Distances (ONLY for genic lncRNAs): negative = lncRNA upstream of protcod, positive = lncRNA downstream of protcod
#(0) unstranded, intergenic
#(1) samestrand, lincRNA upstream
#(2) divergent
#(3) samestrand, protcod upstream
#(4) convergent
#(5) intronic_AS
#(6) intronic_SS
#(7) exonic_AS
#(8) exonic_SS
#(9) unstranded, genic
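
As a rough illustration of how the intergenic classes (0) to (4) could be assigned from strand and relative position, here is a Python sketch. The rules are assumptions based on the conventional meaning of the terms (divergent = opposite strands, transcribed away from each other; convergent = opposite strands, transcribed towards each other). The script's actual logic is not shown here and may differ. The names `CLASS_LABELS` and `intergenic_class` are hypothetical.

```python
CLASS_LABELS = {
    0: "unstranded, intergenic",
    1: "samestrand, lincRNA upstream",
    2: "divergent",
    3: "samestrand, protcod upstream",
    4: "convergent",
}

def intergenic_class(lnc_strand: str, pc_strand: str, lnc_left_of_pc: bool) -> int:
    """Assign class 0-4 to an intergenic lncRNA and its nearest protcod gene.

    lnc_left_of_pc: True if the lncRNA has the lower genomic coordinates.
    Assumption: a pair is "unstranded" if either strand is not "+" or "-".
    """
    stranded = {"+", "-"}
    if lnc_strand not in stranded or pc_strand not in stranded:
        return 0
    if lnc_strand == pc_strand:
        # On "+", upstream means lower coordinates; on "-", higher coordinates.
        lnc_upstream = lnc_left_of_pc if lnc_strand == "+" else not lnc_left_of_pc
        return 1 if lnc_upstream else 3
    # Opposite strands: look at the strand of the left-hand gene.
    left_strand = lnc_strand if lnc_left_of_pc else pc_strand
    # Left gene on "-" and right gene on "+" transcribe away from each other.
    return 2 if left_strand == "-" else 4

print(CLASS_LABELS[intergenic_class("-", "+", True)])  # divergent
```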