Set of utility functions for use with GenomicRanges
- Install R-3.5
- Install devtools
install.packages('devtools')
install.packages('testthat')
- Install gUtils and dependent packages
## allows dependencies that throw warnings to install
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = TRUE)
devtools::install_github('mskilab/gUtils')
Among other features, gUtils provides syntactic sugar on top of basic GenomicRanges functionality, enabling easy piping of interval operations as part of interactive "genomic data science" exploration in R. In all these examples a and b are GRanges (e.g. a might be gene territories and b might be copy number segments or ChIP-seq peaks).
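To try the operators below on something concrete, a pair of toy objects can be built with the standard GenomicRanges constructor. This is a minimal sketch: the coordinates, the gene names and the score values are made up for illustration and do not come from gUtils or any real dataset.

```r
library(GenomicRanges)

# a: three hypothetical "gene territories" with a character metadata column
a <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 500, 200), end = c(300, 900, 400)),
  strand = c("+", "-", "+"),
  gene = c("geneA", "geneB", "geneC")
)

# b: three hypothetical "peaks" with a numeric metadata column;
# strand is left at its default ("*")
b <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(250, 600, 1000), end = c(550, 700, 1200)),
  score = c(1.5, 3.0, 0.5)
)
```

With these in place, expressions such as a %&% b or a %$% b from the sections below can be evaluated interactively once gUtils is loaded.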
Subsets or re-orders a based on a logical or integer-valued expression that operates on the GRanges metadata columns of a.
a %Q% (expr)
a %Q% (col1 == "value" & col2 > 0 & col3 < 100)
a %Q% (order(col1))
Performs a "natural join" or merge of the metadata columns of a and b using interval overlap as a "primary key", and outputs a new GRanges whose maximum length is length(a)*length(b). (See gr.findoverlaps for more complex queries, including the by argument, which merges on a hybrid primary key combining both metadata and interval territories.)
a %*% b # strand-agnostic merging
a %**% b # strand-specific merging
## more expressive merges
gr.findoverlaps(a, b,
by = 'column_in_both_a_and_b', qcol = c('acolumn1', 'acolumn2'), scol = c('bcolumn1', 'bcolumn2'))
Aggregates the metadata in b across the territory of each range in a. This returns a with additional metadata columns taken from b, with values aggregated over the overlap between a and b. For character or factor-valued metadata columns of b, aggregation returns a comma-collapsed character value of all b values (e.g. gene names) that overlap a[i]. For numeric columns of b, it returns the width-weighted mean value (e.g. peak intensity) of that column across the overlap between a[i] and b. For custom aggregations, see the gr.val function.
a %$% b # strand-agnostic aggregation
a %$$% b # strand-specific aggregation
# for additional customization
# gr.val aggregates and casts data using levels of column "sample_id"
# and a custom function (e.g. max, mode, median) that takes three values as input,
# where width refers to the width of the overlaps between a[i] and b[jj]
gr.val(a, b, val = c('field1', 'field2'),
by = 'sample_id', FUN = function(value, width, is.na) my_cool_fn(value, width, is.na))
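The "width-weighted mean" used for numeric columns can be illustrated with plain base-R arithmetic. The sketch below does not call gUtils; it assumes, following the description above, that each overlapping b range is weighted by the number of bases it shares with a[i]. All coordinates and scores are hypothetical.

```r
# One hypothetical query interval a[i] and two overlapping b intervals
a_start <- 100; a_end <- 299
b_start <- c(50, 200)
b_end   <- c(149, 400)
b_score <- c(2, 10)

# Width (in bases) of the overlap between a[i] and each b interval
ov_width <- pmin(a_end, b_end) - pmax(a_start, b_start) + 1

# Mean of the b scores, weighted by overlap width
wmean <- weighted.mean(b_score, w = ov_width)
```

Here the second b interval shares twice as many bases with a[i] as the first, so its score counts twice as much toward the result.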
Returns the subset of ranges in a that overlap with at least one range in b.
a %&% b # strand-agnostic
a %&&% b # strand-specific
Returns a length(a) numeric vector whose item i is the fraction of the width of a[i] that overlaps at least one range in b.
a %O% b # strand-agnostic
a %OO% b # strand-specific
Returns a length(a) numeric vector whose item i is the number of bases in a[i] that overlap at least one range in b.
a %o% b # strand-agnostic
a %oo% b # strand-specific
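The difference between the fraction returned by %O% and the base count returned by %o% is easiest to see when the b ranges overlap each other: a base of a[i] covered by several b ranges is still counted once. The following base-R sketch reproduces that counting for a single hypothetical interval; it is an illustration of the definition above, not the gUtils implementation.

```r
# Hypothetical coordinates: one query interval and three subject intervals,
# the first two of which overlap each other
a_start <- 101; a_end <- 200
b_start <- c(90, 110, 190)
b_end   <- c(120, 150, 300)

# Mark each base of the query that falls inside at least one subject interval
positions <- a_start:a_end
covered <- vapply(positions,
                  function(p) any(p >= b_start & p <= b_end),
                  logical(1))

# Analogue of (a %o% b)[i]: number of covered bases
bases_overlapping <- sum(covered)

# Analogue of (a %O% b)[i]: covered bases as a fraction of the query width
fraction_overlapping <- bases_overlapping / length(positions)
```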
Returns a length(a) numeric vector whose item i is the total number of ranges in b that overlap with a[i].
a %N% b # strand-agnostic
a %NN% b # strand-specific
Returns a length(a) logical vector whose item i is TRUE if a[i] overlaps at least one range in b (similar to %over%, just less fussy about Seqinfo).
a %^% b # strand-agnostic
a %^^% b # strand-specific
Returns a length(a) integer vector whose item i contains the first index in b overlapping a[i] (this function is the match cousin to %over% and %^%).
gr.match(a, b) # strand-agnostic
gr.match(a, b, ignore.strand = FALSE) # strand-specific
gr.match(a, b, by = 'sample_id') # match on metadata column "sample_id" as well as interval
Shifts intervals right by k bases.
a %+% k
Shifts intervals left by k bases.
a %-% k
Tiles a, or the genome in which a resides (as defined by seqlengths(a)), with non-overlapping bins of width w.
gr.tile(a, w) ## outputs non-overlapping tiles of a
gr.tile(seqlengths(a), w) ## outputs non-overlapping tiles of a's genome
gr.tile(seqlengths(a), 100)+450 # tiles a's genome with 1kbp bins having 900bp overlap
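The arithmetic behind the last line (100bp tiles padded by 450bp on each side giving 1kbp bins with 900bp overlap) can be checked in base R. This sketch uses a hypothetical chromosome length of 1000 and does not model how GRanges treats coordinates that fall outside the chromosome bounds after padding.

```r
genome_length <- 1000   # hypothetical chromosome length
tile_width <- 100
pad <- 450

# Non-overlapping tiles of width 100
tile_start <- seq(1, genome_length, by = tile_width)
tile_end <- pmin(tile_start + tile_width - 1, genome_length)

# Adding 450 to a set of ranges widens each one by 450 bases on both sides
padded_start <- tile_start - pad
padded_end <- tile_end + pad
padded_width <- padded_end - padded_start + 1

# Bases shared by each padded bin and the next one
neighbor_overlap <- padded_end[-length(padded_end)] - padded_start[-1] + 1
```

Each padded bin is 100 + 2 * 450 = 1000 bases wide, and because consecutive bins start 100 bases apart, neighbors share 1000 - 100 = 900 bases.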
Returns a GRanges of the first coordinate (or first k coordinates) in each interval, in a strand-agnostic or strand-specific manner.
gr.start(a) # returns an interval corresponding to the left coordinate
gr.start(a, k) # returns the first k bases on the left end of a
# returns an interval corresponding to the left coordinate in '+' and '*' ranges and the right coordinate in '-' ranges
gr.start(a, ignore.strand = FALSE)
Returns a GRanges of the last coordinate (or last k coordinates) in each interval, in a strand-agnostic or strand-specific manner.
gr.end(a) # returns an interval corresponding to the right coordinate
gr.end(a, k) # returns the last k bases on the right end of a
# returns an interval corresponding to the right coordinate in '+' and '*' ranges and the left coordinate in '-' ranges
gr.end(a, ignore.strand = FALSE)
Full documentation with examples is available here: Documentation
Marcin Imielinski - Assistant Professor, Weill Cornell Medicine; Core Member, New York Genome Center
Jeremiah Wala - Harvard MD-PhD candidate, Bioinformatics and Integrative Genomics, Rameen Beroukhim Lab, Dana Farber Cancer Institute