[SPARK-20437][R] R wrappers for rollup and cube #17728
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3642,3 +3642,58 @@ setMethod("checkpoint", | |
| df <- callJMethod(x@sdf, "checkpoint", as.logical(eager)) | ||
| dataFrame(df) | ||
| }) | ||
|
|
||
|
|
||
| #' cube | ||
| #' | ||
| #' Create a multi-dimensional cube for the SparkDataFrame using the specified columns. | ||
| #' | ||
| #' @param x a SparkDataFrame. | ||
| #' @param ... character name(s) or Column(s) to group on. | ||
| #' @return A GroupedData. | ||
| #' @family SparkDataFrame functions | ||
| #' @aliases cube,SparkDataFrame-method | ||
| #' @rdname cube | ||
| #' @name cube | ||
| #' @export | ||
| #' @examples | ||
| #' \dontrun{ | ||
| #' df <- createDataFrame(mtcars) | ||
| #' mean(cube(df, "cyl", "gear", "am"), "mpg") | ||
| #' } | ||
| #' @note cube since 2.3.0 | ||
| setMethod("cube", | ||
| signature(x = "SparkDataFrame"), | ||
| function(x, ...) { | ||
| cols <- list(...) | ||
|
Member
check length of cols is > 0?
Member
Author
I think we can skip that.
Member
hmm, it's a bit odd to call rollup or cube that way but ok if other languages leave that open too. but I'd say we should add a line to explain "rollup or cube without column is the same as group_by" (or something better)
Member
if you want to support empty parameter let's add some tests for it then?
||
| jcol <- lapply(cols, function(x) if (class(x) == "Column") x@jc else column(x)@jc) | ||
| sgd <- callJMethod(x@sdf, "cube", jcol) | ||
| groupedData(sgd) | ||
| }) | ||
|
|
||
| #' rollup | ||
| #' | ||
| #' Create a multi-dimensional rollup for the SparkDataFrame using the specified columns. | ||
| #' | ||
| #' @param x a SparkDataFrame. | ||
| #' @param ... character name(s) or Column(s) to group on. | ||
| #' @return A GroupedData. | ||
| #' @family SparkDataFrame functions | ||
| #' @aliases rollup,SparkDataFrame-method | ||
| #' @rdname rollup | ||
| #' @name rollup | ||
| #' @export | ||
| #' @examples | ||
| #' \dontrun{ | ||
| #' df <- createDataFrame(mtcars) | ||
| #' mean(rollup(df, "cyl", "gear", "am"), "mpg") | ||
| #' } | ||
| #' @note rollup since 2.3.0 | ||
| setMethod("rollup", | ||
| signature(x = "SparkDataFrame"), | ||
| function(x, ...) { | ||
| cols <- list(...) | ||
|
Member
check length of cols
||
| jcol <- lapply(cols, function(x) if (class(x) == "Column") x@jc else column(x)@jc) | ||
| sgd <- callJMethod(x@sdf, "rollup", jcol) | ||
| groupedData(sgd) | ||
| }) | ||
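The review threads above ask whether `cube` and `rollup` should check that at least one grouping column was passed. A minimal sketch of such a check in base R, assuming a hypothetical helper `validate_group_cols` that is not part of the PR or of SparkR (the PR author chose to skip the check and allow an empty column list):

```r
# Hypothetical helper, not part of SparkR: validates grouping arguments
# before they would be converted to Java column references.
validate_group_cols <- function(...) {
  cols <- list(...)
  # Assumption: an empty column list is rejected here. The PR itself
  # leaves this case open, as the other language APIs do.
  if (length(cols) == 0) {
    stop("at least one grouping column is required")
  }
  ok <- vapply(cols, function(col) {
    # inherits() stays scalar even when class(col) has several entries,
    # unlike a comparison of the form class(col) == "Column".
    (is.character(col) && length(col) == 1) || inherits(col, "Column")
  }, logical(1))
  if (!all(ok)) {
    stop("grouping columns must be character names or Column objects")
  }
  cols
}
```

Calling `validate_group_cols("cyl", "gear")` returns the two names as a list, while `validate_group_cols()` raises an error.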
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -308,6 +308,21 @@ numCyl <- summarize(groupBy(carsDF, carsDF$cyl), count = n(carsDF$cyl)) | |
| head(numCyl) | ||
| ``` | ||
|
|
||
| `groupBy` can be replaced with `cube` or `rollup` to compute subtotals across multiple dimensions. | ||
|
Member
minor: I wouldn't say replace because they are not functionally the same?
Member
do you think the programming guide can use updates too?
Member
Author
I keep forgetting there is one. I think we can add a few lines. This is actually a pretty neat feature.
||
|
|
||
| ```{r} | ||
| mean(cube(carsDF, "cyl", "gear", "am"), "mpg") | ||
| ``` | ||
|
|
||
| generates groupings for all possible combinations of grouping columns, while | ||
|
|
||
| ```{r} | ||
| mean(rollup(carsDF, "cyl", "gear", "am"), "mpg") | ||
| ``` | ||
|
|
||
| generates groupings for {(`cyl`, `gear`, `am`), (`cyl`, `gear`), (`cyl`), ()}. | ||
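For reference, `cube` aggregates over every subset of the grouping columns, while `rollup` aggregates only over the hierarchical prefixes of the column list. A minimal base R sketch that enumerates those grouping sets without a Spark session; the helper names `rollup_sets` and `cube_sets` are hypothetical and not part of SparkR:

```r
# Grouping sets for rollup: hierarchical prefixes, longest first,
# ending with the empty set (the grand total).
rollup_sets <- function(cols) {
  lapply(length(cols):0, function(n) cols[seq_len(n)])
}

# Grouping sets for cube: every subset of the columns,
# starting with the empty set (the grand total).
cube_sets <- function(cols) {
  sets <- list(character(0))
  for (k in seq_along(cols)) {
    # All subsets of size k, each as a character vector.
    sets <- c(sets, utils::combn(cols, k, simplify = FALSE))
  }
  sets
}

cols <- c("cyl", "gear", "am")
length(rollup_sets(cols))  # 4 grouping sets
length(cube_sets(cols))    # 8 grouping sets
```

With three columns, `rollup` yields 4 grouping sets and `cube` yields 2^3 = 8, which is why the two are not interchangeable with `groupBy`.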
|
|
||
|
|
||
| #### Operating on Columns | ||
|
|
||
| SparkR also provides a number of functions that can be directly applied to columns for data processing and during aggregation. The example below shows the use of basic arithmetic functions. | ||
|
|
||
nit: extra newline