
[SPARK-30682][SPARKR][SQL] Add SparkR interface for higher order functions #27433

Closed
zero323 wants to merge 6 commits into apache:master from zero323:SPARK-30682

Conversation

@zero323 (Member) commented Feb 2, 2020

What changes were proposed in this pull request?

This PR adds an R API for invoking the following higher order functions:

  • transform -> array_transform (to avoid conflict with base::transform).
  • exists -> array_exists (to avoid conflict with base::exists).
  • forall -> array_forall (no conflicts, renamed for consistency).
  • filter -> array_filter (to avoid conflict with stats::filter).
  • aggregate -> array_aggregate (to avoid conflict with stats::aggregate).
  • zip_with -> arrays_zip_with (no conflicts, renamed for consistency).
  • transform_keys
  • transform_values
  • map_filter
  • map_zip_with

The overall implementation follows the same pattern as proposed for PySpark (#27406) and reuses the objects supporting the Scala implementation (SPARK-27297).
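
As background, the R side of this pattern can inspect a user-supplied lambda with plain base R: `formals()` exposes the function's parameter list, from which lambda arity and variable names can be derived. A minimal illustrative sketch (not the SparkR internals verbatim):

```r
# Inspecting an R function the way a higher-order-function wrapper could:
fun <- function(x, y) x + y

parameters <- formals(fun)
nparameters <- length(parameters)

print(nparameters)        # 2 -> a two-argument lambda, e.g. for arrays_zip_with
print(names(parameters))  # "x" "y" -> natural lambda variable names
```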

Why are the changes needed?

Currently, higher order functions are available only through the SQL and Scala APIs, and from R they can be used only as SQL expression strings:

select(df, expr("transform(xs, x -> x + 1)"))

This is error-prone and hard to get right when complex logic is involved (when / otherwise, complex objects).

If this PR is accepted, the expression above could be rewritten simply as:

select(df, array_transform("xs", function(x) x + 1))
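
For illustration, a few more of the new wrappers in action (a hypothetical sketch; the toy data and the expected values in comments are assumptions, not output from this patch's test suite):

```r
# Assumes an active SparkR session. Build toy columns: an array `xs` and a map `m`.
df <- select(
  createDataFrame(data.frame(id = 1)),
  expr("array(1, 2, 3) AS xs"),
  expr("map('a', 1, 'b', 2) AS m")
)

collect(select(
  df,
  array_exists("xs", function(x) x > 2),   # expected: TRUE
  array_filter("xs", function(x) x > 1),   # expected: array(2, 3)
  map_filter("m", function(k, v) v > 1)    # expected: map('b', 2)
))
```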

Does this PR introduce any user-facing change?

No (but new user-facing functions are added).

How was this patch tested?

Added new unit tests.
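
The checks follow SparkR's existing testthat conventions; a minimal sketch of the kind of test added (data and expectations are illustrative, assuming an active Spark session):

```r
library(testthat)
# Assumes SparkR is attached and sparkR.session() has been called.

test_that("array_transform applies an R lambda to each array element", {
  df <- select(createDataFrame(data.frame(id = 1)), expr("array(1, 2, 3) AS xs"))
  result <- collect(select(df, array_transform("xs", function(x) x + 1)))
  expect_equal(result[[1]][[1]], c(2, 3, 4))
})
```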

@SparkQA commented Feb 2, 2020

Test build #117731 has finished for PR 27433 at commit 909bfa8.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2020

Test build #117743 has finished for PR 27433 at commit 40d7b23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2020

Test build #117747 has finished for PR 27433 at commit f918098.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Nice! cc @felixcheung and @shivaram.

@HyukjinKwon (Member)

cc @falaki too fyi

@SparkQA commented Feb 4, 2020

Test build #117796 has finished for PR 27433 at commit cefdc0a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 4, 2020

Test build #117801 has finished for PR 27433 at commit 6dcf2b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment

Looks fine. Let me merge in a few days after taking a final look, if other committers can't find time to review.

parameters <- formals(fun)
nparameters <- length(parameters)

stopifnot(
@HyukjinKwon (Member):

@zero323, can we remove this one too here for now? Let's discuss and figure out a better way in the next PR about this.

@zero323 (Member, Author):

Maybe to a lesser extent (the variation in argument types is smaller, but so is the amount of logic required), but overall the same as here: #27406 (comment).

@HyukjinKwon (Member):

Yup, let's talk in #27406 (comment)

@SparkQA commented Feb 6, 2020

Test build #117987 has finished for PR 27433 at commit c68d58f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) commented Feb 8, 2020 via email

@SparkQA commented Feb 8, 2020

Test build #118074 has finished for PR 27433 at commit 6672c14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 8, 2020

Test build #118078 has finished for PR 27433 at commit 23f2c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

Okay, let's merge this in. I will take a separate look for a followup if it's needed.

@SparkQA commented Feb 25, 2020

Test build #118891 has finished for PR 27433 at commit 23f2c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.

@zero323 (Member, Author) commented Feb 28, 2020

Thanks a bunch for your support @HyukjinKwon @felixcheung!

@zero323 deleted the SPARK-30682 branch February 28, 2020 10:52
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…tions

Closes apache#27433 from zero323/SPARK-30682.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jan 18, 2021
…e in higher order functions

### What changes were proposed in this pull request?

This PR is a followup of #27433. It fixes the naming to match the Scala side, similar to #31062.

Note that:

- There is already a bit of inconsistency, e.g. `x` and `y` in SparkR, which are documented together for doc deduplication. That part I did not change, but the `zero` vs `initialValue` mismatch looks unnecessary (see the sketch after this list).
- Such name matching already seems pretty common in SparkR.
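
Concretely (a hypothetical before/after for illustration; the pre-change keyword `zero` is inferred from the note above):

```r
# Before this follow-up (as merged in #27433):
#   array_aggregate("xs", zero = lit(0.0), merge = function(acc, x) acc + x)
# After this follow-up, matching the Scala signature:
#   array_aggregate("xs", initialValue = lit(0.0), merge = function(acc, x) acc + x)
```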

### Why are the changes needed?

To make the usage similar to the Scala side, and for consistency.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

GitHub Actions and Jenkins build will test it out.

Also, I manually tested:

```r
> df <- select(createDataFrame(data.frame(id = 1)), expr("CAST(array(1.0, 2.0, -3.0, -4.0) AS array<double>) xs"))
> collect(select(df, array_aggregate("xs", initialValue = lit(0.0), merge = function(x, y) otherwise(when(x > y, x), y))))
  aggregate(xs, 0.0, lambdafunction(CASE WHEN (x > y) THEN x ELSE y END, x, y), lambdafunction(id, id))
1                                                                                                     2
```
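
(For reference: the printed column name shows the R lambda compiled into a Catalyst `lambdafunction` expression; the trailing `lambdafunction(id, id)` is the implicit identity `finish` function, and the result `2` is the maximum obtained by folding the array elements from the `0.0` initial value.)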

Closes #31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jan 18, 2021
…e in higher order functions

Closes #31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b5bdbf2)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…e in higher order functions

Closes apache#31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>