
[SPARK-30682][SPARKR][SQL] Add SparkR interface for higher order functions #27433

Closed
zero323 wants to merge 6 commits into apache:master from zero323:SPARK-30682

Conversation

@zero323 (Member) commented Feb 2, 2020

What changes were proposed in this pull request?

This PR adds an R API for invoking the following higher order functions:

  • transform -> array_transform (to avoid conflict with base::transform).
  • exists -> array_exists (to avoid conflict with base::exists).
  • forall -> array_forall (no conflicts, renamed for consistency).
  • filter -> array_filter (to avoid conflict with stats::filter).
  • aggregate -> array_aggregate (to avoid conflict with stats::aggregate).
  • zip_with -> arrays_zip_with (no conflicts, renamed for consistency).
  • transform_keys
  • transform_values
  • map_filter
  • map_zip_with

The overall implementation follows the same pattern as proposed for PySpark (#27406) and reuses the objects supporting the Scala implementation (SPARK-27297).
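
As background, the R side of this pattern can inspect a user-supplied lambda with plain base R: `formals()` exposes the function's parameter list, from which lambda arity and variable names can be derived. A minimal illustrative sketch (not the SparkR internals verbatim):

```r
# Inspecting an R function the way a higher-order-function wrapper could:
fun <- function(x, y) x + y

parameters <- formals(fun)
nparameters <- length(parameters)

print(nparameters)        # 2 -> a two-argument lambda, e.g. for arrays_zip_with
print(names(parameters))  # "x" "y" -> natural lambda variable names
```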

Why are the changes needed?

Currently, higher order functions are available only through the SQL and Scala APIs, and from R they can be used only as SQL expression strings:

select(df, expr("transform(xs, x -> x + 1)"))

This is error-prone and hard to get right when complex logic is involved (when / otherwise, complex objects).

If this PR is accepted, the expression above could be rewritten simply as:

select(df, array_transform("xs", function(x) x + 1))
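
For illustration, a few more of the new wrappers in action (a hypothetical sketch; the toy data and the expected values in comments are assumptions, not output from this patch's test suite):

```r
# Assumes an active SparkR session. Build toy columns: an array `xs` and a map `m`.
df <- select(
  createDataFrame(data.frame(id = 1)),
  expr("array(1, 2, 3) AS xs"),
  expr("map('a', 1, 'b', 2) AS m")
)

collect(select(
  df,
  array_exists("xs", function(x) x > 2),   # expected: TRUE
  array_filter("xs", function(x) x > 1),   # expected: array(2, 3)
  map_filter("m", function(k, v) v > 1)    # expected: map('b', 2)
))
```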

Does this PR introduce any user-facing change?

No (but new user-facing functions are added).

How was this patch tested?

Added new unit tests.
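
The checks follow SparkR's existing testthat conventions; a minimal sketch of the kind of test added (data and expectations are illustrative, assuming an active Spark session):

```r
library(testthat)
# Assumes SparkR is attached and sparkR.session() has been called.

test_that("array_transform applies an R lambda to each array element", {
  df <- select(createDataFrame(data.frame(id = 1)), expr("array(1, 2, 3) AS xs"))
  result <- collect(select(df, array_transform("xs", function(x) x + 1)))
  expect_equal(result[[1]][[1]], c(2, 3, 4))
})
```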

@SparkQA commented Feb 2, 2020

Test build #117731 has finished for PR 27433 at commit 909bfa8.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2020

Test build #117743 has finished for PR 27433 at commit 40d7b23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 2, 2020

Test build #117747 has finished for PR 27433 at commit f918098.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Nice! cc @felixcheung and @shivaram.

@HyukjinKwon (Member)

cc @falaki too fyi

@SparkQA commented Feb 4, 2020

Test build #117796 has finished for PR 27433 at commit cefdc0a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 4, 2020

Test build #117801 has finished for PR 27433 at commit 6dcf2b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment

Looks fine. Let me merge in a few days after taking a final look, if other committers can't find time to review.

parameters <- formals(fun)
nparameters <- length(parameters)

stopifnot(
@HyukjinKwon (Member):

@zero323, can we remove this one too here for now? Let's discuss and figure out a better way in the next PR about this.

@zero323 (Member, Author):

Maybe to a lesser extent (the variation in argument types is smaller, but so is the amount of logic required), but overall the same as here: #27406 (comment).

@HyukjinKwon (Member):

Yup, let's talk in #27406 (comment)

@SparkQA commented Feb 6, 2020

Test build #117987 has finished for PR 27433 at commit c68d58f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) commented Feb 8, 2020 via email

@SparkQA commented Feb 8, 2020

Test build #118074 has finished for PR 27433 at commit 6672c14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 8, 2020

Test build #118078 has finished for PR 27433 at commit 23f2c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@HyukjinKwon (Member)

Okay, let's merge this in. I will take a separate look for a followup if it's needed.

@SparkQA commented Feb 25, 2020

Test build #118891 has finished for PR 27433 at commit 23f2c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.

@zero323 (Member, Author) commented Feb 28, 2020

Thanks a bunch for your support @HyukjinKwon @felixcheung!

@zero323 deleted the SPARK-30682 branch February 28, 2020 10:52
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…tions

Closes apache#27433 from zero323/SPARK-30682.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jan 18, 2021
…e in higher order functions

### What changes were proposed in this pull request?

This PR is a followup of #27433. It fixes the naming to match the Scala side, similar to #31062.

Note that:

- There is already a bit of inconsistency, e.g. `x` and `y` in SparkR, which are documented together for doc deduplication. That part I did not change, but the `zero` vs `initialValue` mismatch looks unnecessary (see the sketch after this list).
- Such name matching already seems pretty common in SparkR.
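
Concretely (a hypothetical before/after for illustration; the pre-change keyword `zero` is inferred from the note above):

```r
# Before this follow-up (as merged in #27433):
#   array_aggregate("xs", zero = lit(0.0), merge = function(acc, x) acc + x)
# After this follow-up, matching the Scala signature:
#   array_aggregate("xs", initialValue = lit(0.0), merge = function(acc, x) acc + x)
```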

### Why are the changes needed?

To make the usage similar to the Scala side, and for consistency.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

GitHub Actions and Jenkins build will test it out.

Also, I manually tested:

```r
> df <- select(createDataFrame(data.frame(id = 1)), expr("CAST(array(1.0, 2.0, -3.0, -4.0) AS array<double>) xs"))
> collect(select(df, array_aggregate("xs", initialValue = lit(0.0), merge = function(x, y) otherwise(when(x > y, x), y))))
  aggregate(xs, 0.0, lambdafunction(CASE WHEN (x > y) THEN x ELSE y END, x, y), lambdafunction(id, id))
1                                                                                                     2
```
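
(For reference: the printed column name shows the R lambda compiled into a Catalyst `lambdafunction` expression; the trailing `lambdafunction(id, id)` is the implicit identity `finish` function, and the result `2` is the maximum obtained by folding the array elements from the `0.0` initial value.)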

Closes #31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Jan 18, 2021
…e in higher order functions

Closes #31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b5bdbf2)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…e in higher order functions

Closes apache#31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>