…ffer for 'Struct Star Expansion' test

### What changes were proposed in this pull request?

This PR is a minor follow-up of #37973 to match either `ArrayBuffer` (Scala 2.12) or `List` (Scala 2.13) in the "Struct Star Expansion" test case in `SQLQuerySuite.scala`.

### Why are the changes needed?

Currently, the test fails with Scala 2.13 (https://github.com/apache/spark/actions/runs/3146198025/jobs/5114388079):

```
[info] - Struct Star Expansion *** FAILED *** (1 second, 801 milliseconds)
[info]   Map("attributes" -> "List(a)") did not equal Map("attributes" -> "ArrayBuffer(a)") (SparkFunSuite.scala:328)
[info]   Analysis:
[info]   JavaCollectionWrappers$JMapWrapper(attributes: List(a) -> ArrayBuffer(a))
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.SparkFunSuite.checkError(SparkFunSuite.scala:328)
[info]   at org.apache.spark.SparkFunSuite.checkError(SparkFunSuite.scala:369)
```

We should also cover the Scala 2.13 case.
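
For context, the mismatch comes from the error message embedding a collection's `toString`: the sequence pretty-prints as `ArrayBuffer` on Scala 2.12 but as `List` on Scala 2.13. A minimal sketch of the fix idea (pattern and values illustrative) is to match the expected parameter against a regex accepting both renderings:

```scala
// Illustrative sketch: accept either Scala 2.12's ArrayBuffer or
// Scala 2.13's List rendering in the expected error-message parameter.
val expectedAttributes = "(ArrayBuffer|List)\\(a\\)"

assert("ArrayBuffer(a)".matches(expectedAttributes)) // Scala 2.12
assert("List(a)".matches(expectedAttributes))        // Scala 2.13
```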

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested:

```scala
scala> "List(a)".matches("(ArrayBuffer|List)\\(a\\)")
res0: Boolean = true

scala> "ArrayBuffer(a)".matches("(ArrayBuffer|List)\\(a\\)")
res1: Boolean = true
```

Closes #38045 from HyukjinKwon/SPARK-40540-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…nfigure test log output

### What changes were proposed in this pull request?
This PR makes the `connect` module use `log4j2.properties` to configure test log output, as the other modules do.

### Why are the changes needed?
Tests should use `log4j2.properties` to configure log output, consistent with the other modules.
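
For illustration, a minimal test `log4j2.properties` might look like the sketch below (appender names and the file location are assumptions based on the layout other Spark modules use):

```
# Minimal sketch of a test logging config (names illustrative).
rootLogger.level = info
rootLogger.appenderRef.file.ref = File

# Write test logs to a file under target/ instead of the console.
appender.file.type = File
appender.file.name = File
appender.file.fileName = target/unit-tests.log
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
```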

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passed GitHub Actions.

Closes #38041 from LuciferYang/connect-log.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
### What changes were proposed in this pull request?

Bump kubernetes-client version from 5.12.3 to 6.1.1 and clean up all the deprecations.

### Why are the changes needed?

To keep up with kubernetes-client [changes](fabric8io/kubernetes-client@v5.12.3...v6.1.1).
As this is a major-version upgrade, I have also cleaned up all the deprecations.
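
As one example of the kind of deprecation this removes (a sketch, not necessarily the exact code this PR touches): kubernetes-client 6.x replaces the deprecated `DefaultKubernetesClient` constructor with a builder API.

```scala
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder}

// kubernetes-client 5.x (deprecated in 6.x):
//   val client: KubernetesClient = new DefaultKubernetesClient()

// kubernetes-client 6.x builder API:
val client: KubernetesClient = new KubernetesClientBuilder().build()

// The resource DSL keeps the same shape, e.g. listing pods in a namespace:
val pods = client.pods().inNamespace("bla").list()
```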

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

#### Unit tests

#### Manual tests for submit and application management

Started an application in a non-default namespace (`bla`):

```
➜  spark git:(SPARK-40458) ✗ ./bin/spark-submit \
    --master k8s://http://127.0.0.1:8001 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.namespace=bla \
    --conf spark.kubernetes.container.image=docker.io/kubespark/spark:3.4.0-SNAPSHOT_064A99CC-57AF-46D5-B743-5B12692C260D \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar 200000
```

Check that we cannot find it from the default namespace, even with a glob, when no namespace is specified:

```
➜  spark git:(SPARK-40458) ✗ minikube kubectl -- config set-context --current --namespace=default
Context "minikube" modified.
➜  spark git:(SPARK-40458) ✗ ./bin/spark-submit --status "spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request for the status of submission spark-pi-* in k8s://http://127.0.0.1:8001.
No applications found.
```

Then check that we can find it by specifying the namespace:
```
➜  spark git:(SPARK-40458) ✗ ./bin/spark-submit --status "bla:spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request for the status of submission bla:spark-pi-* in k8s://http://127.0.0.1:8001.
Application status (driver):
         pod name: spark-pi-4c4e70837c86ae1a-driver
         namespace: bla
         labels: spark-app-name -> spark-pi, spark-app-selector -> spark-c95a9a0888214c01a286eb7ba23980a0, spark-role -> driver, spark-version -> 3.4.0-SNAPSHOT
         pod uid: 0be8952e-3e00-47a3-9082-9cb45278ed6d
         creation time: 2022-09-27T01:19:06Z
         service account name: default
         volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-wxnqw
         node name: minikube
         start time: 2022-09-27T01:19:06Z
         phase: Running
         container status:
                 container name: spark-kubernetes-driver
                 container image: kubespark/spark:3.4.0-SNAPSHOT_064A99CC-57AF-46D5-B743-5B12692C260D
                 container state: running
                 container started at: 2022-09-27T01:19:07Z
```

Changing the namespace to `bla` with `kubectl`:

```
➜  spark git:(SPARK-40458) ✗  minikube kubectl -- config set-context --current --namespace=bla
Context "minikube" modified.
```

Checking that we can find it without specifying the namespace (still using a glob):
```
➜  spark git:(SPARK-40458) ✗  ./bin/spark-submit --status "spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request for the status of submission spark-pi-* in k8s://http://127.0.0.1:8001.
Application status (driver):
         pod name: spark-pi-4c4e70837c86ae1a-driver
         namespace: bla
         labels: spark-app-name -> spark-pi, spark-app-selector -> spark-c95a9a0888214c01a286eb7ba23980a0, spark-role -> driver, spark-version -> 3.4.0-SNAPSHOT
         pod uid: 0be8952e-3e00-47a3-9082-9cb45278ed6d
         creation time: 2022-09-27T01:19:06Z
         service account name: default
         volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-wxnqw
         node name: minikube
         start time: 2022-09-27T01:19:06Z
         phase: Running
         container status:
                 container name: spark-kubernetes-driver
                 container image: kubespark/spark:3.4.0-SNAPSHOT_064A99CC-57AF-46D5-B743-5B12692C260D
                 container state: running
                 container started at: 2022-09-27T01:19:07Z
```

Killing the app:
```
➜  spark git:(SPARK-40458) ✗  ./bin/spark-submit --kill "spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request to kill submission spark-pi-* in k8s://http://127.0.0.1:8001. Grace period in secs: not set.
Deleting driver pod: spark-pi-4c4e70837c86ae1a-driver.
```

Closes #37990 from attilapiros/SPARK-40458.

Authored-by: attilapiros <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR upgrades sbt-protoc to 1.0.6.
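
The change presumably amounts to bumping the plugin coordinate in the sbt build, along these lines (file path assumed):

```scala
// project/plugins.sbt (sketch): bump the sbt-protoc plugin version.
addSbtPlugin("com.thesamet" % "sbt-protoc" % "1.0.6")
```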

### Why are the changes needed?

sbt-protoc 1.0.1 does not exist.
<img width="993" alt="image" src="https://user-images.githubusercontent.com/5399861/193038436-0096b263-0c2d-4f7d-8fac-9b3f97089490.png">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #38049 from wangyum/sbt-protoc.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
pull bot merged commit bfad44e into wangyum:master Sep 29, 2022
pull bot pushed a commit that referenced this pull request Dec 23, 2022
…n Aggregate

### What changes were proposed in this pull request?

This PR implements the implicit lateral column alias for the `Aggregate` case. For example:
```sql
-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, avg(salary) AS avg_salary, avg_salary + avg(bonus)
FROM employee
GROUP BY dept
```

The high-level implementation idea is to insert a `Project` node above the `Aggregate` and fall back to the lateral alias resolution of the `Project` code path introduced in the last PR:

* Phase 1: recognize resolved lateral aliases and wrap the attributes referencing them with `LateralColumnAliasReference`
* Phase 2: when the `Aggregate` operator is resolved, go through the whole aggregation list, extract the aggregation expressions and grouping expressions to keep in this `Aggregate` node, and add a `Project` above with the original output. This phase does nothing with `LateralColumnAliasReference`; it leaves them entirely to the `Project` in future turns of this rule.

Example:
```
 // Before rewrite:
 Aggregate [dept#14] [dept#14 AS a#12, 'a + 1, avg(salary#16) AS b#13, 'b + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 1:
 Aggregate [dept#14] [dept#14 AS a#12, lca(a) + 1, avg(salary#16) AS b#13, lca(b) + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 2:
 Project [dept#14 AS a#12, lca(a) + 1, avg(salary)#26 AS b#13, lca(b) + avg(bonus)#27]
 +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
     +- Child [dept#14,name#15,salary#16,bonus#17]

 // Now the problem falls back to the lateral alias resolution in Project.
 // After future rounds of this rule:
 Project [a#12, a#12 + 1, b#13, b#13 + avg(bonus)#27]
 +- Project [dept#14 AS a#12, avg(salary)#26 AS b#13]
    +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
       +- Child [dept#14,name#15,salary#16,bonus#17]
```

As in the last PR (apache#38776), because a lateral column alias has higher resolution priority than an outer reference, the rule tries to resolve an `OuterReference` using lateral column aliases, just as it does for an `UnresolvedAttribute`. On success, it strips the `OuterReference` and wraps the result with `LateralColumnAliasReference`.

### Why are the changes needed?
Same motivation as stated in apache#38776.

### Does this PR introduce _any_ user-facing change?

Yes; as shown in the example above, lateral column aliases can now be resolved in `Aggregate`.

### How was this patch tested?

Existing tests and newly added tests.

Closes apache#39040 from anchovYu/SPARK-27561-agg.

Authored-by: Xinyi Yu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request Apr 22, 2023
…onnect

### What changes were proposed in this pull request?
Implement Arrow-optimized Python UDFs in Spark Connect.

Please see apache#39384 for the motivation and performance improvements of Arrow-optimized Python UDFs.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. In Spark Connect Python Client, users can:

1. Set the `useArrow` parameter to `True` to enable Arrow optimization for a specific Python UDF.

```python
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

2. Enable the `spark.sql.execution.pythonUDF.arrow.enabled` Spark conf to make all Python UDFs Arrow-optimized.

```python
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)

```

### How was this patch tested?
Parity unit tests.

Closes apache#40725 from xinrong-meng/connect_arrow_py_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>