…ffer for 'Struct Star Expansion' test

### What changes were proposed in this pull request?

This PR is a minor follow-up of #37973 to match either `ArrayBuffer` (Scala 2.12) or `List` (Scala 2.13) in the "Struct Star Expansion" test case in `SQLQuerySuite.scala`.

### Why are the changes needed?

Currently, the test fails with Scala 2.13 (https://github.com/apache/spark/actions/runs/3146198025/jobs/5114388079):

```
[info] - Struct Star Expansion *** FAILED *** (1 second, 801 milliseconds)
[info]   Map("attributes" -> "List(a)") did not equal Map("attributes" -> "ArrayBuffer(a)") (SparkFunSuite.scala:328)
[info]   Analysis:
[info]   JavaCollectionWrappers$JMapWrapper(attributes: List(a) -> ArrayBuffer(a))
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.SparkFunSuite.checkError(SparkFunSuite.scala:328)
[info]   at org.apache.spark.SparkFunSuite.checkError(SparkFunSuite.scala:369)
```

We should also cover the Scala 2.13 case.
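
For context, the mismatch comes from the error message embedding a collection's `toString`: the sequence pretty-prints as `ArrayBuffer` on Scala 2.12 but as `List` on Scala 2.13. A minimal sketch of the fix idea (pattern and values illustrative) is to match the expected parameter against a regex accepting both renderings:

```scala
// Illustrative sketch: accept either Scala 2.12's ArrayBuffer or
// Scala 2.13's List rendering in the expected error-message parameter.
val expectedAttributes = "(ArrayBuffer|List)\\(a\\)"

assert("ArrayBuffer(a)".matches(expectedAttributes)) // Scala 2.12
assert("List(a)".matches(expectedAttributes))        // Scala 2.13
```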

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested:

```scala
scala> "List(a)".matches("(ArrayBuffer|List)\\(a\\)")
res0: Boolean = true

scala> "ArrayBuffer(a)".matches("(ArrayBuffer|List)\\(a\\)")
res1: Boolean = true
```

Closes #38045 from HyukjinKwon/SPARK-40540-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…nfigure test log output

### What changes were proposed in this pull request?
This PR makes the `connect` module use `log4j2.properties` to configure test log output, as the other modules do.

### Why are the changes needed?
Tests should use `log4j2.properties` to configure log output, consistent with the other modules.
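
For illustration, a minimal test `log4j2.properties` might look like the sketch below (appender names and the file location are assumptions based on the layout other Spark modules use):

```
# Minimal sketch of a test logging config (names illustrative).
rootLogger.level = info
rootLogger.appenderRef.file.ref = File

# Write test logs to a file under target/ instead of the console.
appender.file.type = File
appender.file.name = File
appender.file.fileName = target/unit-tests.log
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
```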

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passed GitHub Actions.

Closes #38041 from LuciferYang/connect-log.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
### What changes were proposed in this pull request?

Bump kubernetes-client version from 5.12.3 to 6.1.1 and clean up all the deprecations.

### Why are the changes needed?

To keep up with kubernetes-client [changes](fabric8io/kubernetes-client@v5.12.3...v6.1.1).
As this is a major-version upgrade, I have also cleaned up all the deprecations.
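
As one example of the kind of deprecation this removes (a sketch, not necessarily the exact code this PR touches): kubernetes-client 6.x replaces the deprecated `DefaultKubernetesClient` constructor with a builder API.

```scala
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder}

// kubernetes-client 5.x (deprecated in 6.x):
//   val client: KubernetesClient = new DefaultKubernetesClient()

// kubernetes-client 6.x builder API:
val client: KubernetesClient = new KubernetesClientBuilder().build()

// The resource DSL keeps the same shape, e.g. listing pods in a namespace:
val pods = client.pods().inNamespace("bla").list()
```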

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

#### Unit tests

#### Manual tests for submit and application management

Started an application in a non-default namespace (`bla`):

```
➜  spark git:(SPARK-40458) ✗ ./bin/spark-submit \
    --master k8s://http://127.0.0.1:8001 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.namespace=bla \
    --conf spark.kubernetes.container.image=docker.io/kubespark/spark:3.4.0-SNAPSHOT_064A99CC-57AF-46D5-B743-5B12692C260D \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar 200000
```

Check that we cannot find it from the default namespace, even with a glob, when no namespace is specified:

```
➜  spark git:(SPARK-40458) ✗ minikube kubectl -- config set-context --current --namespace=default
Context "minikube" modified.
➜  spark git:(SPARK-40458) ✗ ./bin/spark-submit --status "spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request for the status of submission spark-pi-* in k8s://http://127.0.0.1:8001.
No applications found.
```

Then check that we can find it by specifying the namespace:
```
➜  spark git:(SPARK-40458) ✗ ./bin/spark-submit --status "bla:spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request for the status of submission bla:spark-pi-* in k8s://http://127.0.0.1:8001.
Application status (driver):
         pod name: spark-pi-4c4e70837c86ae1a-driver
         namespace: bla
         labels: spark-app-name -> spark-pi, spark-app-selector -> spark-c95a9a0888214c01a286eb7ba23980a0, spark-role -> driver, spark-version -> 3.4.0-SNAPSHOT
         pod uid: 0be8952e-3e00-47a3-9082-9cb45278ed6d
         creation time: 2022-09-27T01:19:06Z
         service account name: default
         volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-wxnqw
         node name: minikube
         start time: 2022-09-27T01:19:06Z
         phase: Running
         container status:
                 container name: spark-kubernetes-driver
                 container image: kubespark/spark:3.4.0-SNAPSHOT_064A99CC-57AF-46D5-B743-5B12692C260D
                 container state: running
                 container started at: 2022-09-27T01:19:07Z
```

Changing the namespace to `bla` with `kubectl`:

```
➜  spark git:(SPARK-40458) ✗  minikube kubectl -- config set-context --current --namespace=bla
Context "minikube" modified.
```

Checking that we can find it without specifying the namespace (still using a glob):
```
➜  spark git:(SPARK-40458) ✗  ./bin/spark-submit --status "spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request for the status of submission spark-pi-* in k8s://http://127.0.0.1:8001.
Application status (driver):
         pod name: spark-pi-4c4e70837c86ae1a-driver
         namespace: bla
         labels: spark-app-name -> spark-pi, spark-app-selector -> spark-c95a9a0888214c01a286eb7ba23980a0, spark-role -> driver, spark-version -> 3.4.0-SNAPSHOT
         pod uid: 0be8952e-3e00-47a3-9082-9cb45278ed6d
         creation time: 2022-09-27T01:19:06Z
         service account name: default
         volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-wxnqw
         node name: minikube
         start time: 2022-09-27T01:19:06Z
         phase: Running
         container status:
                 container name: spark-kubernetes-driver
                 container image: kubespark/spark:3.4.0-SNAPSHOT_064A99CC-57AF-46D5-B743-5B12692C260D
                 container state: running
                 container started at: 2022-09-27T01:19:07Z
```

Killing the app:
```
➜  spark git:(SPARK-40458) ✗  ./bin/spark-submit --kill "spark-pi-*" --master k8s://http://127.0.0.1:8001
Submitting a request to kill submission spark-pi-* in k8s://http://127.0.0.1:8001. Grace period in secs: not set.
Deleting driver pod: spark-pi-4c4e70837c86ae1a-driver.
```

Closes #37990 from attilapiros/SPARK-40458.

Authored-by: attilapiros <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR upgrades sbt-protoc to 1.0.6.
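
The change presumably amounts to bumping the plugin coordinate in the sbt build, along these lines (file path assumed):

```scala
// project/plugins.sbt (sketch): bump the sbt-protoc plugin version.
addSbtPlugin("com.thesamet" % "sbt-protoc" % "1.0.6")
```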

### Why are the changes needed?

sbt-protoc 1.0.1 does not exist.
<img width="993" alt="image" src="https://user-images.githubusercontent.com/5399861/193038436-0096b263-0c2d-4f7d-8fac-9b3f97089490.png">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #38049 from wangyum/sbt-protoc.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
pull bot merged commit bfad44e into wangyum:master Sep 29, 2022
pull bot pushed a commit that referenced this pull request Dec 23, 2022
…n Aggregate

### What changes were proposed in this pull request?

This PR implements the implicit lateral column alias for the `Aggregate` case. For example:
```sql
-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, avg(salary) AS avg_salary, avg_salary + avg(bonus)
FROM employee
GROUP BY dept
```

The high-level implementation idea is to insert a `Project` node above the `Aggregate` and fall back to the lateral alias resolution of the `Project` code path introduced in the last PR:

* Phase 1: recognize resolved lateral aliases and wrap the attributes referencing them with `LateralColumnAliasReference`
* Phase 2: when the `Aggregate` operator is resolved, go through the whole aggregation list, extract the aggregation expressions and grouping expressions to keep in this `Aggregate` node, and add a `Project` above with the original output. This phase does nothing with `LateralColumnAliasReference`; it leaves them entirely to the `Project` in future turns of this rule.

Example:
```
 // Before rewrite:
 Aggregate [dept#14] [dept#14 AS a#12, 'a + 1, avg(salary#16) AS b#13, 'b + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 1:
 Aggregate [dept#14] [dept#14 AS a#12, lca(a) + 1, avg(salary#16) AS b#13, lca(b) + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 2:
 Project [dept#14 AS a#12, lca(a) + 1, avg(salary)#26 AS b#13, lca(b) + avg(bonus)#27]
 +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
     +- Child [dept#14,name#15,salary#16,bonus#17]

 // Now the problem falls back to the lateral alias resolution in Project.
 // After future rounds of this rule:
 Project [a#12, a#12 + 1, b#13, b#13 + avg(bonus)#27]
 +- Project [dept#14 AS a#12, avg(salary)#26 AS b#13]
    +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
       +- Child [dept#14,name#15,salary#16,bonus#17]
```

As in the last PR (apache#38776), because a lateral column alias has higher resolution priority than an outer reference, the rule tries to resolve an `OuterReference` using lateral column aliases, just as it does for an `UnresolvedAttribute`. On success, it strips the `OuterReference` and wraps the result with `LateralColumnAliasReference`.

### Why are the changes needed?
Same motivation as stated in apache#38776.

### Does this PR introduce _any_ user-facing change?

Yes; as shown in the example above, lateral column aliases can now be resolved in `Aggregate`.

### How was this patch tested?

Existing tests and newly added tests.

Closes apache#39040 from anchovYu/SPARK-27561-agg.

Authored-by: Xinyi Yu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request Apr 22, 2023
…onnect

### What changes were proposed in this pull request?
Implement Arrow-optimized Python UDFs in Spark Connect.

Please see apache#39384 for the motivation and performance improvements of Arrow-optimized Python UDFs.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. In Spark Connect Python Client, users can:

1. Set the `useArrow` parameter to `True` to enable Arrow optimization for a specific Python UDF.

```python
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

2. Enable the `spark.sql.execution.pythonUDF.arrow.enabled` Spark conf to make all Python UDFs Arrow-optimized.

```python
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)

```

### How was this patch tested?
Parity unit tests.

Closes apache#40725 from xinrong-meng/connect_arrow_py_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>