forked from apache/spark
[pull] master from apache:master #8
Merged
Conversation
…ect * ### What changes were proposed in this pull request? 1. Sync newest proto to python client. 2. Update Aggregate to match proto change. 3. Change `select` to have it accept both `column` and `str` 4. Make sure `*` pass through the entire path which has been implemented on the server side #38023 ### Why are the changes needed? Update python client side to match the change in connect proto. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #38218 from amaliujia/select_start_in_python. Authored-by: Rui Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Remove the unused apt lists cache.
### Why are the changes needed?
Clean the cache to reduce the Docker image size. This is also [recommended](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run) by the Docker community:
```
$ docker run --user 0:0 -ti apache/spark bash
root@5d1ca347279e:/opt/spark/work-dir# ls /var/lib/apt/lists/
auxfiles  lock
deb.debian.org_debian_dists_bullseye-updates_InRelease  partial
deb.debian.org_debian_dists_bullseye-updates_main_binary-arm64_Packages.lz4  security.debian.org_debian-security_dists_bullseye-security_InRelease
deb.debian.org_debian_dists_bullseye_InRelease  security.debian.org_debian-security_dists_bullseye-security_main_binary-arm64_Packages.lz4
deb.debian.org_debian_dists_bullseye_main_binary-arm64_Packages.lz4
root@5d1ca347279e:/opt/spark/work-dir# du --max-depth=1 -h /var/lib/apt/lists/
4.0K    /var/lib/apt/lists/partial
4.0K    /var/lib/apt/lists/auxfiles
17M     /var/lib/apt/lists/
```
### Does this PR introduce _any_ user-facing change?
Yes, to some extent: the image size is reduced.
### How was this patch tested?
K8s CI passed
Closes #38298 from Yikun/SPARK-40513. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR proposes to flip the default value of the Kafka offset fetching config (`spark.sql.streaming.kafka.useDeprecatedOffsetFetching`) from `true` to `false`, which enables AdminClient-based offset fetching by default.
### Why are the changes needed?
We had encountered several production issues with the old offset fetching (e.g. hangs, issues with Kafka consumer group rebalance) which could be mitigated with the new offset fetching. Despite the breaking change on the ACL, there is no need for most users to suffer from the old way. The discussion went through the dev mailing list: https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c
### Does this PR introduce _any_ user-facing change?
Yes, especially for users who rely on Kafka ACLs based on consumer groups. They need to either adjust the ACL to a topic-based one, or set `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` to `true` to use the old approach.
### How was this patch tested?
Existing UTs.
Closes #38306 from HeartSaVioR/SPARK-40844. Authored-by: Jungtaek Lim <[email protected]> Signed-off-by: Jungtaek Lim <[email protected]>
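For users who still depend on consumer-group-scoped ACLs, a minimal sketch of opting back into the deprecated fetching; the bootstrap servers and topic below are placeholders, not part of this PR:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Opt back into the deprecated consumer-group based offset fetching;
# with this change the default becomes AdminClient-based fetching.
spark.conf.set("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "true")

# Hypothetical streaming read (requires the spark-sql-kafka package).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)
```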
…first
### What changes were proposed in this pull request?
This PR aims to pin the Java versions used by GA so that the GA tests pass first:
- Java 8 pinned to 8.0.345
- Java 11 pinned to 11.0.16
- Java 17 pinned to 17.0.4

This change should be reverted after the root cause is found.
### Why are the changes needed?
Make GA pass first. The following tests failed with 8u352/11.0.17/17.0.5:
```
[info] *** 12 TESTS FAILED ***
[error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5
[error] Failed tests:
[error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite
[error] org.apache.spark.sql.catalyst.util.TimestampFormatterSuite
[error] org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite
[error] org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite
[error] org.apache.spark.sql.catalyst.expressions.TryCastSuite
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
Closes #38311 from LuciferYang/java-version. Authored-by: yangjie01 <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…nnect ### What changes were proposed in this pull request? Add initial Read API for Spark Connect that allows setting schema, format, option and path, and then to read files into DataFrame. ### Why are the changes needed? PySpark readwriter API parity for Spark Connect ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #38086 from amaliujia/SPARK-40539. Authored-by: Rui Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
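The Connect client is meant to mirror the familiar PySpark reader chain; a rough usage sketch of the surface described above (the path, schema, and option values are made up, and the Connect client API at this stage may differ in detail):
```python
# Assumes `spark` is an existing session (e.g. from a PySpark shell or a
# Spark Connect client session); the path and schema are placeholders.
df = (
    spark.read
    .format("parquet")
    .schema("id LONG, name STRING")
    .option("mergeSchema", "true")
    .load("/tmp/example_data")
)
df.show()
```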
…2 and fix ps.mlflow doctest
### What changes were proposed in this pull request?
Upgrade the infra base image to focal-20220922 and fix the ps.mlflow doctest.
### Why are the changes needed?
- Upgrade the infra base image to `focal-20220922` (currently the latest Ubuntu 20.04).
- Infra image Python packages updated:
  - numpy 1.23.3 --> 1.23.4
  - mlflow 1.28.0 --> 1.29.0
  - matplotlib 3.5.3 --> 3.6.1
  - pip 22.2.2 --> 22.3
  - scipy 1.9.1 --> 1.9.3
  - Full list: https://www.diffchecker.com/e6eZZaYn
- Fix the ps.mlflow doctest (due to the mlflow upgrade):
```
**********************************************************************
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 158, in pyspark.pandas.mlflow.load_model
Failed example:
    with mlflow.start_run():
        lr = LinearRegression()
        lr.fit(train_x, train_y)
        mlflow.sklearn.log_model(lr, "model")
Expected:
    LinearRegression(...)
Got:
    LinearRegression()
    <mlflow.models.model.ModelInfo object at 0x7fef9578deb0>
```
### Does this PR introduce _any_ user-facing change?
No, dev only
### How was this patch tested?
All CI passed
Closes #38304 from Yikun/SPARK-40838. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Yikun Jiang <[email protected]>
### What changes were proposed in this pull request?
Make HeartbeatReceiver an IsolatedRpcEndpoint so that it has a dedicated single thread to process heartbeats.
### Why are the changes needed?
All RpcEndpoints in the driver, including HeartbeatReceiver, share one thread pool. When there are lots of RPC messages queued, the time a heartbeat waits to be processed can easily exceed the heartbeat timeout, which generates lots of false positives.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested.
Closes #38231 from warrenzhu25/heartbeat. Authored-by: Warren Zhu <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? [`mypy-protobuf` 3.4.0](https://pypi.org/project/mypy-protobuf/#history) is just released, and will break master shortly ### Why are the changes needed? to make CI happy ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manually check Closes #38316 from zhengruifeng/infra_ping_mypy_protobuf. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
…wableSuite ### What changes were proposed in this pull request? A trivial change, this pr fix deprecated api usage in `SparkThrowableSuite` as follows: ``` [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:77: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.errorClassFileContents | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:104: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.errorClassReadMeContents | origin=org.apache.commons.io.IOUtils.toString | version=] method toString in class IOUtils is deprecated [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:325: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.$anonfun | origin=org.apache.commons.io.FileUtils.writeStringToFile | version=] method writeStringToFile in class FileUtils is deprecated [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:335: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.$anonfun.reader | origin=java.io.File.toURL | version=] method toURL in class File is deprecated [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:343: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.$anonfun | origin=org.apache.commons.io.FileUtils.writeStringToFile | version=] method writeStringToFile in class FileUtils is deprecated [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:354: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.$anonfun.e | origin=java.io.File.toURL | version=] method toURL in class File is deprecated [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:362: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.$anonfun | origin=org.apache.commons.io.FileUtils.writeStringToFile | version=] method writeStringToFile in class FileUtils is deprecated [WARNING] /spark-source/core/src/test/scala/org/apache/spark/SparkThrowableSuite.scala:380: [deprecation org.apache.spark.SparkThrowableSuite.<local SparkThrowableSuite>.$anonfun.e | origin=java.io.File.toURL | version=] method toURL in class File is deprecated ``` ### Why are the changes needed? Clean up deprecated api usage in `SparkThrowableSuite`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GitHub Actions. Closes #38305 from LuciferYang/SPARK-40843. Lead-authored-by: YangJie <[email protected]> Co-authored-by: yangjie01 <[email protected]> Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request? This fixes a corner-case regression caused by #36625. Users may have existing views that have invalid locations due to historical reasons. The location is actually useless for a view, but after #36625 , they start to fail to read the view as qualifying the location fails. We should just skip qualifying view locations. ### Why are the changes needed? avoid regression ### Does this PR introduce _any_ user-facing change? Spark can read view with invalid location again. ### How was this patch tested? manually test. View with an invalid location is kind of "broken" and can't be dropped (HMS fails to drop it), so we can't write a UT for it. Closes #38321 from cloud-fan/follow. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
… Python client ### What changes were proposed in this pull request? Following up after #38275, improve limit and offset in Python client. ### Why are the changes needed? Improve API coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #38314 from amaliujia/python_test_limit_offset. Authored-by: Rui Wang <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
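A rough illustration of the two operators the client covers here; it assumes a hypothetical DataFrame `df` and uses the regular PySpark method names that the Connect client mirrors:
```python
# Hypothetical DataFrame: 100 rows with a single `id` column.
df = spark.range(100)

# Skip the first 5 rows, then keep the next 10.
subset = df.offset(5).limit(10)
subset.show()
```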
### What changes were proposed in this pull request?
Upgrade actions/checkout to v3 (pointing to v3.1 now).
### Why are the changes needed?
- https://github.com/actions/checkout/releases/tag/v3.1.0 cleans up "[The 'set-output' command is deprecated and will be disabled soon.](actions/checkout#959 (comment))"
- https://github.com/actions/checkout/releases/tag/v3.0.0 — since v3 the action runs on Node 16, which cleans up "[Node.js 12 actions are deprecated](https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/)"

According to actions/checkout#959 (comment), v2.5 also addresses the 'set-output' warning, but only v3 supports Node 16, so we upgrade to v3.1 rather than v2.5.
### Does this PR introduce _any_ user-facing change?
No, dev only
### How was this patch tested?
CI passed
Closes #38322 from Yikun/checkout-v3. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Yikun Jiang <[email protected]>
### What changes were proposed in this pull request?
Change `set-output` to `GITHUB_OUTPUT`.
### Why are the changes needed?
The `set-output` command is deprecated and will be disabled soon; GitHub recommends upgrading to environment files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
### Does this PR introduce _any_ user-facing change?
No, dev only
### How was this patch tested?
- CI passed
- Also did a local test on a benchmark run: https://github.com/Yikun/spark/actions/runs/3294384181/jobs/5431945626
Closes #38323 from Yikun/set-output. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Yikun Jiang <[email protected]>
…bled ### What changes were proposed in this pull request? This PR proposes the make the tests added in #38050 pass with ANSI mode enabled by avoiding string binary operations. ### Why are the changes needed? To make the tests pass with ANSI enabled on. Currently, it fails as below (https://github.com/apache/spark/actions/runs/3286184541/jobs/5414029918): ``` [info] - SPARK-40615: Check unsupported data type when decorrelating subqueries *** FAILED *** (118 milliseconds) [info] "[DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE] Cannot resolve "(a + a)" due to data type mismatch: the binary operator requires the input type ("NUMERIC" or "INTERVAL DAY TO SECOND" or "INTERVAL YEAR TO MONTH" or "INTERVAL"), not "STRING".; line 1 pos 15; [info] 'Project [unresolvedalias(scalar-subquery#426412 [], None)] [info] : +- 'Project [unresolvedalias((a#426411 + a#426411), None)] [info] : +- SubqueryAlias __auto_generated_subquery_name [info] : +- Project [upper(cast(outer(x#426413)[a] as string)) AS a#426411] [info] : +- OneRowRelation [info] +- SubqueryAlias v1 [info] +- View (`v1`, [x#426413]) [info] +- Project [cast(x#426414 as map<string,int>) AS x#426413] [info] +- SubqueryAlias t [info] +- LocalRelation [x#426414] [info] " did not contain "Correlated column reference 'v1.x' cannot be map type" (SubquerySuite.scala:2480) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.sql.SubquerySuite.$anonfun$new$320(SubquerySuite.scala:2480) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491) [info] at org.apache.spark.sql.test.SQLTestUtilsBase.withTempView(SQLTestUtils.scala:276) [info] at org.apache.spark.sql.test.SQLTestUtilsBase.withTempView$(SQLTestUtils.scala:274) [info] at org.apache.spark.sql.SubquerySuite.withTempView(SubquerySuite.scala:32) [info] at org.apache.spark.sql.SubquerySuite.$anonfun$new$319(SubquerySuite.scala:2459) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually ran the tests and verified that it passes. Closes #38325 from HyukjinKwon/SPARK-40615-followup. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…atest Java 8/11/17
### What changes were proposed in this pull request?
The main changes of this PR are as follows:
- Replace `Antarctica/Vostok` with `Asia/Urumqi` in Spark code
- Replace `Europe/Amsterdam` with `Europe/Brussels` in Spark code
- Regenerate `gregorian-julian-rebase-micros.json` using the "generate 'gregorian-julian-rebase-micros.json'" test in `RebaseDateTimeSuite` with Java 8u352
- Regenerate `julian-gregorian-rebase-micros.json` using the "generate 'julian-gregorian-rebase-micros.json'" test in `RebaseDateTimeSuite` with Java 8u352
- Revert the change of SPARK-40846
### Why are the changes needed?
Make GA run successfully with the latest Java 8/11/17:
- Java 8u352
- Java 11.0.17
- Java 17.0.5
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GitHub Actions
- Manual test: the following commands pass with Java 8u345, 8u352, 11.0.16, 11.0.17, 17.0.4 and 17.0.5
  - `build/sbt "catalyst/test"`
  - `build/sbt "sql/testOnly *SQLQueryTestSuite -- -t \"timestamp-ltz.sql\""`
  - `build/sbt "sql/testOnly *SQLQueryTestSuite -- -t \"timestamp-ntz.sql\""`
Closes #38317 from LuciferYang/SPARK-40851. Lead-authored-by: yangjie01 <[email protected]> Co-authored-by: YangJie <[email protected]> Signed-off-by: Max Gekk <[email protected]>
…onto error classes ### What changes were proposed in this pull request? In the PR, I propose to use error classes in the case of type check failure in Bloom Filter Agg expressions. ### Why are the changes needed? Migration onto error classes unifies Spark SQL error messages. ### Does this PR introduce _any_ user-facing change? Yes. The PR changes user-facing error messages. ### How was this patch tested? ``` build/sbt "sql/testOnly *SQLQueryTestSuite" build/sbt "test:testOnly org.apache.spark.SparkThrowableSuite" build/sbt "test:testOnly *BloomFilterAggregateQuerySuite" ``` Closes #38315 from lvshaokang/SPARK-40768. Authored-by: lvshaokang <[email protected]> Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?
This PR adds support for `Deduplicate` to the Connect proto and DSL. Note that `Deduplicate` cannot be replaced by SQL's `SELECT DISTINCT col_list`. The difference is that `Deduplicate` removes duplicate rows based on a set of columns while returning all the columns, whereas SQL's `SELECT DISTINCT col_list` can only return the `col_list`.
### Why are the changes needed?
1. To improve proto API coverage.
2. `Deduplicate` blocks #38166 because we want to support `Union(isAll=false)`, which will return `Union().Distinct()` to match the existing DataFrame API. `Deduplicate` is needed to write test cases for `Union(isAll=false)`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #38276 from amaliujia/supportDropDuplicates. Authored-by: Rui Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
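To make the semantic difference concrete, a small illustrative PySpark sketch (the data is made up; `dropDuplicates` is the DataFrame method whose semantics `Deduplicate` captures):
```python
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "value"])

# Deduplicate semantics: dedup on a subset of columns, keep all columns.
df.dropDuplicates(["id"]).show()   # 2 rows, both `id` and `value` are returned

# SELECT DISTINCT col_list semantics: only the listed columns come back.
df.select("id").distinct().show()  # 2 rows, only `id`
```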
### What changes were proposed in this pull request?
Adds support for compiled Java classes to Protobuf functions. This is tested with Protobuf v3 classes. V2 vs V3 issues will be handled in a separate PR. The main changes in this PR:
- Changes to top level API:
- Adds new version that takes just the class name.
- Changes the order of arguments for existing API with descriptor files (`messageName` and `descFilePath` are swapped).
- Protobuf utils methods to create descriptor from Java class name.
- Many unit tests are updated to check both versions: (1) with a descriptor file and (2) with a Java class name.
- Maven build updates to generate Java classes to use in tests.
- Miscellaneous changes:
- Adds `proto` to package name in `proto` files used in tests.
- A few TODO comments about improvements
### Why are the changes needed?
Compiled Java classes are a common way for users to provide Protobuf definitions.
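A rough, PySpark-flavored sketch of how the class-name variant is intended to be used; the message class name, column names, and the Python wrapper itself are assumptions (this PR changes the Scala/Java API), so the exact signature may differ from what finally lands:
```python
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

# Hypothetical compiled Protobuf message class available on the classpath.
msg_class = "org.example.protos.SimpleMessage"

# Serialize a struct column to Protobuf bytes, then parse it back,
# passing only the Java class name instead of a descriptor file.
proto_df = df.select(to_protobuf("event", msg_class).alias("payload"))
parsed_df = proto_df.select(from_protobuf("payload", msg_class).alias("event"))
```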
### Does this PR introduce _any_ user-facing change?
No.
This updates the interface, but only for a new feature that is in active development.
### How was this patch tested?
- Unit tests
Closes #38286 from rangadi/protobuf-java.
Authored-by: Raghu Angadi <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
### What changes were proposed in this pull request?
Implement `DataFrame.sample` in Connect
### Why are the changes needed?
for DataFrame API coverage
### Does this PR introduce _any_ user-facing change?
Yes, new API
```
def sample(
self,
fraction: float,
*,
withReplacement: bool = False,
seed: Optional[int] = None,
) -> "DataFrame":
```
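A quick usage sketch of the new method (assumes an existing Connect `DataFrame` named `df`; the fractions and seeds are arbitrary):
```python
# Bernoulli sample of roughly 10% of the rows, made deterministic via a seed.
sampled = df.sample(0.1, seed=42)

# withReplacement must be passed by keyword (note the `*` in the signature above).
sampled_wr = df.sample(0.5, withReplacement=True, seed=7)
```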
### How was this patch tested?
added UT
Closes #38310 from zhengruifeng/connect_df_sample.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…istent of lint-scala script as was
### What changes were proposed in this pull request?
This PR proposes to keep `dev/lint-scala` quiet, as it was before.
### Why are the changes needed?
To remove noisy output from the `dev/lint-scala` script.
**Before**
Success
```
Scalastyle checks passed.
Using `mvn` from path: /.../spark/build/apache-maven-3.8.6/bin/mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: osx
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.16
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 16
[INFO] os.detected.classifier: osx-x86_64
[INFO]
[INFO] ----------------< org.apache.spark:spark-connect_2.12 >-----------------
[INFO] Building Spark Project Connect 3.4.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) spark-connect_2.12 ---
[INFO] parsed config (v3.5.9): dev/.scalafmt.conf
[INFO] Scalafmt results: 0 of 11 were unformatted
Details:
Formatted: Connect.scala
Formatted: DataTypeProtoConverter.scala
Formatted: SparkConnectPlanner.scala
Formatted: SparkConnectPlugin.scala
Formatted: SparkConnectCommandPlanner.scala
Formatted: SparkConnectStreamHandler.scala
Formatted: SparkConnectService.scala
Formatted: package.scala
Formatted: SparkConnectProtoSuite.scala
Formatted: SparkConnectPlannerSuite.scala
Formatted: SparkConnectCommandPlannerSuite.scala
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.257 s
[INFO] Finished at: 2022-10-21T11:18:19+09:00
[INFO] ------------------------------------------------------------------------
```
Failure
```
Scalastyle checks passed.
Using `mvn` from path: /Users/hyukjin.kwon/workspace/forked/spark/build/apache-maven-3.8.6/bin/mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: osx
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.16
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 16
[INFO] os.detected.classifier: osx-x86_64
[INFO]
[INFO] ----------------< org.apache.spark:spark-connect_2.12 >-----------------
[INFO] Building Spark Project Connect 3.4.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) spark-connect_2.12 ---
[INFO] parsed config (v3.5.9): dev/.scalafmt.conf
[INFO] Scalafmt results: 0 of 11 were unformatted
Details:
Formatted: Connect.scala
Formatted: DataTypeProtoConverter.scala
Formatted: SparkConnectPlanner.scala
Formatted: SparkConnectPlugin.scala
Formatted: SparkConnectCommandPlanner.scala
Formatted: SparkConnectStreamHandler.scala
Formatted: SparkConnectService.scala
Formatted: package.scala
Formatted: SparkConnectProtoSuite.scala
Formatted: SparkConnectPlannerSuite.scala
Formatted: SparkConnectCommandPlannerSuite.scala
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.257 s
[INFO] Finished at: 2022-10-21T11:18:19+09:00
[INFO] ------------------------------------------------------------------------
(python3.9) ➜ spark git:(master) ./dev/lint-scala
Scalastyle checks passed.
Using `mvn` from path: /Users/hyukjin.kwon/workspace/forked/spark/build/apache-maven-3.8.6/bin/mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: osx
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.16
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 16
[INFO] os.detected.classifier: osx-x86_64
[INFO]
[INFO] ----------------< org.apache.spark:spark-connect_2.12 >-----------------
[INFO] Building Spark Project Connect 3.4.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) spark-connect_2.12 ---
[INFO] parsed config (v3.5.9): dev/.scalafmt.conf
[ERROR] unformatted file at: /Users/hyukjin.kwon/workspace/forked/spark/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
[INFO] Scalafmt results: 1 of 11 were unformatted
Details:
Formatted: Connect.scala
Formatted: DataTypeProtoConverter.scala
Requires formatting: SparkConnectPlanner.scala
Formatted: SparkConnectPlugin.scala
Formatted: SparkConnectCommandPlanner.scala
Formatted: SparkConnectStreamHandler.scala
Formatted: SparkConnectService.scala
Formatted: package.scala
Formatted: SparkConnectProtoSuite.scala
Formatted: SparkConnectPlannerSuite.scala
Formatted: SparkConnectCommandPlannerSuite.scala
[ERROR]
org.apache.maven.plugin.MojoExecutionException: Scalafmt: Unformatted files found
at org.antipathy.mvn_scalafmt.FormatMojo.execute (FormatMojo.java:91)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:370)
at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:351)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:171)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:163)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.223 s
[INFO] Finished at: 2022-10-21T11:19:58+09:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.antipathy:mvn-scalafmt_2.12:1.1.1640084764.9f463a9:format (default-cli) on project spark-connect_2.12: Error formatting Scala files: Scalafmt: Unformatted files found -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
The scalafmt check failed on connector/connect.
Before submitting your change, please make sure to format your code using the following command:
./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=fase -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect
```
**After**
Success
```
Scalastyle checks passed.
Scalafmt checks passed.
```
Failure
```
Scalastyle checks passed.
The scalafmt check failed on connector/connect at following occurrences:
Requires formatting: SparkConnectPlanner.scala
Before submitting your change, please make sure to format your code using the following command:
./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=fase -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect
```
In this way, the behavior is consistent with how it was before #38258.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested as described above.
Closes #38326 from HyukjinKwon/SPARK-40799-followup.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…terpolation
### What changes were proposed in this pull request?
Adds the missing `s` prefix to enable string interpolation. Complements #38297.
### Why are the changes needed?
Without the prefix, the strings contain the variable names instead of the substituted values.
### Does this PR introduce _any_ user-facing change?
Log messages will change.
### How was this patch tested?
Not tested.
Closes #38307 from EnricoMi/branch-fix-string-interpolation-2. Authored-by: Enrico Minack <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ollection ### What changes were proposed in this pull request? This patch provides a temporary implementation of result batches as JSON instead of the 'broken' CSV format that was simply generating unescaped CSV lines. In this implementation we actually leverage the existing Spark functionality to generate JSON and then convert this into result batches for Spark Connect. ### Why are the changes needed? Cleanup ### Does this PR introduce _any_ user-facing change? No / Experimental ### How was this patch tested? E2E tests for the Python Client. Closes #38300 from grundprinzip/spark-json. Lead-authored-by: Martin Grund <[email protected]> Co-authored-by: Martin Grund <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade Joda-Time to 2.12.0.
### Why are the changes needed?
This version includes:
- Added translations for ca, el, eu, fi, hi, hu, in, iw, ms, nn, ro, sk, sv, zh.
- DateTimeZone data updated to version 2022egtz.

The release notes are as follows:
- https://www.joda.org/joda-time/changes-report.html#a2.12.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
Closes #38329 from LuciferYang/joda-212. Authored-by: yangjie01 <[email protected]> Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade Dropwizard Metrics from 4.2.10 to 4.2.12.
### Why are the changes needed?
The release notes are as follows:
- https://github.com/dropwizard/metrics/releases/tag/v4.2.11
- https://github.com/dropwizard/metrics/releases/tag/v4.2.12
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
Closes #38328 from LuciferYang/metrics-4212. Authored-by: yangjie01 <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…files in Connect Python client
### What changes were proposed in this pull request?
1. Improve the README on how to install the `buf` dependency and then run the proto file generation script.
2. Improve the message of the out-of-sync check script.
### Why are the changes needed?
Improve the developer experience, both when touching Connect proto files and when the local branch's generated files are out of sync with the master branch.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing checks.
Closes #38335 from amaliujia/python_generate_buf_read_me. Authored-by: Rui Wang <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
1. Fix the code generation command.
2. Add the check command.
### Why are the changes needed?
Fix the wrong command.
### Does this PR introduce _any_ user-facing change?
Yes, users will use the correct command and know how to run the check.
### How was this patch tested?
Doc only.
Closes #38339 from zhengruifeng/connect_doc_nit. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
…fy workflow
### What changes were proposed in this pull request?
Upgrade actions/github-script from v3 to v6 and fix the notify workflow.
### Why are the changes needed?
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/.
- Since github-script v5, calls change from `github.*` to `github.rest.*`, but `request` and `paginate` are unchanged. See also https://github.com/actions/github-script#breaking-changes-in-v5
- Since github-script v6, the runtime is upgraded from Node 12 to Node 16.
### Does this PR introduce _any_ user-facing change?
No, dev only
### How was this patch tested?
Due to `pull_request_target`, the current PR is not in effect; we can only test locally: set the default branch to v6 and submit the PR Yikun#181.
Notify works well:
<img width="850" alt="image" src="https://user-images.githubusercontent.com/1736354/197310102-6c709716-8a99-422d-8d38-3f770b6925f0.png">
Update status is set to failed as expected:
<img width="898" alt="image" src="https://user-images.githubusercontent.com/1736354/197310119-30332769-0553-4ffa-816c-97a5ec0b3c27.png">
And "See test results" is set right: https://github.com/Yikun/spark/pull/181/checks?check_run_id=9029035780
Closes #38341 from Yikun/upgrade-actions. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…nabled
### What changes were proposed in this pull request?
This PR fixes a bug in broadcast handling in `PythonRunner` when encryption is enabled. Due to this bug the following PySpark script:
```
bin/pyspark --conf spark.io.encryption.enabled=true
...
bar = {"a": "aa", "b": "bb"}
foo = spark.sparkContext.broadcast(bar)
spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
```
fails with:
```
22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command
command = serializer._read_with_length(file)
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
return self.loads(obj)
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
EOFError: Ran out of input
```
The reason for this failure is that we have multiple Python UDFs referencing the same broadcast, and in the current code:
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L385-L420
the number of broadcasts (`cnt`) is correct (1) but the broadcast id is serialized 2 times from JVM to Python ruining the next item that Python expects from JVM side.
Please note that the example above works in Spark 3.3 without this fix. That is because #36121 in Spark 3.4 modified `ExpressionSet` and so `udfs` in `ExtractPythonUDFs`:
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala#L239-L242
changed from `Stream` to `Vector`. When `broadcastVars` (and so `idsAndFiles`) is a `Stream`, the example accidentally works, as the broadcast id is written to `dataOut` only once (`oldBids.add(id)` in `idsAndFiles.foreach` is called before the 2nd item is calculated in `broadcastVars.flatMap`). But that doesn't mean that #36121 introduced the regression, as `EncryptedPythonBroadcastServer` shouldn't serve the broadcast data 2 times (which it currently does, unnoticed), since that could break other cases where more than 1 broadcast is used in UDFs.
### Why are the changes needed?
To fix a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new UT.
Closes #38334 from peter-toth/SPARK-40874-fix-broadcasts-in-python-udf.
Authored-by: Peter Toth <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request? pin 'grpcio==1.48.1' 'protobuf==4.21.6' ### Why are the changes needed? to avoid connect-related conflict due to package upgrade ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? CI Closes #38338 from zhengruifeng/infra_pin_connect. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
Upgrade the Docker actions to clean up warnings.
### Why are the changes needed?
- docker/setup-qemu-action from v1 to v2:
  - https://github.com/docker/setup-qemu-action/releases/tag/v2.0.0: cleans up the Node 12 warning
  - https://github.com/docker/setup-qemu-action/releases/tag/v2.1.0: cleans up the set-output/save-state warning
- docker/setup-buildx-action from v1 to v2:
  - https://github.com/docker/setup-buildx-action/releases/tag/v2.0.0: cleans up the Node 12 warning
  - https://github.com/docker/setup-buildx-action/releases/tag/v2.1.0: cleans up the set-output/save-state warning
- docker/build-push-action from v2 to v3:
  - https://github.com/docker/build-push-action/releases/tag/v3.0.0: cleans up the Node 12 warning
  - https://github.com/docker/build-push-action/releases/tag/v3.2.0: cleans up the set-output/save-state warning
- docker/login-action from v1 to v2:
  - https://github.com/docker/login-action/releases/tag/v2.0.0: cleans up the Node 12 warning
  - https://github.com/docker/login-action/releases/tag/v2.1.0: cleans up the set-output/save-state warning
### Does this PR introduce _any_ user-facing change?
No, dev only
### How was this patch tested?
CI passed
Closes #38342 from Yikun/upgrade-docker. Authored-by: Yikun Jiang <[email protected]> Signed-off-by: Yikun Jiang <[email protected]>
pull bot pushed a commit that referenced this pull request on Feb 24, 2024
…n properly
### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly
### Why are the changes needed?
Bug fix for Spark Connect; it won't affect classic Spark SQL.
before this PR:
```
from pyspark.sql import functions as sf
spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")
df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")
join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)
join2.schema
```
fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```
That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect:
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
'[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2]
! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index)
! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false
Can not resolve 'id with plan 7
```
`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
+- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```
### Does this PR introduce _any_ user-facing change?
yes, bug fix
### How was this patch tested?
added ut
### Was this patch authored or co-authored using generative AI tooling?
ci
Closes apache#45214 from zhengruifeng/connect_fix_read_join.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
pull bot pushed a commit that referenced this pull request on Jan 28, 2025
This is a trivial change to replace the loop index type from `int` to `long`. Surprisingly, a microbenchmark shows a more-than-2x performance uplift.
Analysis
--------
The hot loop of the `arrayEquals` method is simplified below. The loop index `i` is defined as `int`; it is compared with `length`, which is a `long`, to determine whether the loop should end.
```
public static boolean arrayEquals(
Object leftBase, long leftOffset, Object rightBase, long rightOffset, final long length) {
......
int i = 0;
while (i <= length - 8) {
if (Platform.getLong(leftBase, leftOffset + i) !=
Platform.getLong(rightBase, rightOffset + i)) {
return false;
}
i += 8;
}
......
}
```
Strictly speaking, there's a code bug here. If `length` is greater than 2^31 + 8, this loop will never end because `i`, as a 32-bit integer, is at most 2^31 - 1. But the compiler must treat this behaviour as intentional and generate code that strictly matches the logic, which prevents it from generating optimal code.
Defining the loop index `i` as `long` corrects this issue. Besides more accurate code logic, the JIT is able to optimize this code much more aggressively. The microbenchmark shows that this trivial change improves performance significantly on both Arm and x86 platforms.
Benchmark
---------
Source code:
https://gist.github.com/cyb70289/258e261f388e22f47e4d961431786d1a
Result on Arm Neoverse N2:
```
Benchmark Mode Cnt Score Error Units
ArrayEqualsBenchmark.arrayEqualsInt avgt 10 674.313 ± 0.213 ns/op
ArrayEqualsBenchmark.arrayEqualsLong avgt 10 313.563 ± 2.338 ns/op
```
Result on Intel Cascade Lake:
```
Benchmark Mode Cnt Score Error Units
ArrayEqualsBenchmark.arrayEqualsInt avgt 10 1130.695 ± 0.168 ns/op
ArrayEqualsBenchmark.arrayEqualsLong avgt 10 461.979 ± 0.097 ns/op
```
Deep dive
---------
Diving down to the machine-code level, we can see why the gap is so big. Listed below is arm64 assembly generated by the OpenJDK 17 C2 compiler.
For `int i`, the machine code closely follows the source code, with no deep optimization. Safepoint polling is expensive in this short loop.
```
// jit c2 machine code snippet
0x0000ffff81ba8904: mov w15, wzr // int i = 0
0x0000ffff81ba8908: nop
0x0000ffff81ba890c: nop
loop:
0x0000ffff81ba8910: ldr x10, [x13, w15, sxtw] // Platform.getLong(leftBase, leftOffset + i)
0x0000ffff81ba8914: ldr x14, [x12, w15, sxtw] // Platform.getLong(rightBase, rightOffset + i)
0x0000ffff81ba8918: cmp x10, x14
0x0000ffff81ba891c: b.ne 0x0000ffff81ba899c // return false if not equal
0x0000ffff81ba8920: ldr x14, [x28, #848] // x14 -> safepoint
0x0000ffff81ba8924: add w15, w15, #0x8 // i += 8
0x0000ffff81ba8928: ldr wzr, [x14] // safepoint polling
0x0000ffff81ba892c: sxtw x10, w15 // extend i to long
0x0000ffff81ba8930: cmp x10, x11
0x0000ffff81ba8934: b.le 0x0000ffff81ba8910 // if (i <= length - 8) goto loop
```
For `long i`, the JIT is able to do much more aggressive optimization. E.g., the code snippet below unrolls the loop by a factor of four.
```
// jit c2 machine code snippet
unrolled_loop:
0x0000ffff91de6fe0: sxtw x10, w7
0x0000ffff91de6fe4: add x23, x22, x10
0x0000ffff91de6fe8: add x24, x21, x10
0x0000ffff91de6fec: ldr x13, [x23] // unroll-1
0x0000ffff91de6ff0: ldr x14, [x24]
0x0000ffff91de6ff4: cmp x13, x14
0x0000ffff91de6ff8: b.ne 0x0000ffff91de70a8
0x0000ffff91de6ffc: ldr x13, [x23, #8] // unroll-2
0x0000ffff91de7000: ldr x14, [x24, #8]
0x0000ffff91de7004: cmp x13, x14
0x0000ffff91de7008: b.ne 0x0000ffff91de70b4
0x0000ffff91de700c: ldr x13, [x23, #16] // unroll-3
0x0000ffff91de7010: ldr x14, [x24, #16]
0x0000ffff91de7014: cmp x13, x14
0x0000ffff91de7018: b.ne 0x0000ffff91de70a4
0x0000ffff91de701c: ldr x13, [x23, #24] // unroll-4
0x0000ffff91de7020: ldr x14, [x24, #24]
0x0000ffff91de7024: cmp x13, x14
0x0000ffff91de7028: b.ne 0x0000ffff91de70b0
0x0000ffff91de702c: add w7, w7, #0x20
0x0000ffff91de7030: cmp w7, w11
0x0000ffff91de7034: b.lt 0x0000ffff91de6fe0
```
### What changes were proposed in this pull request?
A trivial change to replace loop index `i` of method `arrayEquals` from `int` to `long`.
### Why are the changes needed?
To improve performance and fix a possible bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#49568 from cyb70289/arrayEquals.
Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
pull bot pushed a commit that referenced this pull request on Aug 19, 2025
…onicalized expressions
### What changes were proposed in this pull request?
Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example:
```
from pyspark.sql.functions import col, avg, udf
pythonUDF = udf(lambda x: x).asNondeterministic()
spark.range(10)\
.selectExpr("id", "id % 3 as value")\
.groupBy(pythonUDF(col("value")))\
.agg(avg("id"), pythonUDF(col("value")))\
.explain(extended=True)
```
Currently results in a plan like this:
```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```
and then it throws:
```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```
- How canonicalization fixes this:
- nondeterministic PythonUDF expressions always have distinct resultIds per udf
- The fix is to canonicalize the expressions when matching. Canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions.
- for deterministic UDFs, this rule does not apply and "Post Analysis" batch extracts and deduplicates the expressions, as expected
### Why are the changes needed?
- the output of the query with the fix applied still makes sense - the nondeterministic UDF is invoked only once, in the project.
### Does this PR introduce _any_ user-facing change?
Yes, it's additive, it enables queries to run that previously threw errors.
### How was this patch tested?
- added unit test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.
Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
huangxiaopingRD pushed a commit that referenced this pull request on Sep 2, 2025
…plan properly ### What changes were proposed in this pull request? Make `ResolveRelations` handle plan id properly cherry-pick bugfix apache#45214 to 3.5 ### Why are the changes needed? bug fix for Spark Connect, it won't affect classic Spark SQL before this PR: ``` from pyspark.sql import functions as sf spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1") spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2") df1 = spark.read.table("test_table_1") df2 = spark.read.table("test_table_2") df3 = spark.read.table("test_table_1") join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2) join2 = df3.join(join1, how="left", on=join1.index==df3.id) join2.schema ``` fails with ``` AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704 ``` That is due to existing plan caching in `ResolveRelations` doesn't work with Spark Connect ``` === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === '[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id) !:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 !+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2] ! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index) ! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 ! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2 ! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false Can not resolve 'id with plan 7 ``` `[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one ``` :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? added ut ### Was this patch authored or co-authored using generative AI tooling? ci Closes apache#46291 from zhengruifeng/connect_fix_read_join_35. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
huangxiaopingRD pushed a commit that referenced this pull request on Sep 2, 2025
…plan properly ### What changes were proposed in this pull request? Make `ResolveRelations` handle plan id properly cherry-pick bugfix apache#45214 to 3.4 ### Why are the changes needed? bug fix for Spark Connect, it won't affect classic Spark SQL before this PR: ``` from pyspark.sql import functions as sf spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1") spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2") df1 = spark.read.table("test_table_1") df2 = spark.read.table("test_table_2") df3 = spark.read.table("test_table_1") join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2) join2 = df3.join(join1, how="left", on=join1.index==df3.id) join2.schema ``` fails with ``` AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704 ``` That is due to existing plan caching in `ResolveRelations` doesn't work with Spark Connect ``` === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === '[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id) !:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 !+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2] ! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index) ! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 ! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2 ! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false Can not resolve 'id with plan 7 ``` `[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one ``` :- '[#9]SubqueryAlias spark_catalog.default.test_table_1 +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? added ut ### Was this patch authored or co-authored using generative AI tooling? ci Closes apache#46290 from zhengruifeng/connect_fix_read_join_34. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>