forked from apache/spark
[pull] master from apache:master #13
Merged
Conversation
### What changes were proposed in this pull request?
This PR aims to upgrade zstd-jni to 1.5.2-5.

### Why are the changes needed?
This version starts to support magicless data frames:
- luben/zstd-jni#151
- luben/zstd-jni#235

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

Closes #38412 from LuciferYang/zstd-1.5.2-5.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
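For illustration of the magicless-frame feature mentioned above, here is a minimal sketch using the Python `zstandard` binding (zstd-jni is the Java binding this PR actually upgrades; the underlying frame-format feature is the same). The `format` parameter and the `FORMAT_ZSTD1_MAGICLESS` constant are assumptions about that binding's API, not part of this PR:

```python
import zstandard as zstd

# Magicless frames omit the 4-byte magic number at the start of each
# frame, saving a little space when many small frames are written.
# Assumption: python-zstandard exposes zstd's format knob via
# ZstdCompressionParameters and ZstdDecompressor(format=...).
params = zstd.ZstdCompressionParameters.from_level(
    3, format=zstd.FORMAT_ZSTD1_MAGICLESS)
frame = zstd.ZstdCompressor(compression_params=params).compress(b"hello spark")

# A magicless frame is only decodable by a decompressor configured for it.
dctx = zstd.ZstdDecompressor(format=zstd.FORMAT_ZSTD1_MAGICLESS)
assert dctx.decompress(frame) == b"hello spark"
```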
### What changes were proposed in this pull request?
This PR aims to upgrade Apache Arrow to 10.0.0.

### Why are the changes needed?
This version brings some bug fixes and improvements; the official release notes are as follows:
- https://arrow.apache.org/release/10.0.0.html

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

Closes #38369 from LuciferYang/SPARK-40895.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…e `Dockerfile.java17`

### What changes were proposed in this pull request?
This PR aims to use `Java 17` in the K8s Dockerfile by default and remove `Dockerfile.java17`.

### Why are the changes needed?
To update for Apache Spark 3.4.0.
```
$ docker run -it --rm kubespark/spark:dev cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.1 LTS"
```
```
$ docker run -it --rm kubespark/spark:dev java -version | grep 64
OpenJDK 64-Bit Server VM Temurin-17.0.4.1+1 (build 17.0.4.1+1, mixed mode, sharing)
```

### Does this PR introduce _any_ user-facing change?
Yes, but only in that the published Docker images will get the latest OS (Ubuntu 22.04.1 LTS) and Java (17 LTS).

### How was this patch tested?
Pass the CIs.

Closes #38417 from dongjoon-hyun/SPARK-40941.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…llow multiple window_time calls

### What changes were proposed in this pull request?
This PR proposes to loosen the requirement of the window_time rule to allow multiple distinct window_time calls. After this change, users can call the window_time function with different windows in the same logical node (select, where, etc.).

Given that we allow multiple calls of window_time in a projection, we are no longer able to use the reserved column name "window_time". This PR picked up the SQL representation of WindowTime to distinguish each distinct function call. (This is different from time window/session window, but arguably those are incorrect. It's just that we can't fix them now, since the change would incur backward incompatibility.)

### Why are the changes needed?
The rule for window_time followed the existing rules for time window/session window, which only allow a single function call in the same projection (strictly speaking, they consider the function to be called once if it is called with the same parameters). For the time window/session window rules, the restriction makes sense, since allowing multiple calls would produce a Cartesian product of rows (although Spark can handle it). But given that window_time only produces one value, the restriction no longer makes sense.

### Does this PR introduce _any_ user-facing change?
Yes, since it changes the resulting column name from a window_time function call, but the function is not released yet.

### How was this patch tested?
New test case.

Closes #38361 from HeartSaVioR/SPARK-40892.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
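As a hedged illustration of what the loosened rule permits, here is a minimal PySpark sketch assuming the `pyspark.sql.functions.window_time` API that this work targets (the data and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical event data with a timestamp column.
events = spark.createDataFrame(
    [("2022-10-26 10:01:00", 1)], ["ts", "v"]
).withColumn("ts", F.col("ts").cast("timestamp"))

windowed = events.groupBy(F.window("ts", "5 minutes")).agg(F.count("*").alias("cnt"))

# Each window_time call is now distinguished by its SQL representation in
# the generated column name, so a projection may call it more than once.
windowed.select(
    F.window_time("window").alias("wt1"),
    F.window_time("window").alias("wt2"),
    "cnt",
).show()
```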
### What changes were proposed in this pull request?
This PR proposes to upgrade the pandas version to 1.5.0, since the new pandas version has been released. Please refer to [What's new in 1.5.0](https://pandas.pydata.org/docs/whatsnew/v1.5.0.html) for more detail.

### Why are the changes needed?
Pandas API on Spark should follow the latest pandas.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The existing tests should pass.

Closes #37955 from itholic/SPARK-40512.

Authored-by: itholic <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…atedScalarSubquery

### What changes were proposed in this pull request?
This PR updates `splitSubquery` in `RewriteCorrelatedScalarSubquery` to support non-aggregated one-row subqueries. In CheckAnalysis, we allow three types of correlated scalar subquery patterns:
1. SubqueryAlias/Project + Aggregate
2. SubqueryAlias/Project + Filter + Aggregate
3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)

https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L851-L856

We should support the third case in `splitSubquery` to avoid `Unexpected operator` exceptions.

### Why are the changes needed?
To fix an issue with correlated subquery rewrite.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests.

Closes #38336 from allisonwang-db/spark-40862-split-subquery.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
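As a hedged sketch of the third pattern, here is a correlated scalar subquery whose plan is a one-row Project rather than an Aggregate (the view name is made up for this sketch, and the exact query shapes the analyzer accepts may vary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(3).createOrReplaceTempView("t1")

# The subquery reads the outer column `id` but contains no aggregate; its
# plan is a Project over a one-row relation (maxRows <= 1), i.e. the third
# CheckAnalysis pattern that splitSubquery must handle without throwing
# an `Unexpected operator` exception.
spark.sql("SELECT id, (SELECT id + 1) AS next_id FROM t1").show()
```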
### What changes were proposed in this pull request?
1. Remove unused error classes: INCONSISTENT_BEHAVIOR_CROSS_VERSION.FORMAT_DATETIME_BY_NEW_PARSER, NAMESPACE_ALREADY_EXISTS, NAMESPACE_NOT_EMPTY, NAMESPACE_NOT_FOUND.
2. Rename the error class WRONG_NUM_PARAMS to WRONG_NUM_ARGS.
3. Use the correct error class INDEX_ALREADY_EXISTS in the exception `IndexAlreadyExistsException` instead of INDEX_NOT_FOUND.
4. Quote regexp patterns with ''.
5. Fix indentation in [QueryCompilationErrors.scala](https://github.com/apache/spark/pull/38398/files#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56).

### Why are the changes needed?
To address tech debt.

### Does this PR introduce _any_ user-facing change?
Yes, it modifies user-facing error messages.

### How was this patch tested?
By running the modified test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *SQLQuerySuite"
$ build/sbt "test:testOnly *StringExpressionsSuite"
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
```

Closes #38398 from MaxGekk/remove-unused-error-classes.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…me API

### What changes were proposed in this pull request?
This PR migrates all existing proto tests to be DataFrame API based.

### Why are the changes needed?
1. The goal of the proto tests is to test the capability of representing DataFrames with the Connect proto, so comparing with the DataFrame API is more accurate.
2. Some Connect plan executions require a SparkSession anyway. We can unify all tests into one suite by only using the DataFrame API (e.g. we can merge `SparkConnectDeduplicateSuite.scala` into `SparkConnectProtoSuite.scala`).
3. This also enables the possibility of testing results (not only plans) in the future.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #38406 from amaliujia/refactor_server_tests.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request on Jul 21, 2025
…ingBuilder`

### What changes were proposed in this pull request?
This PR aims to improve `toString` with `JEP-280` instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?
Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both the Java source code and the bytecode are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java
- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+   Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
public java.lang.String toString();
  Code:
     0: new           #39  // class org/apache/commons/lang3/builder/ToStringBuilder
     3: dup
     4: aload_0
     5: getstatic     #41  // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
     8: invokespecial #47  // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
    11: ldc           #50  // String appId
    13: aload_0
    14: getfield      #7   // Field appId:Ljava/lang/String;
    17: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    20: ldc           #55  // String execId
    22: aload_0
    23: getfield      #13  // Field execId:Ljava/lang/String;
    26: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    29: ldc           #56  // String blockIds
    31: aload_0
    32: getfield      #16  // Field blockIds:[Ljava/lang/String;
    35: invokestatic  #57  // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    38: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    41: invokevirtual #61  // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
    44: areturn
```

**AFTER**
```
public java.lang.String toString();
  Code:
     0: aload_0
     1: getfield      #7   // Field appId:Ljava/lang/String;
     4: aload_0
     5: getfield      #13  // Field execId:Ljava/lang/String;
     8: aload_0
     9: getfield      #16  // Field blockIds:[Ljava/lang/String;
    12: invokestatic  #39  // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    15: invokedynamic #43, 0  // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
    20: areturn
```

### Does this PR introduce _any_ user-facing change?
No. This is a `toString` implementation improvement.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
pull bot pushed a commit that referenced this pull request on Aug 19, 2025
…onicalized expressions
### What changes were proposed in this pull request?
Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects PySpark UDFs in particular. Example:
```
from pyspark.sql.functions import col, avg, udf
pythonUDF = udf(lambda x: x).asNondeterministic()
spark.range(10)\
.selectExpr("id", "id % 3 as value")\
.groupBy(pythonUDF(col("value")))\
.agg(avg("id"), pythonUDF(col("value")))\
.explain(extended=True)
```
Currently results in a plan like this:
```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```
and then it throws:
```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```
- how canonicalization fixes this:
  - nondeterministic PythonUDF expressions always have distinct resultIds per UDF call
  - the fix is to canonicalize the expressions when matching. Canonicalized means that we set the resultIds to -1, allowing us to dedup the PythonUDF expressions (see the sketch below)
  - for deterministic UDFs, this rule does not apply, and the "Post Analysis" batch extracts and deduplicates the expressions, as expected
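Below is a conceptual Python sketch of that dedup idea; the class and field names are illustrative only, not Spark source:

```python
from dataclasses import dataclass, replace

# Each nondeterministic UDF call carries a fresh resultId, so plain
# equality treats two textually identical calls as different expressions.
# Canonicalizing (pinning the id to -1) makes them compare equal and
# therefore dedupable when matching group and aggregate expressions.

@dataclass(frozen=True)
class PythonUDF:
    name: str
    arg: str
    result_id: int  # unique per call site in the real plan

    def canonicalized(self) -> "PythonUDF":
        return replace(self, result_id=-1)

a = PythonUDF("dummyNondeterministicUDF", "value", result_id=7)
b = PythonUDF("dummyNondeterministicUDF", "value", result_id=8)

assert a != b                                  # why matching failed before
assert a.canonicalized() == b.canonicalized()  # why the fix dedups them
```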
### Why are the changes needed?
- the output of the query with the fix applied still makes sense: the nondeterministic UDF is invoked only once, in the Project
### Does this PR introduce _any_ user-facing change?
Yes. It's additive: it enables queries to run that previously threw errors.
### How was this patch tested?
- added unit test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.
Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request on Nov 2, 2025
…building

### What changes were proposed in this pull request?
This PR aims to add `libwebp-dev` to fix `dev/infra/Dockerfile` building.

### Why are the changes needed?
To fix the `build_infra_images_cache` GitHub Action job:
- https://github.com/apache/spark/actions/workflows/build_infra_images_cache.yml

![Screenshot 2025-11-02 at 14 56 19](https://github.com/user-attachments/assets/f70d6093-6574-40f3-a097-ba5c9086f3c1)

The root cause is identical to the other Dockerfile failure:
```
#13 578.4 -------------------------- [ERROR MESSAGE] ---------------------------
#13 578.4 <stdin>:1:10: fatal error: ft2build.h: No such file or directory
#13 578.4 compilation terminated.
#13 578.4 --------------------------------------------------------------------
#13 578.4 ERROR: configuration failed for package 'ragg'
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs. Especially, the `Cache base image` test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#52840 from dongjoon-hyun/SPARK-54141.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit that referenced this pull request on Nov 25, 2025
See Commits and Changes for more details.

Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor :)