forked from apache/spark
[pull] master from apache:master #13
Merged
Conversation
### What changes were proposed in this pull request?
This PR aims to upgrade zstd-jni to 1.5.2-5.

### Why are the changes needed?
This version starts to support magicless data frames:
- luben/zstd-jni#151
- luben/zstd-jni#235

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

Closes #38412 from LuciferYang/zstd-1.5.2-5.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
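For illustration of the magicless-frame feature mentioned above, here is a minimal sketch using the Python `zstandard` binding (zstd-jni is the Java binding this PR actually upgrades; the underlying frame-format feature is the same). The `format` parameter and the `FORMAT_ZSTD1_MAGICLESS` constant are assumptions about that binding's API, not part of this PR:

```python
import zstandard as zstd

# Magicless frames omit the 4-byte magic number at the start of each
# frame, saving a little space when many small frames are written.
# Assumption: python-zstandard exposes zstd's format knob via
# ZstdCompressionParameters and ZstdDecompressor(format=...).
params = zstd.ZstdCompressionParameters.from_level(
    3, format=zstd.FORMAT_ZSTD1_MAGICLESS)
frame = zstd.ZstdCompressor(compression_params=params).compress(b"hello spark")

# A magicless frame is only decodable by a decompressor configured for it.
dctx = zstd.ZstdDecompressor(format=zstd.FORMAT_ZSTD1_MAGICLESS)
assert dctx.decompress(frame) == b"hello spark"
```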
### What changes were proposed in this pull request?
This PR aims to upgrade Apache Arrow to 10.0.0.

### Why are the changes needed?
This version brings some bug fixes and improvements; the official release notes are as follows:
- https://arrow.apache.org/release/10.0.0.html

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

Closes #38369 from LuciferYang/SPARK-40895.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…e `Dockerfile.java17`

### What changes were proposed in this pull request?
This PR aims to use `Java 17` in the K8s Dockerfile by default and remove `Dockerfile.java17`.

### Why are the changes needed?
To update for Apache Spark 3.4.0.
```
$ docker run -it --rm kubespark/spark:dev cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.1 LTS"
```
```
$ docker run -it --rm kubespark/spark:dev java -version | grep 64
OpenJDK 64-Bit Server VM Temurin-17.0.4.1+1 (build 17.0.4.1+1, mixed mode, sharing)
```

### Does this PR introduce _any_ user-facing change?
Yes, but only in that the published Docker images will get the latest OS (Ubuntu 22.04.1 LTS) and Java (17 LTS).

### How was this patch tested?
Pass the CIs.

Closes #38417 from dongjoon-hyun/SPARK-40941.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…llow multiple window_time calls

### What changes were proposed in this pull request?
This PR proposes to loosen the requirement of the window_time rule to allow multiple distinct window_time calls. After this change, users can call the window_time function with different windows in the same logical node (select, where, etc.).

Given that we allow multiple calls of window_time in a projection, we are no longer able to use the reserved column name "window_time". This PR picked up the SQL representation of WindowTime to distinguish each distinct function call. (This is different from time window/session window, but arguably those are incorrect. It's just that we can't fix them now, since the change would incur backward incompatibility.)

### Why are the changes needed?
The rule for window_time followed the existing rules for time window/session window, which only allow a single function call in the same projection (strictly speaking, they consider the function to be called once if it is called with the same parameters). For the time window/session window rules, the restriction makes sense, since allowing multiple calls would produce a Cartesian product of rows (although Spark can handle it). But given that window_time only produces one value, the restriction no longer makes sense.

### Does this PR introduce _any_ user-facing change?
Yes, since it changes the resulting column name from a window_time function call, but the function is not released yet.

### How was this patch tested?
New test case.

Closes #38361 from HeartSaVioR/SPARK-40892.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
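As a hedged illustration of what the loosened rule permits, here is a minimal PySpark sketch assuming the `pyspark.sql.functions.window_time` API that this work targets (the data and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical event data with a timestamp column.
events = spark.createDataFrame(
    [("2022-10-26 10:01:00", 1)], ["ts", "v"]
).withColumn("ts", F.col("ts").cast("timestamp"))

windowed = events.groupBy(F.window("ts", "5 minutes")).agg(F.count("*").alias("cnt"))

# Each window_time call is now distinguished by its SQL representation in
# the generated column name, so a projection may call it more than once.
windowed.select(
    F.window_time("window").alias("wt1"),
    F.window_time("window").alias("wt2"),
    "cnt",
).show()
```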
### What changes were proposed in this pull request?
This PR proposes to upgrade the pandas version to 1.5.0, since the new pandas version has been released. Please refer to [What's new in 1.5.0](https://pandas.pydata.org/docs/whatsnew/v1.5.0.html) for more detail.

### Why are the changes needed?
Pandas API on Spark should follow the latest pandas.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The existing tests should pass.

Closes #37955 from itholic/SPARK-40512.

Authored-by: itholic <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…atedScalarSubquery

### What changes were proposed in this pull request?
This PR updates `splitSubquery` in `RewriteCorrelatedScalarSubquery` to support non-aggregated one-row subqueries. In CheckAnalysis, we allow three types of correlated scalar subquery patterns:
1. SubqueryAlias/Project + Aggregate
2. SubqueryAlias/Project + Filter + Aggregate
3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)

https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L851-L856

We should support the third case in `splitSubquery` to avoid `Unexpected operator` exceptions.

### Why are the changes needed?
To fix an issue with correlated subquery rewrite.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests.

Closes #38336 from allisonwang-db/spark-40862-split-subquery.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
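As a hedged sketch of the third pattern, here is a correlated scalar subquery whose plan is a one-row Project rather than an Aggregate (the view name is made up for this sketch, and the exact query shapes the analyzer accepts may vary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(3).createOrReplaceTempView("t1")

# The subquery reads the outer column `id` but contains no aggregate; its
# plan is a Project over a one-row relation (maxRows <= 1), i.e. the third
# CheckAnalysis pattern that splitSubquery must handle without throwing
# an `Unexpected operator` exception.
spark.sql("SELECT id, (SELECT id + 1) AS next_id FROM t1").show()
```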
### What changes were proposed in this pull request?
1. Remove unused error classes: INCONSISTENT_BEHAVIOR_CROSS_VERSION.FORMAT_DATETIME_BY_NEW_PARSER, NAMESPACE_ALREADY_EXISTS, NAMESPACE_NOT_EMPTY, NAMESPACE_NOT_FOUND.
2. Rename the error class WRONG_NUM_PARAMS to WRONG_NUM_ARGS.
3. Use the correct error class INDEX_ALREADY_EXISTS in the exception `IndexAlreadyExistsException` instead of INDEX_NOT_FOUND.
4. Quote regexp patterns with ''.
5. Fix indentation in [QueryCompilationErrors.scala](https://github.com/apache/spark/pull/38398/files#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56).

### Why are the changes needed?
To address tech debt.

### Does this PR introduce _any_ user-facing change?
Yes, it modifies user-facing error messages.

### How was this patch tested?
By running the modified test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *SQLQuerySuite"
$ build/sbt "test:testOnly *StringExpressionsSuite"
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
```

Closes #38398 from MaxGekk/remove-unused-error-classes.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…me API

### What changes were proposed in this pull request?
This PR migrates all existing proto tests to be DataFrame API based.

### Why are the changes needed?
1. The goal of the proto tests is to test the capability of representing DataFrames with the Connect proto, so comparing with the DataFrame API is more accurate.
2. Some Connect plan executions require a SparkSession anyway. We can unify all tests into one suite by only using the DataFrame API (e.g. we can merge `SparkConnectDeduplicateSuite.scala` into `SparkConnectProtoSuite.scala`).
3. This also enables the possibility of testing results (not only plans) in the future.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #38406 from amaliujia/refactor_server_tests.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request on Jul 21, 2025
…ingBuilder`

### What changes were proposed in this pull request?
This PR aims to improve `toString` with `JEP-280` instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?
Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both the Java source code and the bytecode are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java
- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+   Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
public java.lang.String toString();
  Code:
     0: new           #39  // class org/apache/commons/lang3/builder/ToStringBuilder
     3: dup
     4: aload_0
     5: getstatic     #41  // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
     8: invokespecial #47  // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
    11: ldc           #50  // String appId
    13: aload_0
    14: getfield      #7   // Field appId:Ljava/lang/String;
    17: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    20: ldc           #55  // String execId
    22: aload_0
    23: getfield      #13  // Field execId:Ljava/lang/String;
    26: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    29: ldc           #56  // String blockIds
    31: aload_0
    32: getfield      #16  // Field blockIds:[Ljava/lang/String;
    35: invokestatic  #57  // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    38: invokevirtual #51  // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    41: invokevirtual #61  // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
    44: areturn
```

**AFTER**
```
public java.lang.String toString();
  Code:
     0: aload_0
     1: getfield      #7   // Field appId:Ljava/lang/String;
     4: aload_0
     5: getfield      #13  // Field execId:Ljava/lang/String;
     8: aload_0
     9: getfield      #16  // Field blockIds:[Ljava/lang/String;
    12: invokestatic  #39  // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    15: invokedynamic #43, 0  // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
    20: areturn
```

### Does this PR introduce _any_ user-facing change?
No. This is a `toString` implementation improvement.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
pull bot pushed a commit that referenced this pull request on Aug 19, 2025
…onicalized expressions
### What changes were proposed in this pull request?
Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects PySpark UDFs in particular. Example:
```
from pyspark.sql.functions import col, avg, udf
pythonUDF = udf(lambda x: x).asNondeterministic()
spark.range(10)\
.selectExpr("id", "id % 3 as value")\
.groupBy(pythonUDF(col("value")))\
.agg(avg("id"), pythonUDF(col("value")))\
.explain(extended=True)
```
Currently results in a plan like this:
```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```
and then it throws:
```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```
- how canonicalization fixes this:
  - nondeterministic PythonUDF expressions always have distinct resultIds per UDF call
  - the fix is to canonicalize the expressions when matching. Canonicalized means that we set the resultIds to -1, allowing us to dedup the PythonUDF expressions (see the sketch below)
  - for deterministic UDFs, this rule does not apply, and the "Post Analysis" batch extracts and deduplicates the expressions, as expected
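Below is a conceptual Python sketch of that dedup idea; the class and field names are illustrative only, not Spark source:

```python
from dataclasses import dataclass, replace

# Each nondeterministic UDF call carries a fresh resultId, so plain
# equality treats two textually identical calls as different expressions.
# Canonicalizing (pinning the id to -1) makes them compare equal and
# therefore dedupable when matching group and aggregate expressions.

@dataclass(frozen=True)
class PythonUDF:
    name: str
    arg: str
    result_id: int  # unique per call site in the real plan

    def canonicalized(self) -> "PythonUDF":
        return replace(self, result_id=-1)

a = PythonUDF("dummyNondeterministicUDF", "value", result_id=7)
b = PythonUDF("dummyNondeterministicUDF", "value", result_id=8)

assert a != b                                  # why matching failed before
assert a.canonicalized() == b.canonicalized()  # why the fix dedups them
```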
### Why are the changes needed?
- the output of the query with the fix applied still makes sense: the nondeterministic UDF is invoked only once, in the Project
### Does this PR introduce _any_ user-facing change?
Yes. It's additive: it enables queries to run that previously threw errors.
### How was this patch tested?
- added unit test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.
Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request on Nov 2, 2025
…building

### What changes were proposed in this pull request?
This PR aims to add `libwebp-dev` to fix `dev/infra/Dockerfile` building.

### Why are the changes needed?
To fix the `build_infra_images_cache` GitHub Action job:
- https://github.com/apache/spark/actions/workflows/build_infra_images_cache.yml

![Screenshot 2025-11-02 at 14 56 19](https://github.com/user-attachments/assets/f70d6093-6574-40f3-a097-ba5c9086f3c1)

The root cause is identical to the other Dockerfile failure:
```
#13 578.4 -------------------------- [ERROR MESSAGE] ---------------------------
#13 578.4 <stdin>:1:10: fatal error: ft2build.h: No such file or directory
#13 578.4 compilation terminated.
#13 578.4 --------------------------------------------------------------------
#13 578.4 ERROR: configuration failed for package 'ragg'
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs. Especially, the `Cache base image` test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#52840 from dongjoon-hyun/SPARK-54141.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit that referenced this pull request on Nov 25, 2025
See Commits and Changes for more details.

Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor :)