[pull] master from apache:master #27
Merged
…he child plan when constructing Connect proto in the Python client

### What changes were proposed in this pull request?
The `SubqueryAlias` in `plan.py` does not set the child plan when constructing the Connect proto. This PR fixes that issue by setting the input field during proto construction.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38607 from amaliujia/fix_subquery_alias.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
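A minimal sketch of the fixed construction, assuming the `SubqueryAlias` class in the Spark Connect Python client's `plan.py` and the proto fields `subquery_alias.input` / `subquery_alias.alias` (names are assumptions for illustration, not the exact patch):

```python
import pyspark.sql.connect.proto as proto
from pyspark.sql.connect.plan import LogicalPlan


class SubqueryAlias(LogicalPlan):
    def __init__(self, child: LogicalPlan, alias: str) -> None:
        super().__init__(child)
        self._alias = alias

    def plan(self, session) -> proto.Relation:
        plan = proto.Relation()
        # The fix: forward the child plan into the relation's input field
        # instead of leaving it unset.
        plan.subquery_alias.input.CopyFrom(self._child.plan(session))
        plan.subquery_alias.alias = self._alias
        return plan
```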
…Connect proto

### What changes were proposed in this pull request?
I was writing test cases to test expressions and realized that we can allow `Project` without an input plan. For example, `SELECT 1` is a valid query. For SQL it will generate `OneRowRelation` to make up the input plan, but Connect users shouldn't need to bother appending that relation. Instead, they can just submit a `Project` with expressions. Per our design, the proto is also an API layer and anyone can draft a proto plan without using the built-in clients. This PR improves the proto usability for `Project`.

### Why are the changes needed?
1. Improve usability.
2. Help write test cases for expressions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38632 from amaliujia/SPARK-41116.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… should pass last unclosed comment to backend`

### What changes were proposed in this pull request?
Fix the flaky test `spark-sql should pass last unclosed comment to backend`. Its timeout is likely caused by an unstable GA environment, so this PR increases the timeout for the test.

### Why are the changes needed?
Fix a flaky test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #38571 from AngersZhuuuu/SPARK-37555-FOLOWUP.

Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Implement `DataFrame.show`

### Why are the changes needed?
API coverage

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT

Closes #38621 from zhengruifeng/connect_df_show.

Lead-authored-by: Ruifeng Zheng <[email protected]>
Co-authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
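A minimal usage sketch (assuming `spark` is a Spark Connect session); `show` renders the DataFrame as an ASCII table, matching the vanilla PySpark API shape:

```python
df = spark.range(3)
df.show()
# +---+
# | id|
# +---+
# |  0|
# |  1|
# |  2|
# +---+
```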
…`/tmp` directory

### What changes were proposed in this pull request?
This PR aims to make `entrypoint.sh` use its WORKDIR instead of the `/tmp` directory for its temporary file, `java_opts.txt`.

### Why are the changes needed?
In some environments, `/tmp` is shared with other containers or even other pods.

### Does this PR introduce _any_ user-facing change?
No, this `/tmp/java_opts.txt` file is not supposed to be shared by others.

### How was this patch tested?
Pass the CIs (including K8s Integration Tests).

Closes #38641 from dongjoon-hyun/SPARK-41126.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…INVALID_LIKE_PATTERN

### What changes were proposed in this pull request?
In the PR, I propose to rename the legacy error class `_LEGACY_ERROR_TEMP_1216` to `INVALID_LIKE_PATTERN`.

### Why are the changes needed?
Proper names of error classes should improve the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running the affected test suites:
> $ build/sbt "sql/testOnly *SQLQuerySuite"
> $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z strings.sql"

Closes #38615 from panbingkun/error_classes_invalid_like_pattern.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
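For context, a minimal repro of an invalid LIKE pattern is sketched below (assuming an active `spark` session; that this exact query maps onto the renamed class is an inference from the escape-character sub-errors shown later on this page):

```python
# A pattern that ends with the escape character is invalid, so this should
# now fail with the INVALID_LIKE_PATTERN error class rather than the legacy
# _LEGACY_ERROR_TEMP_1216 template.
spark.sql("SELECT 'a%' LIKE 'a$' ESCAPE '$'").show()
```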
…_POS_AGGREGATE`

### What changes were proposed in this pull request?
This PR proposes to rename `GROUP_BY_POS_REFERS_AGG_EXPR` to `GROUP_BY_POS_AGGREGATE`.

### Why are the changes needed?
The error class name should be simplified as much as possible.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"
```

Closes #38600 from itholic/SPARK-41098.

Authored-by: itholic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…bution.sh` and make `build/mvn` use the same JAVA_OPTS

### What changes were proposed in this pull request?
This PR aims to make `build/mvn` use the same JAVA_OPTS as `dev/make-distribution.sh` to speed up the mvn build. In addition, this PR deletes the duplicate `-Xmx4g` option from `make-distribution.sh`.

### Why are the changes needed?
To speed up the mvn build. Run the following commands:
```
build/mvn clean -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -DskipTests
build/mvn clean install -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -DskipTests
```

**Before**
- x86 Linux: `[INFO] Total time: 01:22 h ([INFO] Spark Project Examples ............................. SUCCESS [41:30 min])`
- M1 MacOS: `[INFO] Total time: 01:26 h ([INFO] Spark Project Examples ............................. SUCCESS [41:59 min])`

**After**
- x86 Linux: `[INFO] Total time: 29:11 min ([INFO] Spark Project Examples ............................. SUCCESS [01:22 min])`
- M1 MacOS: `[INFO] Total time: 25:14 min ([INFO] Spark Project Examples ............................. SUCCESS [01:08 min])`

From the above results, the new JAVA_OPTS reduces mvn build time by 65%+.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38589 from LuciferYang/SPARK-41087.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR is a follow-up to fix a Scala style error.

### Why are the changes needed?
Currently, all CIs are broken due to this. To recover the CIs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #38645 from dongjoon-hyun/SPARK-41109.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This is a followup of #38557. We found that some optimizer rules can't be applied twice (those in the `Once` batch), but running the rule `OptimizeSubqueries` twice breaks it as it optimizes subqueries twice. This PR partially reverts #38557 to still invoke `OptimizeSubqueries` in `RowLevelOperationRuntimeGroupFiltering`. We don't fully revert #38557 because it's still beneficial to use an IN subquery directly instead of using the DPP framework, as there is no join.

### Why are the changes needed?
Fix the optimizer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #38626 from cloud-fan/follow.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ableSuite`
### What changes were proposed in this pull request?
This PR aims to fix the error class order of `ESC_IN_THE_MIDDLE` and `ESC_AT_THE_END` to make the GA task pass.
### Why are the changes needed?
Fix a failed GA test task.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GA
- Manual test:
```
build/sbt "core/testOnly *SparkThrowableSuite"
```
**Before**
```
[info] - Error classes are correctly formatted *** FAILED *** (91 milliseconds)
[info] "...ass" : {
[info] "ESC_[AT_THE_END" : {
[info] "message" : [
[info] "the escape character is not allowed to end with."
[info] ]
[info] },
[info] "ESC_IN_THE_MIDDLE" : {
[info] "message" : [
[info] "the escape character is not allowed to precede <char>]."
[info] ]
[info] }..." did not equal "...ass" : {
[info] "ESC_[IN_THE_MIDDLE" : {
[info] "message" : [
[info] "the escape character is not allowed to precede <char>."
[info] ]
[info] },
[info] "ESC_AT_THE_END" : {
[info] "message" : [
[info] "the escape character is not allowed to end with]."
[info] ]
[info] }..." (SparkThrowableSuite.scala:98)
```
**After**
```
[info] SparkThrowableSuite:
[info] - No duplicate error classes (39 milliseconds)
[info] - Error classes are correctly formatted (61 milliseconds)
[info] - SQLSTATE invariants (13 milliseconds)
[info] - Message invariants (15 milliseconds)
[info] - Message format invariants (33 milliseconds)
[info] - Round trip (33 milliseconds)
[info] - Check if error class is missing (32 milliseconds)
[info] - Check if message parameters match message format (8 milliseconds)
[info] - Error message is formatted (1 millisecond)
[info] - Try catching legacy SparkError (1 millisecond)
[info] - Try catching SparkError with error class (1 millisecond)
[info] - Try catching internal SparkError (1 millisecond)
[info] - Get message in the specified format (15 milliseconds)
[info] - overwrite error classes (190 milliseconds)
[info] - prohibit dots in error class names (57 milliseconds)
[info] Run completed in 2 seconds, 317 milliseconds.
[info] Total number of tests run: 15
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 16 s, completed 2022-11-14 19:34:11
```
Closes #38658 from LuciferYang/SPARK-41109-FOLLOWUP.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?
Parquet supports the FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, the Spark Parquet reader currently cannot handle FLBA. This PR proposes to read FLBA columns as BinaryType data in Spark.

### Why are the changes needed?
The Iceberg Parquet reader, for example, can handle FLBA. This PR reduces the implementation gap between the Spark and Iceberg Parquet readers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test added

Closes #38628 from kazuyukitanimura/SPARK-41096.

Authored-by: Kazuyuki Tanimura <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
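A hedged end-to-end sketch: pyarrow's fixed-size binary type is written to Parquet as FIXED_LEN_BYTE_ARRAY, which Spark should now read back as `BinaryType` (assuming an active `spark` session and a writable `/tmp` path):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# fixed_size_binary(16) is stored in Parquet as FIXED_LEN_BYTE_ARRAY.
table = pa.table({"id": pa.array([b"0123456789abcdef"], type=pa.binary(16))})
pq.write_table(table, "/tmp/flba.parquet")

# With this change, Spark reads the FLBA column as binary instead of failing.
spark.read.parquet("/tmp/flba.parquet").printSchema()
# root
#  |-- id: binary (nullable = true)
```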
… batch in main thread

### What changes were proposed in this pull request?
Document the reason for sending the batch in the main thread.

### Why are the changes needed?
As per #38613 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No, doc-only

Closes #38654 from zhengruifeng/connect_doc_collect.

Lead-authored-by: Ruifeng Zheng <[email protected]>
Co-authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
The PR aims to upgrade joda-time from 2.12.0 to 2.12.1.

### Why are the changes needed?
The new version brings some bug fixes; the release notes are as follows: https://www.joda.org/joda-time/changes-report.html#a2.12.1

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #38636 from panbingkun/upgrade_joda.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
The PR aims to upgrade mysql-connector-java from 8.0.30 to 8.0.31. From v8.0.31 the artifact was moved to: [com.mysql](https://mvnrepository.com/artifact/com.mysql) » [mysql-connector-j](https://mvnrepository.com/artifact/com.mysql/mysql-connector-j)

### Why are the changes needed?
This version brings some bug fixes; the release notes are as follows: https://dev.mysql.com/doc/relnotes/connector-j/8.0/en/news-8-0-31.html

Bugs fixed, e.g.:
> Executing a PreparedStatment after applying setFetchSize(0) on it caused an ArrayIndexOutOfBoundsException. (Bug #104753, Bug #33286177)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #38639 from panbingkun/upgrade_mysql_connector.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…-PROTOBUF

### What changes were proposed in this pull request?
- Adding message classname support for pyspark-protobuf: the functions `from_protobuf` and `to_protobuf`

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- Python doctests.
- To avoid adding an extra jar, I used the existing spark-protobuf jar because it has a shaded google-protobuf, and used a Timestamp message as an example protobuf messageClassName in the doctests.

rangadi HyukjinKwon

Closes #38603 from SandishKumarHN/classname-support-pyspark-protobuf.

Authored-by: SandishKumarHN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
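A hedged usage sketch of the class-name form. The function signatures follow the PySpark protobuf module; the shaded `Timestamp` class name and the input schema are assumptions for illustration:

```python
from pyspark.sql.functions import struct
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

# Timestamp has `seconds` and `nanos` fields; the shaded class name below is
# an assumption based on the spark-protobuf jar described above.
msg_class = "org.sparkproject.spark_protobuf.protobuf.Timestamp"

df = spark.createDataFrame([(1668400000, 0)], ["seconds", "nanos"])
proto_df = df.select(
    to_protobuf(struct("seconds", "nanos"), msg_class).alias("proto")
)
restored = proto_df.select(from_protobuf("proto", msg_class).alias("ts"))
restored.show(truncate=False)
```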
### What changes were proposed in this pull request?
Spark is an open system where users can use plugins in different places. When people hit a bug, it may not come from Spark, but from these plugins. We should make this clear in the error message.

### Why are the changes needed?
Better error message

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #38648 from cloud-fan/internal.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ingQueryException`

### What changes were proposed in this pull request?
In the PR, I propose to change the exception `StreamingQueryException` by extending `SparkThrowable`, with the new error class `STREAM_FAILED` which is set when raising `StreamingQueryException` on any stream failure.

### Why are the changes needed?
To be consistent with other Spark exceptions like `SparkException`, and to improve user experience with Spark APIs.

### Does this PR introduce _any_ user-facing change?
Yes, it extends the existing API.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "sql/testOnly *QueryExecutionErrorsSuite"
```

Closes #38629 from MaxGekk/StreamingQueryException-SparkThrowable.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
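A hedged sketch of what this looks like from PySpark; `query` is assumed to be a running `StreamingQuery`, and the `getErrorClass` accessor on the captured exception is an assumption based on the `SparkThrowable` interface:

```python
from pyspark.sql.utils import StreamingQueryException

try:
    query.awaitTermination()
except StreamingQueryException as e:
    # With this change, the underlying JVM exception carries an error class.
    print(e.getErrorClass())  # expected: STREAM_FAILED (assumption)
```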
…JOIN_TYPE`

### What changes were proposed in this pull request?
This PR proposes to rename `LATERAL_JOIN_OF_TYPE` to `INVALID_LATERAL_JOIN_TYPE`. Also remove this from the sub-class under `UNSUPPORTED_FEATURE`, and improve the error messages.

### Why are the changes needed?
This should not belong to `UNSUPPORTED_FEATURE`, since RIGHT and FULL outer joins make no sense for LATERAL joins.

### Does this PR introduce _any_ user-facing change?
The error message improved. From
```
The feature is not supported: RIGHT OUTER JOIN with LATERAL correlation.
```
To
```
The RIGHT OUTER JOIN with LATERAL correlation is not allowed because an OUTER subquery cannot correlate to its join partner. Remove the LATERAL correlation or use an INNER JOIN, or LEFT OUTER JOIN instead.
```

### How was this patch tested?
```
./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*"
```

Closes #38652 from itholic/SPARK-41137.

Authored-by: itholic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
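An assumed minimal repro (table and query invented for illustration) that should now raise the renamed error class:

```python
spark.sql("CREATE TABLE t1(c1 BIGINT) USING PARQUET")
# RIGHT OUTER with a LATERAL correlation is rejected with the improved
# message quoted above (error class: INVALID_LATERAL_JOIN_TYPE).
spark.sql("SELECT * FROM t1 RIGHT OUTER JOIN LATERAL (SELECT c1) s")
```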
…n Python client

### What changes were proposed in this pull request?
This PR adds `CreateGlobalTempView` and `CreateOrReplaceGlobalTempView` to the Python DataFrame API. Meanwhile, this PR extends `LogicalPlan` to let it have the ability to deal with `Command`.

### Why are the changes needed?
Improve API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38642 from amaliujia/create_temp_view_in_python.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
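A minimal usage sketch, assuming `spark` is a Spark Connect session; the API shape matches vanilla PySpark:

```python
df = spark.range(3)
df.createGlobalTempView("people")
# Global temp views live in the reserved `global_temp` database.
spark.sql("SELECT * FROM global_temp.people").show()
# The "replace" variant does not error if the view already exists.
df.createOrReplaceGlobalTempView("people")
```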
…ith in-subquery filter

### What changes were proposed in this pull request?
Apply `ColumnPruning` for the in-subquery filter. Note that the bloom filter side was already fixed by #36047.

### Why are the changes needed?
The inferred in-subquery filter should apply `ColumnPruning` before getting plan statistics and checking whether it can be broadcast. Otherwise, the final physical plan will differ from the expected one.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add test

Closes #38619 from ulysses-you/SPARK-41112.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Skip `UnresolvedHint` in rule `AddMetadataColumns` to avoid calling `exprId` on `UnresolvedAttribute`.

### Why are the changes needed?
```
CREATE TABLE t1(c1 bigint) USING PARQUET;
CREATE TABLE t2(c2 bigint) USING PARQUET;
SELECT /*+ hash(t2) */ * FROM t1 join t2 on c1 = c2;
```
failed with the message:
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:147)
at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4(Analyzer.scala:1005)
at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$4$adapted(Analyzer.scala:1005)
at scala.collection.Iterator.exists(Iterator.scala:969)
at scala.collection.Iterator.exists$(Iterator.scala:967)
at scala.collection.AbstractIterator.exists(Iterator.scala:1431)
at scala.collection.IterableLike.exists(IterableLike.scala:79)
at scala.collection.IterableLike.exists$(IterableLike.scala:78)
at scala.collection.AbstractIterable.exists(Iterable.scala:56)
at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3(Analyzer.scala:1005)
at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3$adapted(Analyzer.scala:1005)
```
But before, it was just a warning: `WARN HintErrorLogger: Unrecognized hint: hash(t2)`

### Does this PR introduce _any_ user-facing change?
Yes, it fixes a regression from 3.3.1. Note, the root cause is that we mark `UnresolvedHint` as resolved if its child is resolved since #32841, and then #37758 triggered this bug.

### How was this patch tested?
Add test

Closes #38662 from ulysses-you/hint.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Proposing the syntax `INSERT INTO tbl REPLACE whereClause identifierList` in Spark SQL, as the equivalent of the [dataframe.overwrite()](https://github.com/apache/spark/blob/35d00df9bba7238ad4f409999617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163) command. For example, `INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2` will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Add Unit Test in [DataSourceV2SQLSuite.scala](9429a6c#diff-eeb429a8e3eb55228451c8dbc2fccca044836be608d62e9166561b005030c940)

Closes #38404 from carlfu-db/replacewhere.

Lead-authored-by: carlfu-db <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…aFrame.na.fill`

### What changes were proposed in this pull request?
Implement `DataFrame.fillna` and `DataFrame.na.fill`

### Why are the changes needed?
For API coverage

### Does this PR introduce _any_ user-facing change?
Yes, new methods

### How was this patch tested?
Added UT

Closes #38653 from zhengruifeng/connect_df_na_fill.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
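A minimal usage sketch (assuming `spark` is a Spark Connect session); both entry points match the vanilla PySpark API:

```python
df = spark.createDataFrame([(1, None), (None, "b")], ["age", "name"])

# fillna with a scalar fills only columns of a matching type.
df.fillna(0).show()

# na.fill with a dict fills each named column with its own value.
df.na.fill({"age": 0, "name": "unknown"}).show()
```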
…o `INVALID_WHERE_CONDITION`

### What changes were proposed in this pull request?
In the PR, I propose to assign the proper name `INVALID_WHERE_CONDITION` to the legacy error class `_LEGACY_ERROR_TEMP_2440`, and modify the test suite to use `checkError()`, which checks the error class name, context, etc. Also, this PR improves the error message.

### Why are the changes needed?
A proper name improves user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
Yes, the PR changes a user-facing error message.

### How was this patch tested?
By running the modified tests:
```
$ build/sbt "test:testOnly *AnalysisErrorSuite"
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

Closes #38656 from MaxGekk/invalid-where-condition-2.

Lead-authored-by: Max Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?
This is a follow-up improving the behavior and compatibility of aggregate relations in Spark Connect. Previously, Spark Connect would not retain the GROUP BY columns in the aggregation output, matching a very old Spark behavior that is disabled by default.

### Why are the changes needed?
Compatibility.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38627 from grundprinzip/SPARK-40875-f.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Martin Grund <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
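A small illustration of the fixed behavior (assuming a Spark Connect session; this corresponds to the default `spark.sql.retainGroupColumns=true` in vanilla Spark):

```python
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "v"])

# The grouping column `key` now appears in the output alongside the aggregate.
df.groupBy("key").agg({"v": "sum"}).show()
# +---+------+
# |key|sum(v)|
# +---+------+
# |  a|     3|
# |  b|     3|
# +---+------+
```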
…nto error classes

### What changes were proposed in this pull request?
This PR replaces `TypeCheckFailure` with `DataTypeMismatch` in type checks in the number formatting or parsing expressions, including:
1. ToNumber (1): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/numberFormatExpressions.scala#L83
2. ToCharacter (1): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/numberFormatExpressions.scala#L227
3. ToNumberParser (1): https://github.com/apache/spark/blob/5556cfc59aa97a3ad4ea0baacebe19859ec0bcb7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ToNumberParser.scala#L262

### Why are the changes needed?
Migration onto error classes unifies Spark SQL error messages.

### Does this PR introduce _any_ user-facing change?
Yes. The PR changes user-facing error messages.

### How was this patch tested?
1. Add new UT
2. Update existing UT
3. Pass GA

Closes #38531 from panbingkun/SPARK-40755.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade sbt from 1.7.3 to 1.8.0.

### Why are the changes needed?
This version fixes 2 CVEs:
- Updates to Coursier 2.1.0-RC1 to address GHSA-wv7w-rj2x-556x
- Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address GHSA-wv7w-rj2x-556x

The full release notes are as follows:
- https://github.com/sbt/sbt/releases/tag/v1.8.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Run `dev/sbt-checkstyle` passed

Closes #38620 from LuciferYang/sbt-180.

Lead-authored-by: YangJie <[email protected]>
Co-authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…t must fit in 2GB

### What changes were proposed in this pull request?
A single task result must fit in 2GB, because we're using a byte array or `ByteBuffer` (which is backed by a byte array as well), and thus have a limit of 2GB (the Java array size limit on `byte[]`). This PR fixes this by replacing the byte array with `ChunkedByteBuffer`.

### Why are the changes needed?
To overcome the 2GB limit for a single task.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #38064 from liuzqt/SPARK-40622.

Authored-by: Ziqi Liu <[email protected]>
Signed-off-by: Mridul <mridul<at>gmail.com>
…t migration guide

### What changes were proposed in this pull request?
This PR is a followup of #38257 to fix a typo from `spark.sql.legacy.skipPartitionSpecTypeValidation` to `spark.sql.legacy.skipTypeValidationOnAlterPartition`.

### Why are the changes needed?
To show users the correct configuration name for legacy behaviours.

### Does this PR introduce _any_ user-facing change?
No, doc-only.

### How was this patch tested?
N/A

Closes #38667 from HyukjinKwon/SPARK-40798-followup3.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
huangxiaopingRD pushed a commit that referenced this pull request on Jan 4, 2023:
…n Aggregate

### What changes were proposed in this pull request?
This PR implements the implicit lateral column alias on the `Aggregate` case. For example,
```sql
-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus)
FROM employee
GROUP BY dept
```
The high-level implementation idea is to insert the `Project` node above, falling back to the resolution of the lateral alias via the Project code path from the last PR.

* Phase 1: recognize resolved lateral aliases and wrap the attributes referencing them with `LateralColumnAliasReference`
* Phase 2: when the `Aggregate` operator is resolved, it goes through the whole aggregation list, extracts the aggregation expressions and grouping expressions to keep them in this `Aggregate` node, and adds a `Project` above with the original output. It doesn't do anything to `LateralColumnAliasReference`, but completely leaves it to the Project in future turns of this rule.

Example:
```
// Before rewrite:
Aggregate [dept#14] [dept#14 AS a#12, 'a + 1, avg(salary#16) AS b#13, 'b + avg(bonus#17)]
+- Child [dept#14,name#15,salary#16,bonus#17]

// After phase 1:
Aggregate [dept#14] [dept#14 AS a#12, lca(a) + 1, avg(salary#16) AS b#13, lca(b) + avg(bonus#17)]
+- Child [dept#14,name#15,salary#16,bonus#17]

// After phase 2:
Project [dept#14 AS a#12, lca(a) + 1, avg(salary)#26 AS b#13, lca(b) + avg(bonus)#27]
+- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
   +- Child [dept#14,name#15,salary#16,bonus#17]

// Now the problem falls back to the lateral alias resolution in Project.
// After future rounds of this rule:
Project [a#12, a#12 + 1, b#13, b#13 + avg(bonus)#27]
+- Project [dept#14 AS a#12, avg(salary)#26 AS b#13]
   +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
      +- Child [dept#14,name#15,salary#16,bonus#17]
```
Similar to the last PR (apache#38776), because the lateral column alias has higher resolution priority than the outer reference, it will try to resolve an `OuterReference` using the lateral column alias, similar to an `UnresolvedAttribute`. On success, it strips `OuterReference` and also wraps it with `LateralColumnAliasReference`.

### Why are the changes needed?
Similar to what is stated in apache#38776.

### Does this PR introduce _any_ user-facing change?
Yes, as shown in the above example, it will be able to resolve lateral column aliases in Aggregate.

### How was this patch tested?
Existing tests and newly added tests.

Closes apache#39040 from anchovYu/SPARK-27561-agg.

Authored-by: Xinyi Yu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
huangxiaopingRD pushed a commit that referenced this pull request on Jun 26, 2023:
…onnect

### What changes were proposed in this pull request?
Implement Arrow-optimized Python UDFs in Spark Connect. Please see apache#39384 for the motivation and performance improvements of Arrow-optimized Python UDFs.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. In the Spark Connect Python Client, users can:

1. Set the `useArrow` parameter to True to enable Arrow optimization for a specific Python UDF.
```sh
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

2. Enable the `spark.sql.execution.pythonUDF.arrow.enabled` Spark conf to make all Python UDFs Arrow-optimized.
```sh
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)
```

### How was this patch tested?
Parity unit tests.

Closes apache#40725 from xinrong-meng/connect_arrow_py_udf.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>