
Conversation


@pull pull bot commented Dec 22, 2023

See Commits and Changes for more details.



dbatomic and others added 16 commits December 21, 2023 15:58
…ne table expressions

### What changes were proposed in this pull request?

This PR proposes to do inline table resolution in two phases (see the sketch after this list):
1) If there are no expressions that depend on the current context (e.g. expressions that depend on CURRENT_DATABASE, CURRENT_USER, CURRENT_TIME, etc.), they are evaluated as part of the ResolveInlineTables rule.
2) Expressions that do depend on CURRENT_* evaluation are kept as expressions, and their evaluation is delayed to the post-analysis phase.
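
As a rough illustration, here is a minimal sketch of the phase-1 eligibility check (a hypothetical helper, not the actual rule code):

```scala
import org.apache.spark.sql.catalyst.expressions.{CurrentDate, CurrentTimestamp, CurrentUser, Expression}

// Phase 1 may fold an expression eagerly only if it is foldable and does not
// read session state; otherwise its evaluation is deferred past analysis.
def canEvaluateEarly(e: Expression): Boolean =
  e.foldable && !e.exists {
    case _: CurrentDate | _: CurrentTimestamp | _: CurrentUser => true
    case _ => false
  }
```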

### Why are the changes needed?

This PR aims to solve two problems with inline tables.

Example1:
```sql
SELECT COUNT(DISTINCT ct) FROM VALUES
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()),
(CURRENT_TIMESTAMP()) as data(ct)
```
Prior to this change, this example would return 3 (i.e. each CURRENT_TIMESTAMP expression would return a different value, since they were evaluated individually as part of inline table evaluation). After this change, the result is 1.

Example 2:
```sql
CREATE VIEW V AS (SELECT * FROM VALUES(CURRENT_TIMESTAMP()))
```
In this example, the view would be saved with the literal evaluated during view creation. After this change, CURRENT_TIMESTAMP() is evaluated during view execution.

### Does this PR introduce _any_ user-facing change?

See section above.

### How was this patch tested?

A new test that validates this behaviour is introduced.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44316 from dbatomic/inline_tables_curr_time_fix.

Lead-authored-by: Aleksandar Tomic <[email protected]>
Co-authored-by: Aleksandar Tomic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

This is a followup of #44310. It turns out that using a `TreeNodeTag` on `Project` is far too fragile: `Project` is a very basic node that is easily removed or transformed during plan optimization.

This PR switches to a different approach: since we can't retain the information from `Aggregate` (that input data order doesn't matter), we leverage this information immediately. We pull out the expensive part of `EliminateSorts` into a new rule, so that we can safely call `EliminateSorts` right before we turn `Aggregate` into `Project`.
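
As a rough sketch of the core rewrite (assumed shapes, not Spark's actual rule code):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Sort}

// An Aggregate's result never depends on its input order, so a Sort directly
// below it is redundant and can be dropped before the Aggregate becomes a Project.
def dropSortBelowAggregate(plan: LogicalPlan): LogicalPlan = plan.transform {
  case agg @ Aggregate(_, _, Sort(_, _, child)) => agg.copy(child = child)
}
```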

### Why are the changes needed?

to make the optimizer more robust.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44429 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ctor out `test_arithmetic_*`

### What changes were proposed in this pull request?
Factor out `test_arithmetic_*` from `OpsOnDiffFramesEnabledTests`

### Why are the changes needed?
`OpsOnDiffFramesEnabledTests` and its parity test are slow:
```
Starting test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames (temp output: /__w/spark/spark/python/target/6b1d192e-052f-42d4-9023-04df84120fce/python3.9__pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames__gycsek91.log)
Finished test(python3.9): pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames (740s)
```

Break it into smaller tests that are better suited to parallelism.

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44435 from zhengruifeng/ps_test_diff_ops_arithmetic.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…or timestamp ntz

### What changes were proposed in this pull request?

This fixes a correctness bug. TIMESTAMP_NTZ is a new data type in Spark, so there are no legacy files that need calendar rebasing. However, the vectorized Parquet reader treats it the same as TIMESTAMP_LTZ and may rebase values if the Parquet file was written with the legacy rebase mode. This PR fixes the reader to never rebase NTZ values.
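
A minimal round-trip check of the fixed behaviour (a sketch; the path and `spark.implicits._` setup are assumptions):

```scala
import java.time.LocalDateTime
import spark.implicits._

val path = "/tmp/ntz-rebase-check"  // example location
// A timestamp before the 1582 Gregorian cutover, stored as TIMESTAMP_NTZ:
Seq(LocalDateTime.of(1500, 1, 1, 0, 0)).toDF("ts").write.parquet(path)
// With the fix, the value is read back unchanged -- no calendar rebase for NTZ:
spark.read.parquet(path).show(truncate = false)
```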

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

Yes, now we can correctly write and read back NTZ value even if the date is before 1582.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44428 from cloud-fan/ntz.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…as.tests.series.*`

### What changes were proposed in this pull request?
Move `test_series_datetime` to `pyspark.pandas.tests.series.*`

### Why are the changes needed?
move the test to the right place

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44434 from zhengruifeng/ps_test_mv_ser_ts.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…test` directory

### What changes were proposed in this pull request?
This PR moves `IvyTestUtils` back to the `src/test` directory, where it lived before the refactoring work of #43354.

Meanwhile, in order to let the `core` and `connect-client-jvm` modules use `IvyTestUtils` in their tests, this PR adds the corresponding Maven dependencies to the respective `pom.xml` files.

### Why are the changes needed?
Move `IvyTestUtils` back to `src/test` directory

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44440 from LuciferYang/mv-IvyTestUtils-to-test-dir.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…dTests`: Factor out `test_arithmetic_chain_*`

### What changes were proposed in this pull request?
Factor out `test_arithmetic_chain_*`

### Why are the changes needed?
for testing parallelism

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44443 from zhengruifeng/ps_test_diff_ops_arithmetic_chain.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…mentation

### What changes were proposed in this pull request?

- Remove a bunch of Liquid directives that are not necessary.
- Add a table of contents to the built-in SQL functions page.
- Move the generated HTML for built-in SQL functions to a subdirectory.

### Why are the changes needed?

To reduce confusion for maintainers.

### Does this PR introduce _any_ user-facing change?

Yes. It adds a table of contents and changes the heading style of the examples.

Otherwise, the generated docs are identical.

### How was this patch tested?

I built Spark, ran `./sql/create-docs.sh`, and reviewed the generated docs in my browser.

The page is too long to screenshot completely, but here are a couple of screenshots.

<img width="500" alt="Screenshot 2023-12-20 at 7 06 36 PM" src="https://github.com/apache/spark/assets/1039369/b285d8a2-6eab-488d-9e28-2fdc9cc833a9">
<img width="500" alt="Screenshot 2023-12-20 at 7 06 48 PM" src="https://github.com/apache/spark/assets/1039369/2f9670bc-773a-48a8-a0d0-54206b8a4887">

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44393 from nchammas/sql-builtin-funcs-cruft.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… DSv2

### What changes were proposed in this pull request?

This PR adds initial support for Python data source write by implementing the DSv2 `SupportsWrite` interface for `PythonTableProvider`.

Note this PR only supports the `def write(self, iterator)` API. `commit` and `abort` will be supported in [SPARK-45914](https://issues.apache.org/jira/browse/SPARK-45914).

### Why are the changes needed?

To support Python data source APIs. For instance:

```python
class SimpleWriter(DataSourceWriter):
    def write(self, iterator: Iterator[Row]) -> WriterCommitMessage:
        for row in iterator:
            print(row)
        return WriterCommitMessage()

class SimpleDataSource(DataSource):
    def writer(self, schema, overwrite):
        return SimpleWriter()

# Register the Python data source
spark.dataSource.register(SimpleDataSource)

spark.range(10).write.format("SimpleDataSource").mode("append").save()
```

### Does this PR introduce _any_ user-facing change?

Yes, this PR supports writing data into a Python data source.

### How was this patch tested?

New unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43791 from allisonwang-db/spark-45525-data-source-write.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ate without grouping keys

### What changes were proposed in this pull request?

As aggregates with no grouping keys always return exactly one row (whose values can be NULL), an EXISTS over such a subquery should always return true.
This reverts some changes made when we migrated EXISTS/IN to the DecorrelateInnerQuery framework; in particular, the static detection of potential count-bug aggregates is removed (just having an empty grouping key now triggers the count-bug treatment; scalar subqueries still have extra checks that evaluate the aggregate on an empty input). I suspect the same correctness problem was present in the legacy framework (added one test to the legacy section of exists-count-bug.sql).
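
A concrete illustration of the intended semantics (a sketch via `spark.sql`; the view names are examples):

```scala
spark.sql("CREATE OR REPLACE TEMP VIEW t AS SELECT * FROM VALUES (1) AS v(a)")
spark.sql("CREATE OR REPLACE TEMP VIEW s AS SELECT * FROM VALUES (1) AS v(b) WHERE 1 = 0")
// COUNT(*) with no GROUP BY returns exactly one row (0) even over empty input,
// so the EXISTS predicate is always true and the outer row must be returned:
spark.sql("SELECT * FROM t WHERE EXISTS (SELECT COUNT(*) FROM s)").show()
```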

### Why are the changes needed?

This fixes a correctness bug: an EXISTS over an aggregate without grouping keys could return an incorrect result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44451 from agubichev/SPARK-46468_count.

Authored-by: Andrey Gubichev <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…s and ResolveInlineTablesSuite

### What changes were proposed in this pull request?
#44316 replaced current time/date prior to evaluating inline table expressions.
This PR proposes to simplify the code of `ResolveInlineTables` and to let `ResolveInlineTablesSuite` apply the `ResolveInlineTables` rule.

### Why are the changes needed?
Simplify the code for `ResolveInlineTables` and `ResolveInlineTablesSuite`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Test cases updated.
GA tests.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44447 from beliefer/SPARK-46380_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ialect

### What changes were proposed in this pull request?
This PR fixes a bug by letting the JDBC dialect decide the decimal precision and scale.

**How to reproduce the bug?**
#44397 proposed DS V2 pushdown of `PERCENTILE_CONT` and `PERCENTILE_DISC`.
The bug fires when pushing down the SQL below to the H2 JDBC data source.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`

**The root cause**
`getQueryOutputSchema` is used to get the output schema of a query by calling `JdbcUtils.getSchema`.
The query sent to the H2 database is shown below.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`
We get the following five values from the `ResultSetMetaData`:
```
columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS FIRST)"
dataType = 2
typeName = "NUMERIC"
fieldSize = 100000
fieldScale = 50000
```
Then we derive the Catalyst schema with `JdbcUtils.getCatalystType`, which actually calls `DecimalType.bounded(precision, scale)`.
`DecimalType.bounded(100000, 50000)` returns `DecimalType(38, 38)`.
Finally, `makeGetter` throws an exception:
```
Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)
```
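
For reference, a minimal sketch of the capping behaviour described above (mirroring the internal helper's logic rather than calling it):

```scala
// DecimalType.bounded caps precision and scale at 38 independently:
val (precision, scale) = (100000, 50000)              // as reported by H2
val capped = (math.min(precision, 38), math.min(scale, 38))
// capped == (38, 38): such a decimal has no digits left of the decimal point,
// so any value with integer digits exceeds the maximum precision and throws.
```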

### Why are the changes needed?
This PR fixes the bug that `JdbcUtils` can't derive the correct decimal type.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Fix a bug.

### How was this patch tested?
Manual tests in #44397

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44398 from beliefer/SPARK-46443.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…append/array_insert`

### What changes were proposed in this pull request?
This PR refines the docstrings of `array_prepend`/`array_append`/`array_insert` and adds some new examples.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44436 from LuciferYang/SPARK-46472.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request?

https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#spark-properties
https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties

As listed above, the doc content in 3.5.0 cannot scroll horizontally. When a table overflows, users can only see the rest of its content by zooming out as far as possible, which results in tiny, hard-to-read characters.

This PR changes the HTML body `overflow-x` from `hidden` to `auto` to enable the underlying table to scroll horizontally.

### Why are the changes needed?

Fix documentation

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

#### Before
![image](https://github.com/apache/spark/assets/8326978/437bee91-ab0d-4616-aaaf-f99171dcf9f9)

#### After
![image](https://github.com/apache/spark/assets/8326978/327ed82b-3e14-4a27-be1a-835a7b21c000)

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44423 from yaooqinn/SPARK-46464.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This PR adds a check: we only mark the cached partition as materialized if the task has neither failed nor been interrupted. It also adds a new method `isFailed` to `TaskContext`.
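
A minimal sketch of the guard (the surrounding cache-population code and the publication step are assumptions):

```scala
import org.apache.spark.TaskContext

val ctx = TaskContext.get()
// Only publish the cached partition if this task finished cleanly; a failed or
// interrupted task may leave a partially written buffer behind.
if (!ctx.isFailed() && !ctx.isInterrupted()) {
  // markPartitionMaterialized(partitionId)  // hypothetical publication step
}
```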

### Why are the changes needed?

Before this PR, when caching data, a task failure could cause an NPE in other tasks:

```
java.lang.NullPointerException
	at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.accessors1$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:155)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
```

### Does this PR introduce _any_ user-facing change?

yes, it's a bug fix

### How was this patch tested?

add test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44445 from ulysses-you/fix-cache.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?
This change adds the following conversions to the vectorized and non-vectorized Parquet readers corresponding to type promotions that are strictly widening without precision loss:
- Int -> Long
- Float -> Double
- Int -> Double
- Date -> TimestampNTZ (timestamp without time zone only, as a date carries no time zone information)
- Decimal with higher precision (already supported in non-vectorized reader)

### Why are the changes needed?
These type promotions support two similar use cases:
1. Reading a set of Parquet files with different types, e.g. a mix of Int and Long for a given column. If the read schema is Long, the reader should be able to read all files and promote Ints to Longs instead of failing.
2. Widening the type of a column in a table that already contains Parquet files, e.g. an Int column isn't large enough to accommodate IDs and is changed to Long. Existing Parquet files storing the value as Int should still be correctly read by upcasting values.

The second use case in particular will enable widening the type of columns or fields in existing Delta tables.

### Does this PR introduce _any_ user-facing change?
The following fails before this change:
```scala
import spark.implicits._

val path = "/tmp/parquet-widening-demo"  // example location
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()
```
With the Int->Long promotion in both the vectorized and non-vectorized Parquet readers, it succeeds and produces correct results, without overflow or loss of precision.
The same is true for Float->Double, Int->Double, Decimal with higher precision, and Date->TimestampNTZ.
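
For example, the analogous Date -> TimestampNTZ read (a sketch; the path is an example):

```scala
import spark.implicits._

val path = "/tmp/date-to-ntz"  // example location
Seq(java.sql.Date.valueOf("2020-01-01")).toDF("a").write.parquet(path)
// Reading the DATE column back with a wider TIMESTAMP_NTZ read schema succeeds:
spark.read.schema("a TIMESTAMP_NTZ").parquet(path).show()
```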

### How was this patch tested?
- Added `ParquetTypeWideningSuite` covering the promotions included in this change, in particular:
  - Non-dictionary encoded values / dictionary encoded values for each promotion
  - Timestamp rebase modes `LEGACY` and `CORRECTED` for Date -> TimestampNTZ
  - Promotions between decimal types with different physical storage: `INT32`, `INT64`, `BINARY`, `FIXED_LEN_BYTE_ARRAY`
  - Reading values written with Parquet V1 and V2 writers.
- Updated/Removed two tests that expect type promotion to fail.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44368 from johanl-db/SPARK-40876-parquet-type-promotion.

Authored-by: Johan Lasperas <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@pull pull bot added the ⤵️ pull label Dec 22, 2023
### What changes were proposed in this pull request?

In `V1Writes`, we try to avoid adding a Sort if the output ordering already satisfies the required ordering. However, the code is completely broken, with two issues:
- we put a `SortOrder` as the child of another `SortOrder` and compare them, which always returns false (see the sketch below).
- once we add a Project to do `empty2null`, we change the query output attribute IDs, so the sort order never matches.

It's not a big issue, as we still have QO rules to eliminate useless sorts, but #44429 exposes this problem because the way we optimize sorts is a bit different. For `V1Writes`, we should always avoid adding a sort, even if the number of ordering keys is smaller, so as not to change the user query.
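
A minimal sketch of the first mistake (assumed shapes, not the exact `V1Writes` code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, AttributeReference, SortOrder}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val required = SortOrder(a, Ascending)
// Nesting the required SortOrder inside another SortOrder means the comparison
// matches a SortOrder against a bare attribute, which is never semantically equal:
SortOrder(required, Ascending).child.semanticEquals(required.child)  // false
```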

### Why are the changes needed?

fix code mistakes.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44458 from cloud-fan/sort.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@pull pull bot merged commit cb2f47b into huangxiaopingRD:master Dec 23, 2023
pull bot pushed a commit that referenced this pull request Jul 21, 2025
…ingBuilder`

### What changes were proposed in this pull request?

This PR aims to implement `toString` via JEP-280 (Indify String Concatenation) instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?

Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java

- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+     Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
  public java.lang.String toString();
    Code:
       0: new           #39                 // class org/apache/commons/lang3/builder/ToStringBuilder
       3: dup
       4: aload_0
       5: getstatic     #41                 // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
       8: invokespecial #47                 // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
      11: ldc           #50                 // String appId
      13: aload_0
      14: getfield      #7                  // Field appId:Ljava/lang/String;
      17: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      20: ldc           #55                 // String execId
      22: aload_0
      23: getfield      #13                 // Field execId:Ljava/lang/String;
      26: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      29: ldc           #56                 // String blockIds
      31: aload_0
      32: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      35: invokestatic  #57                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      38: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      41: invokevirtual #61                 // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
      44: areturn
```

**AFTER**
```
  public java.lang.String toString();
    Code:
       0: aload_0
       1: getfield      #7                  // Field appId:Ljava/lang/String;
       4: aload_0
       5: getfield      #13                 // Field execId:Ljava/lang/String;
       8: aload_0
       9: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      12: invokestatic  #39                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      15: invokedynamic #43,  0             // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
      20: areturn
```

### Does this PR introduce _any_ user-facing change?

No. This is a `toString` implementation improvement.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>