Test py #3
Closed
…sting date to decimal

### What changes were proposed in this pull request?
This PR is a followup of apache#37389 which disables ANSI mode when testing a cast from date to decimal.

### Why are the changes needed?
To make the test pass. Currently it fails with ANSI mode on; see also https://github.com/apache/spark/runs/7701218236?check_suite_focus=true.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
I manually ran the test locally.

Closes apache#37426 from HyukjinKwon/SPARK-39963.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
… close after use

### What changes were proposed in this pull request?
This PR uses `try-with-resources` to ensure `DBIterator` is closed after use in `RemoteBlockPushResolver`, `YarnShuffleService` and `ExternalShuffleBlockResolver`, to avoid resource leakage.

### Why are the changes needed?
Avoid `DBIterator` resource leakage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

Closes apache#37420 from LuciferYang/close-dbiterator.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
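The change itself is in the Java shuffle code, but the idiom is the general "ensure close after use" pattern. As a rough, hypothetical Python analogue of the same idea (the `db` object and its `iterator()` method are purely illustrative stand-ins for LevelDB's `DBIterator`):

```python
from contextlib import closing


def count_entries(db):
    # closing() guarantees close() is called on the iterator even if the loop
    # raises, which is what the try-with-resources change achieves on the Java side.
    with closing(db.iterator()) as it:
        return sum(1 for _ in it)
```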
### What changes were proposed in this pull request?
This PR improves the comments about null tracking for `UnsafeRow`. The old comment on `UnsafeRow` contained the confusing text `[null bit set]`; in fact, that portion is a bit array, and its bits are not always null markers. Beyond that, the old text tells users very little, so the information needs to be made clearer.

### Why are the changes needed?
Improve the comments about null tracking for `UnsafeRow`.

### Does this PR introduce _any_ user-facing change?
'No'. Just updates comments.

### How was this patch tested?
N/A

Closes apache#37340 from beliefer/UnsafeRow_comment.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This PR updates `PhysicalOperation` to make it the same as `ScanOperation`, then removes `ScanOperation` and replaces all its usages with `PhysicalOperation`. It also adds a new pattern match, `NodeWithOnlyDeterministicProjectAndFilter`, and uses it in places where we only need to extract a relation and do not need to collect projects and filters.

### Why are the changes needed?
`PhysicalOperation` has known issues: it aggressively collapses projects and filters, which may lead to bad query plans. To fix this issue, we introduced `ScanOperation` and use it in a few critical places like scan strategies. To be conservative, we didn't replace `PhysicalOperation` with `ScanOperation` everywhere. However, `PhysicalOperation` has performance issues of its own when collapsing projects and merging expressions, if the query plan is very complicated. We should always follow the rule `CollapseProject` and stop merging expressions when doing so duplicates expensive expressions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes apache#37176 from cloud-fan/qo.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…set to 0
### What changes were proposed in this pull request?
This PR proposes to:
- Suppress error logs when the number of timers is set to 0.
- Add documentation describing its behaviour clearly (a usage sketch follows below):
  - 0: disable the feature
  - -1: no limit
  - otherwise: an explicit limit
- Throw an exception if the configuration is set to a number lower than -1.

This is an internal configuration, numbers lower than -1 don't make much sense, and the configuration is relatively little known, so I think it's pretty safe to do this.
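A minimal sketch (not part of this PR) of disabling the feature when building a session in PySpark; the configuration key is taken from this PR's description, and the application name is illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("listener-timing-disabled")  # illustrative name
    # 0 disables per-listener timing, -1 means no limit,
    # any other value >= 1 is an explicit limit.
    .config("spark.scheduler.listenerbus.metrics.maxListenerClassesTimed", "0")
    .getOrCreate()
)
```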
### Why are the changes needed?
To avoid noisy error logs, and document the feature properly.
### Does this PR introduce _any_ user-facing change?
Yes. When `spark.scheduler.listenerbus.metrics.maxListenerClassesTimed` is set to `0`, Spark no longer shows a warning such as:
```
LiveListenerBusMetrics: Not measuring processing time for listener class org.apache.spark.sql.util.ExecutionListenerBus because a maximum of 0 listener classes are already timed.
```
### How was this patch tested?
A unit test is added.
Closes apache#37432 from HyukjinKwon/SPARK-39973.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
`CatalogImpl` has been updated quite a bit recently, to support v2 catalogs. This PR revisits the recent changes and refines the code a little bit:
1. fix the naming "3 layer namespace": the Spark catalog plugin supports n-part namespaces, so this PR changes the wording to `qualified name with catalog`.
2. always use the v2 code path. Today the v2 code path already covers all the functionality of `CatalogImpl`, so it's unnecessary to keep the v1 code path in `CatalogImpl`. It also makes sure the behavior is consistent between `db.table` and `spark_catalog.db.table`. Previously it was not consistent in some cases; see the updated tests for functions and the sketch after this list.
3. simplify `try {v1 code path} catch {... v2 code path}` to `val name = if (table exists in HMS) {name qualified with spark_catalog} else {parsed name}; v2 code path`
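A hedged sketch of the consistency described in point 2, run in a PySpark shell (where `spark` is predefined); the table name is illustrative:

```python
spark.sql("CREATE TABLE IF NOT EXISTS default.people (name STRING) USING parquet")

# With the unified v2 code path, the unqualified and catalog-qualified
# spellings should agree.
assert spark.catalog.tableExists("default.people")
assert spark.catalog.tableExists("spark_catalog.default.people")
```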
### Why are the changes needed?
code cleanup.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes apache#37287 from cloud-fan/catalog.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This PR aims to skip the PVC cleanup logic when `spark.kubernetes.driver.ownPersistentVolumeClaim=false`.

### Why are the changes needed?
To simplify the Spark termination log by removing an unnecessary log line containing an exception message when Spark jobs have no PVC permission and, at the same time, `spark.kubernetes.driver.ownPersistentVolumeClaim` is `false`.

### Does this PR introduce _any_ user-facing change?
Only in the termination logs of Spark jobs that have no PVC permission.

### How was this patch tested?
Manually.

Closes apache#37433 from dongjoon-hyun/SPARK-39965.

Lead-authored-by: Dongjoon Hyun <[email protected]>
Co-authored-by: pralabhkumar <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
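A hedged sketch of the configuration this entry applies to, set while building the session (on a real Kubernetes deployment it would typically be passed via `spark-submit --conf`); only the configuration key comes from the entry above:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # When the driver does not own PVCs, this change also skips the PVC
    # cleanup step and its noisy exception log on missing PVC permission.
    .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "false")
    .getOrCreate()
)
```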
…hResolver`

### What changes were proposed in this pull request?
This PR just removes a redundant `LevelDB.get(key)` in `RemoteBlockPushResolver`.

### Why are the changes needed?
There is no need to check `db.get(key) != null` before `db.delete(key)`; LevelDB handles this case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions.

Closes apache#37435 from LuciferYang/remove-redundant-get.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Expose the function `median` in `functions`.

### Why are the changes needed?
To support `median` in PySpark.

### Does this PR introduce _any_ user-facing change?
Yes, new API.

### How was this patch tested?
Doctest.

Closes apache#37434 from zhengruifeng/py_add_median.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
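A minimal usage sketch for the `median` function exposed above, assuming a PySpark shell (where `spark` is predefined) on a Spark version where `pyspark.sql.functions.median` is available; data and column names are illustrative:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 9), ("b", 4), ("b", 6)], ["key", "value"]
)
# Median of `value` within each key: a -> 2.0, b -> 5.0.
df.groupBy("key").agg(F.median("value").alias("median_value")).show()
```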
### What changes were proposed in this pull request?
Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using ntile.

### Why are the changes needed?
In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268.

#### Example
Assume this data:
```
create table t1 stored as parquet as select * from range(101);
```
Also assume this query:
```
select id, ntile(10) over (order by id) as nt from t1 limit 10;
```
With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |2  |
|2  |3  |
|3  |4  |
|4  |5  |
|5  |6  |
|6  |7  |
|7  |8  |
|8  |9  |
|9  |10 |
+---+---+
```
With Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ ntile.

Spark 3.1.3:
```
+---+---+
|id |nt |
+---+---+
|0  |1  |
|1  |1  |
|2  |1  |
|3  |1  |
|4  |1  |
|5  |1  |
|6  |1  |
|7  |1  |
|8  |1  |
|9  |1  |
+---+---+
```
Hive 2.3.9:
```
+-----+-----+
| id  | nt  |
+-----+-----+
| 0   | 1   |
| 1   | 1   |
| 2   | 1   |
| 3   | 1   |
| 4   | 1   |
| 5   | 1   |
| 6   | 1   |
| 7   | 1   |
| 8   | 1   |
| 9   | 1   |
+-----+-----+
10 rows selected (1.72 seconds)
```
Prestodb 0.268:
```
 id | nt
----+----
  0 |  1
  1 |  1
  2 |  1
  3 |  1
  4 |  1
  5 |  1
  6 |  1
  7 |  1
  8 |  1
  9 |  1
(10 rows)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Two new unit tests.

Closes apache#37443 from bersprockets/pushdown_ntile.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ained

### What changes were proposed in this pull request?
This PR proposes to improve the examples in `pyspark.sql.group` by making each example self-contained, with a brief explanation and a slightly more realistic example.

### Why are the changes needed?
To make the documentation more readable and allow it to be copied and pasted directly into the PySpark shell.

### Does this PR introduce _any_ user-facing change?
Yes, it changes the documentation.

### How was this patch tested?
Manually ran each doctest.

Closes apache#37437 from HyukjinKwon/SPARK-40006.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
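A hedged sketch of the self-contained style this entry describes: the snippet builds its own data so it can be pasted directly into a PySpark shell (where `spark` is predefined); the values are illustrative:

```python
df = spark.createDataFrame(
    [("Alice", 2, 80), ("Bob", 5, 85), ("Alice", 7, 90)],
    ["name", "age", "score"],
)
# Average score per name, with the expected output shown inline.
df.groupBy("name").avg("score").sort("name").show()
# +-----+----------+
# | name|avg(score)|
# +-----+----------+
# |Alice|      85.0|
# |  Bob|      85.0|
# +-----+----------+
```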
### What changes were proposed in this pull request?
Expose `mode` in `functions`.

### Why are the changes needed?
To support `mode` in PySpark.

### Does this PR introduce _any_ user-facing change?
Yes, new API.

### How was this patch tested?
Added doctest.

Closes apache#37438 from zhengruifeng/py_add_mode.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
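A minimal usage sketch for the `mode` function exposed above, assuming a PySpark shell (where `spark` is predefined) on a Spark version where `pyspark.sql.functions.mode` is available; data and column names are illustrative:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)], ["key", "value"]
)
# Most frequent `value` within each key: a -> 1, b -> 3.
df.groupBy("key").agg(F.mode("value").alias("mode_value")).show()
```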
…aging (Sort with expressions)

### What changes were proposed in this pull request?
Currently, DS V2 aggregate push-down cannot work with DS V2 Top N push-down (`ORDER BY col LIMIT m`) or DS V2 Paging push-down (`ORDER BY col LIMIT m OFFSET n`). If we can push down the aggregate together with Top N or Paging, performance will be better. This PR only lets the aggregate be pushed down when the ORDER BY expressions are also GROUP BY expressions.

The ideas of this PR are:
1. When setting the expected outputs of `ScanBuilderHolder`, hold a map from the expected outputs to the original expressions (which contain the original columns).
2. When we try to push down Top N or Paging, restore the original expressions for `SortOrder`.

### Why are the changes needed?
Let DS V2 aggregate push-down work with Top N or Paging (sort with group expressions), so that users can get better performance.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New test cases.

Closes apache#37320 from beliefer/SPARK-39819_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
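A hedged illustration of the query shape this entry enables: the ORDER BY column is also a GROUP BY column, so a supporting DS V2 source can receive the aggregate together with the Top N; the catalog and table names are illustrative:

```python
spark.sql("""
    SELECT dept, MAX(salary) AS max_salary
    FROM h2.test.employee      -- illustrative DS V2 (JDBC) table
    GROUP BY dept
    ORDER BY dept              -- ORDER BY on a GROUP BY column
    LIMIT 3                    -- Top N, pushed down together with the aggregate
""").show()
```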
### What changes were proposed in this pull request?
In the PR, I propose:
1. to support casting of the integral types `TINYINT`, `SMALLINT`, `INT` and `BIGINT` to ANSI interval types, and
2. to remove the restriction of casting only to single-unit intervals.

This PR is a complement to apache#36811.

### Why are the changes needed?
To conform to the SQL standard, which allows such casting:
<img width="801" alt="173228149-17e1fbaa-c095-4eb7-bb3b-81a3f9c91928" src="https://user-images.githubusercontent.com/1580697/183467069-73b2b4e4-6c34-4395-beb8-f1d721483f43.png">

### Does this PR introduce _any_ user-facing change?
No, it extends existing behavior.

### How was this patch tested?
By running new tests:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z cast.sql"
```

Closes apache#37442 from MaxGekk/cast-integral-to-interval.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
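A hedged example of the kind of cast this entry allows, run in a PySpark shell (where `spark` is predefined); the values are illustrative, and the interpretation of the integral value in the multi-unit case is an assumption noted in the comment:

```python
# Assumption: the integral value maps to the interval's least significant unit,
# so 5 -> 5 hours and 14 -> 1 year 2 months (14 months).
spark.sql(
    "SELECT CAST(5 AS INTERVAL HOUR) AS h, "
    "CAST(14 AS INTERVAL YEAR TO MONTH) AS ym"
).show()
```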
### What changes were proposed in this pull request?
This PR aims to upgrade to Hadoop 3.3.4, which was just announced today.

### Why are the changes needed?
Hadoop 3.3.4 comes with many bug fixes as well as CVE fixes. Please check [release notes](http://hadoop.apache.org/docs/r3.3.4/hadoop-project-dist/hadoop-common/release/3.3.4/RELEASENOTES.3.3.4.html) and [change log](http://hadoop.apache.org/docs/r3.3.4/hadoop-project-dist/hadoop-common/release/3.3.4/CHANGELOG.3.3.4.html).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes apache#37281 from sunchao/SPARK-39863/hadoop-3.3.4.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR aims to update the dependency manifest for Hadoop 3. [HADOOP-18344](https://issues.apache.org/jira/browse/HADOOP-18344) changed the AWS SDK in Apache Hadoop 3.3.4 RC1.

### Why are the changes needed?
apache#37281 missed this inconsistency.

### Does this PR introduce _any_ user-facing change?
No. This will recover the dependency check CI job.

### How was this patch tested?
Pass the CI on this job.

Closes apache#37447 from dongjoon-hyun/SPARK-39863.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…tained

- add spark
- define config master and appName for sparkSession
- SparkSession initialized in test method
- set spark in globs (see the sketch below)
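A hedged sketch of the setup the items above describe, following the common PySpark doctest pattern: give the test SparkSession an explicit `master` and `appName`, create it inside the test method, and expose it to the doctests through `globs` (the module and names here are illustrative):

```python
import doctest

from pyspark.sql import SparkSession
import pyspark.sql.group


def _test():
    # Make the module's doctests runnable: put a live SparkSession into globs.
    globs = pyspark.sql.group.__dict__.copy()
    spark = (
        SparkSession.builder
        .master("local[4]")
        .appName("sql.group tests")
        .getOrCreate()
    )
    globs["spark"] = spark
    failure_count, test_count = doctest.testmod(
        pyspark.sql.group, globs=globs, optionflags=doctest.ELLIPSIS
    )
    spark.stop()
    if failure_count:
        raise RuntimeError(f"{failure_count} of {test_count} doctests failed")


if __name__ == "__main__":
    _test()
```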
dcoliversun pushed a commit that referenced this pull request on Dec 30, 2022
…ly equivalent children in `RewriteDistinctAggregates`

### What changes were proposed in this pull request?
In `RewriteDistinctAggregates`, when grouping aggregate expressions by function children, treat children that are semantically equivalent as the same.

### Why are the changes needed?
This PR will reduce the number of projections in the Expand operator when there are multiple distinct aggregations with superficially different children. In some cases, it will eliminate the need for an Expand operator.

Example: In the following query, the Expand operator creates 3\*n rows (where n is the number of incoming rows) because it has a projection for each of function children `b + 1`, `1 + b` and `c`.
```
create or replace temp view v1 as
select * from values
  (1, 2, 3.0),
  (1, 3, 4.0),
  (2, 4, 2.5),
  (2, 3, 1.0)
v1(a, b, c);

select
  a,
  count(distinct b + 1),
  avg(distinct 1 + b) filter (where c > 0),
  sum(c)
from v1
group by a;
```
The Expand operator has three projections (each producing a row for each incoming row):
```
[a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for regular aggregation)
[a#87, (b#88 + 1), null, 1, null, null],          <== projection #2 (for distinct aggregation of b + 1)
[a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for distinct aggregation of 1 + b)
```
In reality, the Expand only needs one projection for `1 + b` and `b + 1`, because they are semantically equivalent.

With the proposed change, the Expand operator's projections look like this:
```
[a#67, null, 0, null, UnscaledValue(c#69)], <== projection #1 (for regular aggregations)
[a#67, (b#68 + 1), 1, (c#69 > 0.0), null]], <== projection #2 (for distinct aggregation on b + 1 and 1 + b)
```
With one less projection, Expand produces 2\*n rows instead of 3\*n rows, but still produces the correct result.

In the case where all distinct aggregates have semantically equivalent children, the Expand operator is not needed at all.

Benchmark code in the JIRA (SPARK-40382).

Before the PR:
```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                       14721          14859         195          5.7         175.5       1.0X
some semantically equivalent                      14569          14572           5          5.8         173.7       1.0X
none semantically equivalent                      14408          14488         113          5.8         171.8       1.0X
```
After the PR:
```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                        3658           3692          49         22.9          43.6       1.0X
some semantically equivalent                       9124           9214         127          9.2         108.8       0.4X
none semantically equivalent                      14601          14777         250          5.7         174.1       0.3X
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New unit tests.

Closes apache#37825 from bersprockets/rewritedistinct_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
dcoliversun pushed a commit that referenced this pull request on May 5, 2023
…edExpression()

### What changes were proposed in this pull request?
In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`.

### Why are the changes needed?
This fixes a regression caused by apache#39010, which added the `supportedExpression()` guard to `addExprTree()` and `getExprState()` but not `addExpr()`.

One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions to go out of sync, eventually resulting in a query execution error (or whole-stage codegen error).

### Does this PR introduce _any_ user-facing change?
This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example, running the SQL command:
```sql
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
```
example error message before the fix:
```
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]
```
After the fix this error is gone.

### How was this patch tested?
Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of a user-visible symptom.

Closes apache#40473 from rednaxelafx/spark-42851.

Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>