[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode #30260

gengliangwang · 2020-11-05T07:56:38Z

What changes were proposed in this pull request?

In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types.

Comparing the ANSI CAST syntax rules with the current default behavior of Spark:

To make Spark's ANSI mode more ANSI SQL Compatible，I propose to disallow the following casting in ANSI mode:

TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct

The following castings are considered invalid in ANSI SQL standard, but they are quite straight forward. Let's Allow them for now

Numeric <=> Boolean
String <=> Binary

Why are the changes needed?

Better ANSI SQL compliance

Does this PR introduce any user-facing change?

Yes, the following castings will not be allowed in ANSI mode:

TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct

How was this patch tested?

Unit test

The ANSI Compliance doc preview:

SparkQA · 2020-11-05T08:05:01Z

Test build #130642 has finished for PR 30260 at commit fe0fe48.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-05T08:42:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35252/

SparkQA · 2020-11-05T09:11:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35252/

SparkQA · 2020-11-05T14:25:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35261/

SparkQA · 2020-11-05T14:45:39Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35261/

SparkQA · 2020-11-05T18:06:03Z

Test build #130650 has finished for PR 30260 at commit d514f86.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-11-06T06:43:31Z

Thanks for the request, @gengliangwang ! I'll review this later.

SparkQA · 2020-11-09T16:39:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35405/

SparkQA · 2020-11-09T17:02:43Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35405/

SparkQA · 2020-11-09T20:47:42Z

Test build #130796 has finished for PR 30260 at commit 18b49bf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

Also, I think we need to update the migration guide, too.

maropu · 2020-11-10T01:57:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala


  override protected val ansiEnabled: Boolean = SQLConf.get.ansiEnabled
+
+  override def canCast(from: DataType, to: DataType): Boolean = if (ansiEnabled) {


How about describing this new behaviour in the usage above of ExpressionDescription?

well, then we need to mention about the behavior of throwing overflow exceptions when ANSI flag enabled. I will add some content in the sql-ref-ansi-compliance.md

maropu · 2020-11-10T02:02:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

+  override def canCast(from: DataType, to: DataType): Boolean = AnsiCast.canCast(from, to)
+}
+
+object AnsiCast {


Could you leave some comments to summarize the current behaivour of the ANSI explicit cast as described in the PR description (references section 6.13 of the ANSI SQL standard and differences from the standard, e.g., Numeric <=> Boolean) ?

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala

maropu · 2020-11-10T02:12:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

+    case (_: NumericType, _: NumericType) => true
+    case (StringType, _: NumericType) => true
+    case (BooleanType, _: NumericType) => true


(Just a suggestion) For readability, could you reorder these entries according to Cast.canCast where possible? For example, the numeric entries for Cast seems to be placed just before complex types:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

Lines 70 to 74 in 35ac314

case (StringType, _: NumericType) => true

case (BooleanType, _: NumericType) => true

case (DateType, _: NumericType) => true

case (TimestampType, _: NumericType) => true

case (_: NumericType, _: NumericType) => true

maropu · 2020-11-10T02:18:32Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala

    checkEvaluation(cast("abcd", DecimalType(38, 1)), null)
  }
+
+  test("SPARK-22825 Cast array to string") {


(nit and this is not related to this PR though...) SPARK-22825 -> SPARK-22825:

I would prefer keeping them unchanged in this PR

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala

SparkQA · 2020-11-11T14:23:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35543/

gengliangwang · 2020-11-11T14:39:30Z

@maropu thanks for the suggestions. I have updated the code. Please take another look.

SparkQA · 2020-11-11T14:46:19Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35543/

SparkQA · 2020-11-11T15:19:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35546/

SparkQA · 2020-11-11T15:40:40Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35546/

SparkQA · 2020-11-18T14:43:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35888/

SparkQA · 2020-11-18T18:20:35Z

Test build #131283 has finished for PR 30260 at commit 6003bef.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class AnsiCastSuiteWithAnsiModeOn extends AnsiCastSuiteBase
class AnsiCastSuiteWithAnsiModeOff extends AnsiCastSuiteBase

gengliangwang · 2020-11-18T18:31:15Z

retest this please.

SparkQA · 2020-11-18T19:15:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35902/

SparkQA · 2020-11-18T19:46:21Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35902/

SparkQA · 2020-11-18T23:34:05Z

Test build #131300 has finished for PR 30260 at commit 6003bef.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class AnsiCastSuiteWithAnsiModeOn extends AnsiCastSuiteBase
class AnsiCastSuiteWithAnsiModeOff extends AnsiCastSuiteBase

maropu · 2020-11-19T00:23:58Z

Thanks! Merged to master.

dongjoon-hyun · 2020-11-19T00:26:59Z

Thank you, @gengliangwang and all!

revans2 · 2020-11-19T18:00:02Z

I just noticed that the error messages when trying to cast between timestamps and numeric types is wrong now in ANSI mode.

org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`c0` AS TIMESTAMP)' due to data type mismatch: cannot cast int to timestamp,you can enable the casting by setting spark.sql.legacy.allowCastNumericToTimestamp to true,but we strongly recommend using function TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS instead.;;

spark.sql.legacy.allowCastNumericToTimestamp has no impact on ANSI casts after this change. It looks like a fairly simple fix.

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

Line 276 in 3695e99

    
           if (child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType]) {

gengliangwang · 2020-11-19T18:13:07Z

@revans2 Thanks for reporting the problem. Actually, I am working on another PR to give suggestions for the disallowed ANSI cast. It will be sent out today and I will ping you in the PR.

revans2 · 2020-11-19T18:27:19Z

@gengliangwang sounds good. Thanks

…ericToTimestamp ### What changes were proposed in this pull request? Remove SQL configuration spark.sql.legacy.allowCastNumericToTimestamp ### Why are the changes needed? In the current master branch, there is a new configuration `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether to cast Numeric types to Timestamp or not. The default value is true. After #30260, the type conversion between Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need to a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for disallowing the conversion. Users just need to set `spark.sql.ansi.enabled` for the behavior. As the configuration is not in any released yet, we should remove the configuration to make things simpler. ### Does this PR introduce _any_ user-facing change? No, since the configuration is not released yet. ### How was this patch tested? Existing test cases Closes #30493 from gengliangwang/LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

### What changes were proposed in this pull request? After #30260, there are some type conversions disallowed under ANSI mode. We should tell users what they can do if they have to use the disallowed casting. ### Why are the changes needed? Make it more user-friendly. ### Does this PR introduce _any_ user-facing change? Yes, the error message is improved on casting failure when ANSI mode is enabled ### How was this patch tested? Unit tests. Closes #30440 from gengliangwang/improveAnsiCastErrorMSG. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>

### What changes were proposed in this pull request? After apache/spark#30260, there are some type conversions disallowed under ANSI mode. We should tell users what they can do if they have to use the disallowed casting. ### Why are the changes needed? Make it more user-friendly. ### Does this PR introduce _any_ user-facing change? Yes, the error message is improved on casting failure when ANSI mode is enabled ### How was this patch tested? Unit tests. Closes #30440 from gengliangwang/improveAnsiCastErrorMSG. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>

…ce casting document ### What changes were proposed in this pull request? This is a follow-up of #30260 It shortens the table width of ANSI compliance casting document. ### Why are the changes needed? The table is too wide and the UI of doc site is broken if we scroll the page to right side. ![Screen Shot 2021-01-14 at 3 04 57 PM](https://user-images.githubusercontent.com/1097932/104565897-d2693b80-5601-11eb-9f93-5f603cfc94c1.png) ### Does this PR introduce _any_ user-facing change? Minor document change ### How was this patch tested? Build doc site locally and preview: ![Screen Shot 2021-01-14 at 4 44 30 PM](https://user-images.githubusercontent.com/1097932/104565814-b2d21300-5601-11eb-94c4-78c785cda8ed.png) Closes #31180 from gengliangwang/reviseAnsiDocStyle. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit feedd1b) Signed-off-by: Dongjoon Hyun <[email protected]>

…ce casting document ### What changes were proposed in this pull request? This is a follow-up of #30260 It shortens the table width of ANSI compliance casting document. ### Why are the changes needed? The table is too wide and the UI of doc site is broken if we scroll the page to right side. ![Screen Shot 2021-01-14 at 3 04 57 PM](https://user-images.githubusercontent.com/1097932/104565897-d2693b80-5601-11eb-9f93-5f603cfc94c1.png) ### Does this PR introduce _any_ user-facing change? Minor document change ### How was this patch tested? Build doc site locally and preview: ![Screen Shot 2021-01-14 at 4 44 30 PM](https://user-images.githubusercontent.com/1097932/104565814-b2d21300-5601-11eb-94c4-78c785cda8ed.png) Closes #31180 from gengliangwang/reviseAnsiDocStyle. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

…ce casting document ### What changes were proposed in this pull request? This is a follow-up of apache#30260 It shortens the table width of ANSI compliance casting document. ### Why are the changes needed? The table is too wide and the UI of doc site is broken if we scroll the page to right side. ![Screen Shot 2021-01-14 at 3 04 57 PM](https://user-images.githubusercontent.com/1097932/104565897-d2693b80-5601-11eb-9f93-5f603cfc94c1.png) ### Does this PR introduce _any_ user-facing change? Minor document change ### How was this patch tested? Build doc site locally and preview: ![Screen Shot 2021-01-14 at 4 44 30 PM](https://user-images.githubusercontent.com/1097932/104565814-b2d21300-5601-11eb-94c4-78c785cda8ed.png) Closes apache#31180 from gengliangwang/reviseAnsiDocStyle. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

…ailed in ansi mode ### What changes were proposed in this pull request? `Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](#30260), so the content of `exception.getMessage` is different from that of non-ans mode. This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GA - Local Test **Before** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 23 seconds, 537 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` - Test that might_contain errors out non-constant Bloom filter *** FAILED *** "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch: cannot cast bigint to binary with ANSI mode on. If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false. ; line 2 pos 21; 'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)] +- SubqueryAlias t +- LocalRelation [a#2424L] " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171) ``` **After** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 26 seconds, 544 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 25 seconds, 289 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #35953 from LuciferYang/SPARK-32268-FOLLOWUP. Authored-by: yangjie01 <[email protected]> Signed-off-by: Yuming Wang <[email protected]>

…ailed in ansi mode ### What changes were proposed in this pull request? `Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](#30260), so the content of `exception.getMessage` is different from that of non-ans mode. This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GA - Local Test **Before** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 23 seconds, 537 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` - Test that might_contain errors out non-constant Bloom filter *** FAILED *** "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch: cannot cast bigint to binary with ANSI mode on. If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false. ; line 2 pos 21; 'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)] +- SubqueryAlias t +- LocalRelation [a#2424L] " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171) ``` **After** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 26 seconds, 544 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 25 seconds, 289 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #35953 from LuciferYang/SPARK-32268-FOLLOWUP. Authored-by: yangjie01 <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 7165123) Signed-off-by: Yuming Wang <[email protected]>

* [SPARK-32268][SQL] Row-level Runtime Filtering This PR proposes row-level runtime filters in Spark to reduce intermediate data volume for operators like shuffle, join and aggregate, and hence improve performance. We propose two mechanisms to do this: semi-join filters or bloom filters, and both mechanisms are proposed to co-exist side-by-side behind feature configs. [Design Doc](https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit?usp=sharing) with more details. With Semi-Join, we see 9 queries improve for the TPC DS 3TB benchmark, and no regressions. With Bloom Filter, we see 10 queries improve for the TPC DS 3TB benchmark, and no regressions. No Added tests Closes apache#35789 from somani/rf. Lead-authored-by: Abhishek Somani <[email protected]> Co-authored-by: Abhishek Somani <[email protected]> Co-authored-by: Yuming Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 1f4e4c8) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-32268][TESTS][FOLLOWUP] Fix `BloomFilterAggregateQuerySuite` failed in ansi mode `Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](apache#30260), so the content of `exception.getMessage` is different from that of non-ans mode. This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent. Bug fix. No - Pass GA - Local Test **Before** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 23 seconds, 537 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` - Test that might_contain errors out non-constant Bloom filter *** FAILED *** "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch: cannot cast bigint to binary with ANSI mode on. If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false. ; line 2 pos 21; 'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)] +- SubqueryAlias t +- LocalRelation [a#2424L] " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171) ``` **After** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 26 seconds, 544 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 25 seconds, 289 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes apache#35953 from LuciferYang/SPARK-32268-FOLLOWUP. Authored-by: yangjie01 <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 7165123) Signed-off-by: Yuming Wang <[email protected]> * [SPARK-32268][SQL][FOLLOWUP] Add RewritePredicateSubquery below the InjectRuntimeFilter Add `RewritePredicateSubquery` below the `InjectRuntimeFilter` in `SparkOptimizer`. It seems if the runtime use in-subquery to do the filter, it won't be converted to semi-join as the design said. This pr fixes the issue. No, not released Improve the test by adding: ensure the semi-join exists if the runtime filter use in-subquery code path. Closes apache#35998 from ulysses-you/SPARK-32268-FOllOWUP. Authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit c0c52dd) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-32268][SQL][FOLLOWUP] Add ColumnPruning in injectBloomFilter Add `ColumnPruning` in `InjectRuntimeFilter.injectBloomFilter` to optimize the BoomFilter creation query. It seems BloomFilter subqueries injected by `InjectRuntimeFilter` will read as many columns as filterCreationSidePlan. This does not match "Only scan the required columns" as the design said. We can check this by a simple case in `InjectRuntimeFilterSuite`: ```scala withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_ENABLED.key -> "true", SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000", SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") { val query = "select * from bf1 join bf2 on bf1.c1 = bf2.c2 where bf2.a2 = 62" sql(query).explain() } ``` The reason is subqueries have not been optimized by `ColumnPruning`, and this pr will fix it. No, not released Improve the test by adding `columnPruningTakesEffect` to check the optimizedPlan of bloom filter join. Closes apache#36047 from Flyangz/SPARK-32268-FOllOWUP. Authored-by: Yang Liu <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit c98725a) Signed-off-by: Yuming Wang <[email protected]> * [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession This PR proposes: 1. Use the function registry in the Spark Session being used 2. Move function registration into `beforeAll` Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1. No, test-only. Unittests fixed. Closes apache#36576 from HyukjinKwon/SPARK-32268-followup. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit c5351f8) Signed-off-by: Hyukjin Kwon <[email protected]>

* [SPARK-39857][SQL] V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate (#535) ### What changes were proposed in this pull request? When building V2 `In` Predicate in `V2ExpressionBuilder`, `InSet.dataType` (which is `BooleanType`) is used to build the `LiteralValue`, `InSet.child.dataType` should be used instead. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes apache#37271 from huaxingao/inset. Authored-by: huaxingao <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> Co-authored-by: huaxingao <[email protected]> * [SPARK-32268][SQL] Row-level Runtime Filtering * [SPARK-32268][SQL] Row-level Runtime Filtering This PR proposes row-level runtime filters in Spark to reduce intermediate data volume for operators like shuffle, join and aggregate, and hence improve performance. We propose two mechanisms to do this: semi-join filters or bloom filters, and both mechanisms are proposed to co-exist side-by-side behind feature configs. [Design Doc](https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit?usp=sharing) with more details. With Semi-Join, we see 9 queries improve for the TPC DS 3TB benchmark, and no regressions. With Bloom Filter, we see 10 queries improve for the TPC DS 3TB benchmark, and no regressions. No Added tests Closes apache#35789 from somani/rf. Lead-authored-by: Abhishek Somani <[email protected]> Co-authored-by: Abhishek Somani <[email protected]> Co-authored-by: Yuming Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 1f4e4c8) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-32268][TESTS][FOLLOWUP] Fix `BloomFilterAggregateQuerySuite` failed in ansi mode `Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](apache#30260), so the content of `exception.getMessage` is different from that of non-ans mode. This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent. Bug fix. No - Pass GA - Local Test **Before** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 23 seconds, 537 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` - Test that might_contain errors out non-constant Bloom filter *** FAILED *** "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch: cannot cast bigint to binary with ANSI mode on. If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false. ; line 2 pos 21; 'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)] +- SubqueryAlias t +- LocalRelation [a#2424L] " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171) ``` **After** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 26 seconds, 544 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 25 seconds, 289 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes apache#35953 from LuciferYang/SPARK-32268-FOLLOWUP. Authored-by: yangjie01 <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 7165123) Signed-off-by: Yuming Wang <[email protected]> * [SPARK-32268][SQL][FOLLOWUP] Add RewritePredicateSubquery below the InjectRuntimeFilter Add `RewritePredicateSubquery` below the `InjectRuntimeFilter` in `SparkOptimizer`. It seems if the runtime use in-subquery to do the filter, it won't be converted to semi-join as the design said. This pr fixes the issue. No, not released Improve the test by adding: ensure the semi-join exists if the runtime filter use in-subquery code path. Closes apache#35998 from ulysses-you/SPARK-32268-FOllOWUP. Authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit c0c52dd) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-32268][SQL][FOLLOWUP] Add ColumnPruning in injectBloomFilter Add `ColumnPruning` in `InjectRuntimeFilter.injectBloomFilter` to optimize the BoomFilter creation query. It seems BloomFilter subqueries injected by `InjectRuntimeFilter` will read as many columns as filterCreationSidePlan. This does not match "Only scan the required columns" as the design said. We can check this by a simple case in `InjectRuntimeFilterSuite`: ```scala withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_ENABLED.key -> "true", SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000", SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") { val query = "select * from bf1 join bf2 on bf1.c1 = bf2.c2 where bf2.a2 = 62" sql(query).explain() } ``` The reason is subqueries have not been optimized by `ColumnPruning`, and this pr will fix it. No, not released Improve the test by adding `columnPruningTakesEffect` to check the optimizedPlan of bloom filter join. Closes apache#36047 from Flyangz/SPARK-32268-FOllOWUP. Authored-by: Yang Liu <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit c98725a) Signed-off-by: Yuming Wang <[email protected]> * [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession This PR proposes: 1. Use the function registry in the Spark Session being used 2. Move function registration into `beforeAll` Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1. No, test-only. Unittests fixed. Closes apache#36576 from HyukjinKwon/SPARK-32268-followup. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit c5351f8) Signed-off-by: Hyukjin Kwon <[email protected]> * KE-29673 add segment prune function for bloom runtime filter fix min/max for UTF8String collection valid the runtime filter if need when broadcast join is valid * AL-6084 in Cast for method of canCast, when DecimalType cast to DoubleType add transformable logic (#542) * AL-6084 in Cast for method of canCast, when DecimalType cast DecimalType to DoubleType add suit logical Signed-off-by: Dongjoon Hyun <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> Co-authored-by: Zhixiong Chen <[email protected]> Co-authored-by: huaxingao <[email protected]> Co-authored-by: Bowen Song <[email protected]>

* [SPARK-32268][SQL] Row-level Runtime Filtering This PR proposes row-level runtime filters in Spark to reduce intermediate data volume for operators like shuffle, join and aggregate, and hence improve performance. We propose two mechanisms to do this: semi-join filters or bloom filters, and both mechanisms are proposed to co-exist side-by-side behind feature configs. [Design Doc](https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit?usp=sharing) with more details. With Semi-Join, we see 9 queries improve for the TPC DS 3TB benchmark, and no regressions. With Bloom Filter, we see 10 queries improve for the TPC DS 3TB benchmark, and no regressions. No Added tests Closes apache#35789 from somani/rf. Lead-authored-by: Abhishek Somani <[email protected]> Co-authored-by: Abhishek Somani <[email protected]> Co-authored-by: Yuming Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 1f4e4c8) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-32268][TESTS][FOLLOWUP] Fix `BloomFilterAggregateQuerySuite` failed in ansi mode `Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](apache#30260), so the content of `exception.getMessage` is different from that of non-ans mode. This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent. Bug fix. No - Pass GA - Local Test **Before** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 23 seconds, 537 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` - Test that might_contain errors out non-constant Bloom filter *** FAILED *** "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch: cannot cast bigint to binary with ANSI mode on. If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false. ; line 2 pos 21; 'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)] +- SubqueryAlias t +- LocalRelation [a#2424L] " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171) ``` **After** ``` export SPARK_ANSI_SQL_MODE=false mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 26 seconds, 544 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ``` export SPARK_ANSI_SQL_MODE=true mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite ``` ``` Run completed in 25 seconds, 289 milliseconds. Total number of tests run: 8 Suites: completed 2, aborted 0 Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes apache#35953 from LuciferYang/SPARK-32268-FOLLOWUP. Authored-by: yangjie01 <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 7165123) Signed-off-by: Yuming Wang <[email protected]> * [SPARK-32268][SQL][FOLLOWUP] Add RewritePredicateSubquery below the InjectRuntimeFilter Add `RewritePredicateSubquery` below the `InjectRuntimeFilter` in `SparkOptimizer`. It seems if the runtime use in-subquery to do the filter, it won't be converted to semi-join as the design said. This pr fixes the issue. No, not released Improve the test by adding: ensure the semi-join exists if the runtime filter use in-subquery code path. Closes apache#35998 from ulysses-you/SPARK-32268-FOllOWUP. Authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit c0c52dd) Signed-off-by: Wenchen Fan <[email protected]> * [SPARK-32268][SQL][FOLLOWUP] Add ColumnPruning in injectBloomFilter Add `ColumnPruning` in `InjectRuntimeFilter.injectBloomFilter` to optimize the BoomFilter creation query. It seems BloomFilter subqueries injected by `InjectRuntimeFilter` will read as many columns as filterCreationSidePlan. This does not match "Only scan the required columns" as the design said. We can check this by a simple case in `InjectRuntimeFilterSuite`: ```scala withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_ENABLED.key -> "true", SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000", SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") { val query = "select * from bf1 join bf2 on bf1.c1 = bf2.c2 where bf2.a2 = 62" sql(query).explain() } ``` The reason is subqueries have not been optimized by `ColumnPruning`, and this pr will fix it. No, not released Improve the test by adding `columnPruningTakesEffect` to check the optimizedPlan of bloom filter join. Closes apache#36047 from Flyangz/SPARK-32268-FOllOWUP. Authored-by: Yang Liu <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit c98725a) Signed-off-by: Yuming Wang <[email protected]> * [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession This PR proposes: 1. Use the function registry in the Spark Session being used 2. Move function registration into `beforeAll` Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1. No, test-only. Unittests fixed. Closes apache#36576 from HyukjinKwon/SPARK-32268-followup. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit c5351f8) Signed-off-by: Hyukjin Kwon <[email protected]>

gengliangwang requested review from HyukjinKwon, cloud-fan and maropu November 5, 2020 07:56

github-actions bot added the SQL label Nov 5, 2020

gengliangwang changed the title ~~[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode~~ [WIP][SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode Nov 5, 2020

gengliangwang changed the title ~~[WIP][SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode~~ [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode Nov 5, 2020

ansi canCast

18b49bf

gengliangwang force-pushed the ansiCanCast branch from d514f86 to 18b49bf Compare November 9, 2020 15:53

maropu reviewed Nov 10, 2020

View reviewed changes

gengliangwang added 3 commits November 11, 2020 19:16

add comment; rename requiredAnsiEnabledForOverflowTestCases

f74c488

reorder

7bfb1a6

upper case

e6faf4b

update doc

33452cd

github-actions bot added the DOCS label Nov 11, 2020

maropu closed this in 9a4c790 Nov 19, 2020

revans2 mentioned this pull request Nov 19, 2020

[BUG] ansi_cast tests are failing in 3.1.0 NVIDIA/spark-rapids#1164

Closed

This was referenced Nov 20, 2020

[SPARK-33496][SQL]Improve error message of ANSI explicit cast #30440

Closed

[SPARK-33549][SQL] Remove configuration spark.sql.legacy.allowCastNumericToTimestamp #30493

Closed

gengliangwang mentioned this pull request Jan 14, 2021

[SPARK-33354][FOLLOWUP][DOC] Shorten the table width of ANSI compliance casting document #31180

Closed

gengliangwang mentioned this pull request Mar 30, 2021

[SPARK-34856][FOLLOWUP][SQL] Remove dead code from AnsiCast.typeCheckFailureMessage #32004

Closed

This was referenced Mar 23, 2022

[SPARK-32268][SQL] Row-level Runtime Filtering #35789

Closed

[SPARK-32268][TESTS][FOLLOWUP] Fix BloomFilterAggregateQuerySuite failed in ansi mode #35953

Closed


		override protected val ansiEnabled: Boolean = SQLConf.get.ansiEnabled

		override def canCast(from: DataType, to: DataType): Boolean = if (ansiEnabled) {

	case (StringType, _: NumericType) => true
	case (BooleanType, _: NumericType) => true
	case (DateType, _: NumericType) => true
	case (TimestampType, _: NumericType) => true
	case (_: NumericType, _: NumericType) => true

[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode #30260

[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode #30260

Uh oh!

Conversation

gengliangwang commented Nov 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SparkQA commented Nov 5, 2020

Uh oh!

SparkQA commented Nov 5, 2020

Uh oh!

SparkQA commented Nov 5, 2020

Uh oh!

SparkQA commented Nov 5, 2020

Uh oh!

SparkQA commented Nov 5, 2020

Uh oh!

SparkQA commented Nov 5, 2020

Uh oh!

maropu commented Nov 6, 2020

Uh oh!

SparkQA commented Nov 9, 2020

Uh oh!

SparkQA commented Nov 9, 2020

Uh oh!

SparkQA commented Nov 9, 2020

Uh oh!

maropu left a comment

Choose a reason for hiding this comment

Uh oh!

maropu Nov 10, 2020

Choose a reason for hiding this comment

Uh oh!

gengliangwang Nov 11, 2020

Choose a reason for hiding this comment

Uh oh!

maropu Nov 10, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

maropu Nov 10, 2020

Choose a reason for hiding this comment

Uh oh!

maropu Nov 10, 2020

Choose a reason for hiding this comment

Uh oh!

gengliangwang Nov 11, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SparkQA commented Nov 11, 2020

Uh oh!

gengliangwang commented Nov 11, 2020

Uh oh!

SparkQA commented Nov 11, 2020

Uh oh!

SparkQA commented Nov 11, 2020

Uh oh!

SparkQA commented Nov 11, 2020

Uh oh!

SparkQA commented Nov 18, 2020

Uh oh!

SparkQA commented Nov 18, 2020

Uh oh!

gengliangwang commented Nov 18, 2020

Uh oh!

SparkQA commented Nov 18, 2020

Uh oh!

SparkQA commented Nov 18, 2020

Uh oh!

SparkQA commented Nov 18, 2020

Uh oh!

maropu commented Nov 19, 2020

Uh oh!

gengliangwang commented Nov 5, 2020 •

edited

Loading