Skip to content

Conversation

@gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Nov 5, 2020

What changes were proposed in this pull request?

In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types.
image

Comparing the ANSI CAST syntax rules with the current default behavior of Spark:
image

To make Spark's ANSI mode more ANSI SQL Compatible,I propose to disallow the following casting in ANSI mode:

TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct

The following castings are considered invalid in ANSI SQL standard, but they are quite straight forward. Let's Allow them for now

Numeric <=> Boolean
String <=> Binary

Why are the changes needed?

Better ANSI SQL compliance

Does this PR introduce any user-facing change?

Yes, the following castings will not be allowed in ANSI mode:

TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct

How was this patch tested?

Unit test

The ANSI Compliance doc preview:
image

@github-actions github-actions bot added the SQL label Nov 5, 2020
@gengliangwang gengliangwang changed the title [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode [WIP][SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode Nov 5, 2020
@SparkQA
Copy link

SparkQA commented Nov 5, 2020

Test build #130642 has finished for PR 30260 at commit fe0fe48.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35252/

@SparkQA
Copy link

SparkQA commented Nov 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35252/

@gengliangwang gengliangwang changed the title [WIP][SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode [SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode Nov 5, 2020
@SparkQA
Copy link

SparkQA commented Nov 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35261/

@SparkQA
Copy link

SparkQA commented Nov 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35261/

@SparkQA
Copy link

SparkQA commented Nov 5, 2020

Test build #130650 has finished for PR 30260 at commit d514f86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Nov 6, 2020

Thanks for the request, @gengliangwang ! I'll review this later.

@SparkQA
Copy link

SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35405/

@SparkQA
Copy link

SparkQA commented Nov 9, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35405/

@SparkQA
Copy link

SparkQA commented Nov 9, 2020

Test build #130796 has finished for PR 30260 at commit 18b49bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think we need to update the migration guide, too.


override protected val ansiEnabled: Boolean = SQLConf.get.ansiEnabled

override def canCast(from: DataType, to: DataType): Boolean = if (ansiEnabled) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about describing this new behaviour in the usage above of ExpressionDescription?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, then we need to mention about the behavior of throwing overflow exceptions when ANSI flag enabled. I will add some content in the sql-ref-ansi-compliance.md

override def canCast(from: DataType, to: DataType): Boolean = AnsiCast.canCast(from, to)
}

object AnsiCast {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave some comments to summarize the current behaivour of the ANSI explicit cast as described in the PR description (references section 6.13 of the ANSI SQL standard and differences from the standard, e.g., Numeric <=> Boolean) ?

Comment on lines +1795 to +1797
case (_: NumericType, _: NumericType) => true
case (StringType, _: NumericType) => true
case (BooleanType, _: NumericType) => true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just a suggestion) For readability, could you reorder these entries according to Cast.canCast where possible? For example, the numeric entries for Cast seems to be placed just before complex types:

case (StringType, _: NumericType) => true
case (BooleanType, _: NumericType) => true
case (DateType, _: NumericType) => true
case (TimestampType, _: NumericType) => true
case (_: NumericType, _: NumericType) => true

checkEvaluation(cast("abcd", DecimalType(38, 1)), null)
}

test("SPARK-22825 Cast array to string") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit and this is not related to this PR though...) SPARK-22825 -> SPARK-22825:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer keeping them unchanged in this PR

@SparkQA
Copy link

SparkQA commented Nov 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35543/

@github-actions github-actions bot added the DOCS label Nov 11, 2020
@gengliangwang
Copy link
Member Author

@maropu thanks for the suggestions. I have updated the code. Please take another look.

@SparkQA
Copy link

SparkQA commented Nov 11, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35543/

@SparkQA
Copy link

SparkQA commented Nov 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35546/

@SparkQA
Copy link

SparkQA commented Nov 11, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35546/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35888/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Test build #131283 has finished for PR 30260 at commit 6003bef.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AnsiCastSuiteWithAnsiModeOn extends AnsiCastSuiteBase
  • class AnsiCastSuiteWithAnsiModeOff extends AnsiCastSuiteBase

@gengliangwang
Copy link
Member Author

retest this please.

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35902/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35902/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Test build #131300 has finished for PR 30260 at commit 6003bef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AnsiCastSuiteWithAnsiModeOn extends AnsiCastSuiteBase
  • class AnsiCastSuiteWithAnsiModeOff extends AnsiCastSuiteBase

@maropu maropu closed this in 9a4c790 Nov 19, 2020
@maropu
Copy link
Member

maropu commented Nov 19, 2020

Thanks! Merged to master.

@dongjoon-hyun
Copy link
Member

Thank you, @gengliangwang and all!

@revans2
Copy link
Contributor

revans2 commented Nov 19, 2020

I just noticed that the error messages when trying to cast between timestamps and numeric types is wrong now in ANSI mode.

org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`c0` AS TIMESTAMP)' due to data type mismatch: cannot cast int to timestamp,you can enable the casting by setting spark.sql.legacy.allowCastNumericToTimestamp to true,but we strongly recommend using function TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS instead.;;

spark.sql.legacy.allowCastNumericToTimestamp has no impact on ANSI casts after this change. It looks like a fairly simple fix.

if (child.dataType.isInstanceOf[NumericType] && dataType.isInstanceOf[TimestampType]) {

@gengliangwang
Copy link
Member Author

@revans2 Thanks for reporting the problem. Actually, I am working on another PR to give suggestions for the disallowed ANSI cast. It will be sent out today and I will ping you in the PR.

@revans2
Copy link
Contributor

revans2 commented Nov 19, 2020

@gengliangwang sounds good. Thanks

cloud-fan pushed a commit that referenced this pull request Nov 25, 2020
…ericToTimestamp

### What changes were proposed in this pull request?

Remove SQL configuration spark.sql.legacy.allowCastNumericToTimestamp

### Why are the changes needed?

In the current master branch, there is a new configuration `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether to cast Numeric types to Timestamp or not. The default value is true.

After #30260, the type conversion between Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need to a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for disallowing the conversion. Users just need to set `spark.sql.ansi.enabled` for the behavior.

As the configuration is not in any released yet, we should remove the configuration to make things simpler.

### Does this PR introduce _any_ user-facing change?

No, since the configuration is not released yet.

### How was this patch tested?

Existing test cases

Closes #30493 from gengliangwang/LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
gengliangwang added a commit that referenced this pull request Nov 25, 2020
### What changes were proposed in this pull request?

After #30260, there are some type conversions disallowed under ANSI mode.
We should tell users what they can do if they have to use the disallowed casting.

### Why are the changes needed?

Make it more user-friendly.

### Does this PR introduce _any_ user-facing change?

Yes, the error message is improved on casting failure when ANSI mode is enabled
### How was this patch tested?

Unit tests.

Closes #30440 from gengliangwang/improveAnsiCastErrorMSG.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Nov 25, 2020
### What changes were proposed in this pull request?

After apache/spark#30260, there are some type conversions disallowed under ANSI mode.
We should tell users what they can do if they have to use the disallowed casting.

### Why are the changes needed?

Make it more user-friendly.

### Does this PR introduce _any_ user-facing change?

Yes, the error message is improved on casting failure when ANSI mode is enabled
### How was this patch tested?

Unit tests.

Closes #30440 from gengliangwang/improveAnsiCastErrorMSG.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jan 14, 2021
…ce casting document

### What changes were proposed in this pull request?

This is a follow-up of #30260
It shortens the table width of ANSI compliance casting document.

### Why are the changes needed?

The table is too wide and the UI of doc site is broken if we scroll the page to right side.
![Screen Shot 2021-01-14 at 3 04 57 PM](https://user-images.githubusercontent.com/1097932/104565897-d2693b80-5601-11eb-9f93-5f603cfc94c1.png)

### Does this PR introduce _any_ user-facing change?

Minor document change

### How was this patch tested?

Build doc site locally and preview:
![Screen Shot 2021-01-14 at 4 44 30 PM](https://user-images.githubusercontent.com/1097932/104565814-b2d21300-5601-11eb-94c4-78c785cda8ed.png)

Closes #31180 from gengliangwang/reviseAnsiDocStyle.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit feedd1b)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jan 14, 2021
…ce casting document

### What changes were proposed in this pull request?

This is a follow-up of #30260
It shortens the table width of ANSI compliance casting document.

### Why are the changes needed?

The table is too wide and the UI of doc site is broken if we scroll the page to right side.
![Screen Shot 2021-01-14 at 3 04 57 PM](https://user-images.githubusercontent.com/1097932/104565897-d2693b80-5601-11eb-9f93-5f603cfc94c1.png)

### Does this PR introduce _any_ user-facing change?

Minor document change

### How was this patch tested?

Build doc site locally and preview:
![Screen Shot 2021-01-14 at 4 44 30 PM](https://user-images.githubusercontent.com/1097932/104565814-b2d21300-5601-11eb-94c4-78c785cda8ed.png)

Closes #31180 from gengliangwang/reviseAnsiDocStyle.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
…ce casting document

### What changes were proposed in this pull request?

This is a follow-up of apache#30260
It shortens the table width of ANSI compliance casting document.

### Why are the changes needed?

The table is too wide and the UI of doc site is broken if we scroll the page to right side.
![Screen Shot 2021-01-14 at 3 04 57 PM](https://user-images.githubusercontent.com/1097932/104565897-d2693b80-5601-11eb-9f93-5f603cfc94c1.png)

### Does this PR introduce _any_ user-facing change?

Minor document change

### How was this patch tested?

Build doc site locally and preview:
![Screen Shot 2021-01-14 at 4 44 30 PM](https://user-images.githubusercontent.com/1097932/104565814-b2d21300-5601-11eb-94c4-78c785cda8ed.png)

Closes apache#31180 from gengliangwang/reviseAnsiDocStyle.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
wangyum pushed a commit that referenced this pull request Mar 23, 2022
…ailed in ansi mode

### What changes were proposed in this pull request?
`Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](#30260),  so the content of  `exception.getMessage` is different from that of non-ans mode.

This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA
- Local Test

**Before**

```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 23 seconds, 537 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
- Test that might_contain errors out non-constant Bloom filter *** FAILED ***
  "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch:
   cannot cast bigint to binary with ANSI mode on.
   If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false.
  ; line 2 pos 21;
  'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)]
  +- SubqueryAlias t
     +- LocalRelation [a#2424L]
  " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171)
```

**After**
```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 26 seconds, 544 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 25 seconds, 289 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #35953 from LuciferYang/SPARK-32268-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
wangyum pushed a commit that referenced this pull request Mar 23, 2022
…ailed in ansi mode

### What changes were proposed in this pull request?
`Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](#30260),  so the content of  `exception.getMessage` is different from that of non-ans mode.

This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA
- Local Test

**Before**

```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 23 seconds, 537 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
- Test that might_contain errors out non-constant Bloom filter *** FAILED ***
  "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch:
   cannot cast bigint to binary with ANSI mode on.
   If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false.
  ; line 2 pos 21;
  'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)]
  +- SubqueryAlias t
     +- LocalRelation [a#2424L]
  " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171)
```

**After**
```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 26 seconds, 544 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 25 seconds, 289 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #35953 from LuciferYang/SPARK-32268-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 7165123)
Signed-off-by: Yuming Wang <[email protected]>
songzhxlh-max pushed a commit to songzhxlh-max/spark that referenced this pull request Oct 12, 2022
* [SPARK-32268][SQL] Row-level Runtime Filtering

This PR proposes row-level runtime filters in Spark to reduce intermediate data volume for operators like shuffle, join and aggregate, and hence improve performance. We propose two mechanisms to do this: semi-join filters or bloom filters, and both mechanisms are proposed to co-exist side-by-side behind feature configs.
[Design Doc](https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit?usp=sharing) with more details.

With Semi-Join, we see 9 queries improve for the TPC DS 3TB benchmark, and no regressions.
With Bloom Filter, we see 10 queries improve for the TPC DS 3TB benchmark, and no regressions.

No

Added tests

Closes apache#35789 from somani/rf.

Lead-authored-by: Abhishek Somani <[email protected]>
Co-authored-by: Abhishek Somani <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1f4e4c8)
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-32268][TESTS][FOLLOWUP] Fix `BloomFilterAggregateQuerySuite` failed in ansi mode

`Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](apache#30260),  so the content of  `exception.getMessage` is different from that of non-ans mode.

This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent.

Bug fix.

No

- Pass GA
- Local Test

**Before**

```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 23 seconds, 537 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
- Test that might_contain errors out non-constant Bloom filter *** FAILED ***
  "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch:
   cannot cast bigint to binary with ANSI mode on.
   If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false.
  ; line 2 pos 21;
  'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)]
  +- SubqueryAlias t
     +- LocalRelation [a#2424L]
  " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171)
```

**After**
```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 26 seconds, 544 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 25 seconds, 289 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes apache#35953 from LuciferYang/SPARK-32268-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 7165123)
Signed-off-by: Yuming Wang <[email protected]>

* [SPARK-32268][SQL][FOLLOWUP] Add RewritePredicateSubquery below the InjectRuntimeFilter

Add `RewritePredicateSubquery` below the `InjectRuntimeFilter` in `SparkOptimizer`.

It seems if the runtime use in-subquery to do the filter, it won't be converted to semi-join as the design said.

This pr fixes the issue.

No, not released

Improve the test by adding: ensure the semi-join exists if the runtime filter use in-subquery code path.

Closes apache#35998 from ulysses-you/SPARK-32268-FOllOWUP.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c0c52dd)
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-32268][SQL][FOLLOWUP] Add ColumnPruning in injectBloomFilter

Add `ColumnPruning` in `InjectRuntimeFilter.injectBloomFilter` to optimize the BoomFilter creation query.

It seems BloomFilter subqueries injected by `InjectRuntimeFilter` will read as many columns as filterCreationSidePlan. This does not match "Only scan the required columns" as the design said. We can check this by a simple case in `InjectRuntimeFilterSuite`:
```scala
withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_ENABLED.key -> "true",
  SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000",
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") {
  val query = "select * from bf1 join bf2 on bf1.c1 = bf2.c2 where bf2.a2 = 62"
  sql(query).explain()
}
```
The reason is subqueries have not been optimized by `ColumnPruning`, and this pr will fix it.

No, not released

Improve the test by adding `columnPruningTakesEffect` to check the optimizedPlan of bloom filter join.

Closes apache#36047 from Flyangz/SPARK-32268-FOllOWUP.

Authored-by: Yang Liu <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit c98725a)
Signed-off-by: Yuming Wang <[email protected]>

* [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

This PR proposes:
1. Use the function registry in the Spark Session being used
2. Move function registration into `beforeAll`

Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1.

No, test-only.

Unittests fixed.

Closes apache#36576 from HyukjinKwon/SPARK-32268-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit c5351f8)
Signed-off-by: Hyukjin Kwon <[email protected]>
songzhxlh-max added a commit to Kyligence/spark that referenced this pull request Oct 13, 2022
* [SPARK-39857][SQL] V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate (#535)

### What changes were proposed in this pull request?
When building V2 `In` Predicate in `V2ExpressionBuilder`, `InSet.dataType` (which is `BooleanType`) is used to build the `LiteralValue`, `InSet.child.dataType` should be used instead.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
new test

Closes apache#37271 from huaxingao/inset.

Authored-by: huaxingao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: huaxingao <[email protected]>

* [SPARK-32268][SQL] Row-level Runtime Filtering

* [SPARK-32268][SQL] Row-level Runtime Filtering

This PR proposes row-level runtime filters in Spark to reduce intermediate data volume for operators like shuffle, join and aggregate, and hence improve performance. We propose two mechanisms to do this: semi-join filters or bloom filters, and both mechanisms are proposed to co-exist side-by-side behind feature configs.
[Design Doc](https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit?usp=sharing) with more details.

With Semi-Join, we see 9 queries improve for the TPC DS 3TB benchmark, and no regressions.
With Bloom Filter, we see 10 queries improve for the TPC DS 3TB benchmark, and no regressions.

No

Added tests

Closes apache#35789 from somani/rf.

Lead-authored-by: Abhishek Somani <[email protected]>
Co-authored-by: Abhishek Somani <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1f4e4c8)
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-32268][TESTS][FOLLOWUP] Fix `BloomFilterAggregateQuerySuite` failed in ansi mode

`Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](apache#30260),  so the content of  `exception.getMessage` is different from that of non-ans mode.

This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent.

Bug fix.

No

- Pass GA
- Local Test

**Before**

```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 23 seconds, 537 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
- Test that might_contain errors out non-constant Bloom filter *** FAILED ***
  "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch:
   cannot cast bigint to binary with ANSI mode on.
   If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false.
  ; line 2 pos 21;
  'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)]
  +- SubqueryAlias t
     +- LocalRelation [a#2424L]
  " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171)
```

**After**
```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 26 seconds, 544 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 25 seconds, 289 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes apache#35953 from LuciferYang/SPARK-32268-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 7165123)
Signed-off-by: Yuming Wang <[email protected]>

* [SPARK-32268][SQL][FOLLOWUP] Add RewritePredicateSubquery below the InjectRuntimeFilter

Add `RewritePredicateSubquery` below the `InjectRuntimeFilter` in `SparkOptimizer`.

It seems if the runtime use in-subquery to do the filter, it won't be converted to semi-join as the design said.

This pr fixes the issue.

No, not released

Improve the test by adding: ensure the semi-join exists if the runtime filter use in-subquery code path.

Closes apache#35998 from ulysses-you/SPARK-32268-FOllOWUP.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c0c52dd)
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-32268][SQL][FOLLOWUP] Add ColumnPruning in injectBloomFilter

Add `ColumnPruning` in `InjectRuntimeFilter.injectBloomFilter` to optimize the BoomFilter creation query.

It seems BloomFilter subqueries injected by `InjectRuntimeFilter` will read as many columns as filterCreationSidePlan. This does not match "Only scan the required columns" as the design said. We can check this by a simple case in `InjectRuntimeFilterSuite`:
```scala
withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_ENABLED.key -> "true",
  SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000",
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") {
  val query = "select * from bf1 join bf2 on bf1.c1 = bf2.c2 where bf2.a2 = 62"
  sql(query).explain()
}
```
The reason is subqueries have not been optimized by `ColumnPruning`, and this pr will fix it.

No, not released

Improve the test by adding `columnPruningTakesEffect` to check the optimizedPlan of bloom filter join.

Closes apache#36047 from Flyangz/SPARK-32268-FOllOWUP.

Authored-by: Yang Liu <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit c98725a)
Signed-off-by: Yuming Wang <[email protected]>

* [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

This PR proposes:
1. Use the function registry in the Spark Session being used
2. Move function registration into `beforeAll`

Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1.

No, test-only.

Unittests fixed.

Closes apache#36576 from HyukjinKwon/SPARK-32268-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit c5351f8)
Signed-off-by: Hyukjin Kwon <[email protected]>

* KE-29673 add segment prune function for bloom runtime filter

fix min/max for UTF8String collection

valid the runtime filter if need when broadcast join is valid

* AL-6084 in Cast for method of canCast, when DecimalType cast to DoubleType add transformable logic (#542)

* AL-6084 in Cast for method of canCast, when DecimalType cast DecimalType to DoubleType add suit logical

Signed-off-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Zhixiong Chen <[email protected]>
Co-authored-by: huaxingao <[email protected]>
Co-authored-by: Bowen Song <[email protected]>
leejaywei pushed a commit to Kyligence/spark that referenced this pull request Oct 18, 2022
* [SPARK-32268][SQL] Row-level Runtime Filtering

This PR proposes row-level runtime filters in Spark to reduce intermediate data volume for operators like shuffle, join and aggregate, and hence improve performance. We propose two mechanisms to do this: semi-join filters or bloom filters, and both mechanisms are proposed to co-exist side-by-side behind feature configs.
[Design Doc](https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit?usp=sharing) with more details.

With Semi-Join, we see 9 queries improve for the TPC DS 3TB benchmark, and no regressions.
With Bloom Filter, we see 10 queries improve for the TPC DS 3TB benchmark, and no regressions.

No

Added tests

Closes apache#35789 from somani/rf.

Lead-authored-by: Abhishek Somani <[email protected]>
Co-authored-by: Abhishek Somani <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1f4e4c8)
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-32268][TESTS][FOLLOWUP] Fix `BloomFilterAggregateQuerySuite` failed in ansi mode

`Test that might_contain errors out non-constant Bloom filter` in `BloomFilterAggregateQuerySuite ` failed in ansi mode due to `Numeric <=> Binary` is [not allowed in ansi mode](apache#30260),  so the content of  `exception.getMessage` is different from that of non-ans mode.

This pr change the case to ensure that the error messages of `ansi` mode and `non-ansi` are consistent.

Bug fix.

No

- Pass GA
- Local Test

**Before**

```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 23 seconds, 537 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
- Test that might_contain errors out non-constant Bloom filter *** FAILED ***
  "cannot resolve 'CAST(t.a AS BINARY)' due to data type mismatch:
   cannot cast bigint to binary with ANSI mode on.
   If you have to cast bigint to binary, you can set spark.sql.ansi.enabled as false.
  ; line 2 pos 21;
  'Project [unresolvedalias('might_contain(cast(a#2424L as binary), cast(5 as bigint)), None)]
  +- SubqueryAlias t
     +- LocalRelation [a#2424L]
  " did not contain "The Bloom filter binary input to might_contain should be either a constant value or a scalar subquery expression" (BloomFilterAggregateQuerySuite.scala:171)
```

**After**
```
export SPARK_ANSI_SQL_MODE=false
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 26 seconds, 544 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

```

```
export SPARK_ANSI_SQL_MODE=true
mvn clean test -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.BloomFilterAggregateQuerySuite
```

```
Run completed in 25 seconds, 289 milliseconds.
Total number of tests run: 8
Suites: completed 2, aborted 0
Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes apache#35953 from LuciferYang/SPARK-32268-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 7165123)
Signed-off-by: Yuming Wang <[email protected]>

* [SPARK-32268][SQL][FOLLOWUP] Add RewritePredicateSubquery below the InjectRuntimeFilter

Add `RewritePredicateSubquery` below the `InjectRuntimeFilter` in `SparkOptimizer`.

It seems if the runtime use in-subquery to do the filter, it won't be converted to semi-join as the design said.

This pr fixes the issue.

No, not released

Improve the test by adding: ensure the semi-join exists if the runtime filter use in-subquery code path.

Closes apache#35998 from ulysses-you/SPARK-32268-FOllOWUP.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c0c52dd)
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-32268][SQL][FOLLOWUP] Add ColumnPruning in injectBloomFilter

Add `ColumnPruning` in `InjectRuntimeFilter.injectBloomFilter` to optimize the BoomFilter creation query.

It seems BloomFilter subqueries injected by `InjectRuntimeFilter` will read as many columns as filterCreationSidePlan. This does not match "Only scan the required columns" as the design said. We can check this by a simple case in `InjectRuntimeFilterSuite`:
```scala
withSQLConf(SQLConf.RUNTIME_BLOOM_FILTER_ENABLED.key -> "true",
  SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000",
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000") {
  val query = "select * from bf1 join bf2 on bf1.c1 = bf2.c2 where bf2.a2 = 62"
  sql(query).explain()
}
```
The reason is subqueries have not been optimized by `ColumnPruning`, and this pr will fix it.

No, not released

Improve the test by adding `columnPruningTakesEffect` to check the optimizedPlan of bloom filter join.

Closes apache#36047 from Flyangz/SPARK-32268-FOllOWUP.

Authored-by: Yang Liu <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit c98725a)
Signed-off-by: Yuming Wang <[email protected]>

* [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession

This PR proposes:
1. Use the function registry in the Spark Session being used
2. Move function registration into `beforeAll`

Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1.

No, test-only.

Unittests fixed.

Closes apache#36576 from HyukjinKwon/SPARK-32268-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit c5351f8)
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants