-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25714] Fix Null Handling in the Optimizer rule BooleanSimplification #22702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #97284 has finished for PR 22702 at commit
|
|
retest this please |
| case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) | ||
| case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) | ||
| case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) | ||
| case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming a is null, then b is also null.
If c is null: a And (b Or c) -> null, And(a, c) -> null
If c is true: a And (b Or c) -> null, And(a, c) -> null
if c is false: a And (b Or c) -> null, And(a, c) -> false
So yes this is a bug, and we should rewrite it to If(IsNull(a), null, And(a, c)), because if a is null, the result is always null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is complicated, shall we put a comment to explain it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after more thoughts, a And (b Or c) should be better than If(IsNull(a), null, And(a, c)), as it's more likely to get pushed down to data source, so the changes here LGTM
|
Test build #97288 has finished for PR 22702 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the wrong example in the PR description?
- val df1 = Seq(("abc", 1), (null, 2)).toDF("col1", "col2")
+ val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
| case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c) | ||
| case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c) | ||
|
|
||
| case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these shouldn't be a problem, since if a is true, then a Or b is true, regardless of b's value/nullability, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem is when a is null, c is true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now, sorry. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it is the other case where the change is not needed, right?
a And (b Or c) -> And(a, c) when a is null, And(a, c) returns null (I got a bit confused earlier, sorry).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when a is null, And(a, c) returns null
This is not always the case, null && false is false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, yes you're right, this might be a problem indeed if the expression is inside a not. Sorry, thanks.
|
Test build #97328 has finished for PR 22702 at commit
|
…cation
## What changes were proposed in this pull request?
```Scala
val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
val df2 = spark.read.parquet("/tmp/test1")
df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```
Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.
## How was this patch tested?
Added test cases
Closes #22702 from gatorsmile/fixBooleanSimplify2.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit c9ba59d)
Signed-off-by: gatorsmile <[email protected]>
…cation
## What changes were proposed in this pull request?
```Scala
val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
val df2 = spark.read.parquet("/tmp/test1")
df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```
Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.
## How was this patch tested?
Added test cases
Closes #22702 from gatorsmile/fixBooleanSimplify2.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit c9ba59d)
Signed-off-by: gatorsmile <[email protected]>
|
Thanks! Merged to master/2.4 |
| case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) | ||
| case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) | ||
| case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) | ||
| // The following optimization is applicable only when the operands are nullable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: only when the operands are not nullable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
…ooleanSimplification This PR is to backport #22702 to branch 2.3. --- ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release. ## How was this patch tested? Added test cases Closes #22718 from gatorsmile/cherrypickSPARK-25714. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ooleanSimplification This PR is to backport #22702 to branch 2.2. --- ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release. ## How was this patch tested? Added test cases Closes #22719 from gatorsmile/cherrypickSpark-257142.2. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…cation
## What changes were proposed in this pull request?
```Scala
val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
val df2 = spark.read.parquet("/tmp/test1")
df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```
Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.
## How was this patch tested?
Added test cases
Closes apache#22702 from gatorsmile/fixBooleanSimplify2.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…ooleanSimplification This PR is to backport apache#22702 to branch 2.2. --- ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release. ## How was this patch tested? Added test cases Closes apache#22719 from gatorsmile/cherrypickSpark-257142.2. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ooleanSimplification This PR is to backport apache#22702 to branch 2.2. --- ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release. ## How was this patch tested? Added test cases Closes apache#22719 from gatorsmile/cherrypickSpark-257142.2. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Before the PR, it returns both rows. After the fix, it returns
Row ("abc", 1)). This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.How was this patch tested?
Added test cases