[SPARK-25714] Fix Null Handling in the Optimizer rule BooleanSimplification #22702

gatorsmile · 2018-10-11T21:43:19Z

What changes were proposed in this pull request?

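    // toDF relies on spark.implicits._, which spark-shell imports automatically
    import org.apache.spark.sql.SaveMode

    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()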

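Before the PR, it returns both rows. After the fix, it returns only Row("abc", 1). For the row (null, 3), both col1 = 'abc' and col1 != 'abc' evaluate to NULL, so the whole predicate is NULL and the row must be filtered out. This PR fixes that bug in the NULL handling of BooleanSimplification, which was introduced in the Spark 1.6 release.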
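To make the failure concrete, here is a minimal sketch that models SQL three-valued logic with Option[Boolean] (None standing for NULL) and evaluates the filter above both as written and as the pre-fix rule would simplify it. The object FilterNullDemo and its helpers (and3, or3, eqAbc, neAbc, keeps) are hypothetical names for illustration, not Spark APIs, and the sketch assumes the rule matched col1 != 'abc' as Not(col1 = 'abc').

```scala
// Sketch only: SQL three-valued logic modeled with Option[Boolean] (None = NULL).
// All names here are hypothetical; nothing below is a Spark API.
object FilterNullDemo {
  type B3 = Option[Boolean]

  def and3(x: B3, y: B3): B3 = (x, y) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or3(x: B3, y: B3): B3 = (x, y) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // Comparing a NULL string with a literal yields NULL.
  def eqAbc(col1: Option[String]): B3 = col1.map(_ == "abc")
  def neAbc(col1: Option[String]): B3 = col1.map(_ != "abc")

  // col1 = 'abc' OR (col1 != 'abc' AND col2 == 3), evaluated as written.
  def original(col1: Option[String], col2: Int): B3 =
    or3(eqAbc(col1), and3(neAbc(col1), Some(col2 == 3)))

  // The pre-fix simplification: a Or (Not(a) And c) => a Or c.
  def simplified(col1: Option[String], col2: Int): B3 =
    or3(eqAbc(col1), Some(col2 == 3))

  // A filter keeps a row only when the predicate is exactly true.
  def keeps(p: B3): Boolean = p.contains(true)

  def main(args: Array[String]): Unit = {
    val rows: Seq[(Option[String], Int)] = Seq((Some("abc"), 1), (None, 3))
    for ((c1, c2) <- rows)
      println(s"$c1, $c2: original keeps=${keeps(original(c1, c2))}, " +
        s"simplified keeps=${keeps(simplified(c1, c2))}")
  }
}
```

In this model the row (null, 3) is dropped by the predicate as written but kept by the simplified form, which matches the "returns both rows" behavior described above.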

How was this patch tested?

Added test cases

SparkQA · 2018-10-11T23:46:21Z

Test build #97284 has finished for PR 22702 at commit a9359ab.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-10-12T00:30:55Z

retest this please

cloud-fan · 2018-10-12T02:27:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

-      case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-      case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-      case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+      case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c)


Assuming a is null, then b (which is Not(a)) is also null.
If c is null: a And (b Or c) -> null, And(a, c) -> null
If c is true: a And (b Or c) -> null, And(a, c) -> null
If c is false: a And (b Or c) -> null, And(a, c) -> false

So yes, this is a bug, and we should rewrite it to If(IsNull(a), null, And(a, c)), because if a is null, the result is always null.
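The case analysis above can be checked mechanically. The following is a minimal sketch, again modeling NULL as None in Option[Boolean], that enumerates every combination of a and c (with b fixed to Not(a)) and lists the inputs where a rewrite disagrees with a And (b Or c). The names AndRewriteTable, unguarded, ifNullForm, and mismatches are hypothetical and only illustrate the reasoning; they are not Catalyst code.

```scala
// Sketch only: exhaustive check of the And rewrite under three-valued logic.
// None stands for NULL; all names are hypothetical, not Spark APIs.
object AndRewriteTable {
  type B3 = Option[Boolean]

  def not3(x: B3): B3 = x.map(!_)

  def and3(x: B3, y: B3): B3 = (x, y) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or3(x: B3, y: B3): B3 = (x, y) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  val values: Seq[B3] = Seq(Some(true), Some(false), None)

  // a And (b Or c) with b = Not(a).
  def original(a: B3, c: B3): B3 = and3(a, or3(not3(a), c))

  // The rewrite without a nullability guard: And(a, c).
  def unguarded(a: B3, c: B3): B3 = and3(a, c)

  // The alternative floated in the review: If(IsNull(a), null, And(a, c)).
  def ifNullForm(a: B3, c: B3): B3 = if (a.isEmpty) None else and3(a, c)

  // All (a, c) pairs on which a candidate rewrite differs from the original.
  def mismatches(rewrite: (B3, B3) => B3): Seq[(B3, B3)] =
    for (a <- values; c <- values if original(a, c) != rewrite(a, c)) yield (a, c)

  def main(args: Array[String]): Unit = {
    println(s"And(a, c) differs on: ${mismatches(unguarded)}")
    println(s"If(IsNull(a), null, And(a, c)) differs on: ${mismatches(ifNullForm)}")
  }
}
```

Under this model the unguarded rewrite differs only for a = null, c = false, which is the last row of the table above, while the If(IsNull(a), ...) form agrees on all nine inputs.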

Since this is complicated, shall we put a comment to explain it?

After more thought, keeping a And (b Or c) is better than rewriting to If(IsNull(a), null, And(a, c)), as it's more likely to get pushed down to the data source, so the changes here LGTM.

SparkQA · 2018-10-12T04:12:21Z

Test build #97288 has finished for PR 22702 at commit a9359ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Could you fix the wrong example in the PR description?

- val df1 = Seq(("abc", 1), (null, 2)).toDF("col1", "col2")
+ val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")

mgaido91 · 2018-10-12T08:13:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+      case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c)
+      case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c)
+
+      case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c)


These shouldn't be a problem, since if a is true, then a Or b is true regardless of b's value/nullability, right?

The problem is when a is null and c is true.

I see now, sorry. Thanks.

Sorry, it is the other case where the change is not needed, right?
For a And (b Or c) -> And(a, c): when a is null, And(a, c) returns null (I got a bit confused earlier, sorry).

when a is null, And(a, c) returns null

This is not always the case: null && false is false.

Oh yes, you're right. This could indeed be a problem if the expression is inside a Not. Sorry, thanks.
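A small sketch of the two points raised in this exchange, using the same Option[Boolean] model of NULL (None). It assumes, as stated earlier in the thread, that b is Not(a), and that a filter keeps a row only when its predicate is exactly true. NotWrapDemo and its members are hypothetical names, not Spark code.

```scala
// Sketch only: why null vs. false matters once the expression sits under a Not.
// None stands for NULL; all names are hypothetical, not Spark APIs.
object NotWrapDemo {
  type B3 = Option[Boolean]

  def not3(x: B3): B3 = x.map(!_)

  def and3(x: B3, y: B3): B3 = (x, y) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or3(x: B3, y: B3): B3 = (x, y) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // A filter keeps a row only when the predicate is exactly true.
  def keeps(p: B3): Boolean = p.contains(true)

  val a: B3 = None // a is null, so Not(a) is null as well

  // Or case with c = true: a Or (Not(a) And c) versus Or(a, c).
  val orOriginal: B3  = or3(a, and3(not3(a), Some(true)))
  val orRewritten: B3 = or3(a, Some(true))

  // And case with c = false: a And (Not(a) Or c) versus And(a, c).
  val andOriginal: B3  = and3(a, or3(not3(a), Some(false)))
  val andRewritten: B3 = and3(a, Some(false))

  def main(args: Array[String]): Unit = {
    println(s"Or case:  original=$orOriginal, rewritten=$orRewritten")
    println(s"And case: original=$andOriginal, rewritten=$andRewritten")
    println(s"And case under Not: original keeps=${keeps(not3(andOriginal))}, " +
      s"rewritten keeps=${keeps(not3(andRewritten))}")
  }
}
```

In the Or case the two forms already disagree at the top level of a filter (null versus true). In the And case they differ only as null versus false, which a top-level filter treats alike, but wrapping both in Not turns that into null versus true, so the row selection changes.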

SparkQA · 2018-10-13T03:42:51Z

Test build #97328 has finished for PR 22702 at commit ca3172f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…cation

Closes #22702 from gatorsmile/fixBooleanSimplify2.

Authored-by: gatorsmile
Signed-off-by: gatorsmile
(cherry picked from commit c9ba59d)
Signed-off-by: gatorsmile

gatorsmile · 2018-10-13T04:05:17Z

Thanks! Merged to master/2.4

cloud-fan · 2018-10-13T05:48:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

-      case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
-      case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
-      case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
+      // The following optimization is applicable only when the operands are nullable,


typo: only when the operands are not nullable

…ooleanSimplification

This PR is to backport #22702 to branch 2.3.

Closes #22718 from gatorsmile/cherrypickSPARK-25714.

Authored-by: gatorsmile
Signed-off-by: Wenchen Fan

…ooleanSimplification

This PR is to backport #22702 to branch 2.2.

Closes #22719 from gatorsmile/cherrypickSpark-257142.2.

Authored-by: gatorsmile
Signed-off-by: Wenchen Fan

gatorsmile added 2 commits October 11, 2018 14:38

5d1dde1 fix
a9359ab style

cloud-fan reviewed Oct 12, 2018

dongjoon-hyun reviewed Oct 12, 2018

mgaido91 reviewed Oct 12, 2018

ca3172f ADD COMMENTS.

asfgit closed this in c9ba59d Oct 13, 2018

cloud-fan reviewed Oct 13, 2018

This was referenced Oct 14, 2018

[SPARK-25714] [BACKPORT-2.3] Fix Null Handling in the Optimizer rule BooleanSimplification #22718

Closed

[SPARK-25714] [BACKPORT-2.2] Fix Null Handling in the Optimizer rule BooleanSimplification #22719

Closed

cloud-fan mentioned this pull request Oct 29, 2018

[SPARK-25860][SQL] Replace Literal(null, _) with FalseLiteral whenever possible #22857

Closed
