Skip to content

Conversation

@gatorsmile
Copy link
Member

@gatorsmile gatorsmile commented Oct 11, 2018

What changes were proposed in this pull request?

    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()

Before the PR, it returns both rows. After the fix, it returns Row ("abc", 1)). This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

How was this patch tested?

Added test cases

@SparkQA
Copy link

SparkQA commented Oct 11, 2018

Test build #97284 has finished for PR 22702 at commit a9359ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

retest this please

case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c)
Copy link
Contributor

@cloud-fan cloud-fan Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming a is null, then b is also null.
If c is null: a And (b Or c) -> null, And(a, c) -> null
If c is true: a And (b Or c) -> null, And(a, c) -> null
if c is false: a And (b Or c) -> null, And(a, c) -> false

So yes this is a bug, and we should rewrite it to If(IsNull(a), null, And(a, c)), because if a is null, the result is always null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is complicated, shall we put a comment to explain it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after more thoughts, a And (b Or c) should be better than If(IsNull(a), null, And(a, c)), as it's more likely to get pushed down to data source, so the changes here LGTM

@SparkQA
Copy link

SparkQA commented Oct 12, 2018

Test build #97288 has finished for PR 22702 at commit a9359ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix the wrong example in the PR description?

- val df1 = Seq(("abc", 1), (null, 2)).toDF("col1", "col2")
+ val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")

case (a Or b) And c if !a.nullable && a.semanticEquals(Not(c)) => And(b, c)
case (a Or b) And c if !b.nullable && b.semanticEquals(Not(c)) => And(a, c)

case a Or (b And c) if !a.nullable && Not(a).semanticEquals(b) => Or(a, c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these shouldn't be a problem, since if a is true, then a Or b is true, regardless of b's value/nullability, isn't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is when a is null, c is true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now, sorry. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it is the other case where the change is not needed, right?
a And (b Or c) -> And(a, c) when a is null, And(a, c) returns null (I got a bit confused earlier, sorry).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when a is null, And(a, c) returns null

This is not always the case, null && false is false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yes you're right, this might be a problem indeed if the expression is inside a not. Sorry, thanks.

@SparkQA
Copy link

SparkQA commented Oct 13, 2018

Test build #97328 has finished for PR 22702 at commit ca3172f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Oct 13, 2018
…cation

## What changes were proposed in this pull request?
```Scala
    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

Closes #22702 from gatorsmile/fixBooleanSimplify2.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit c9ba59d)
Signed-off-by: gatorsmile <[email protected]>
asfgit pushed a commit that referenced this pull request Oct 13, 2018
…cation

## What changes were proposed in this pull request?
```Scala
    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

Closes #22702 from gatorsmile/fixBooleanSimplify2.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit c9ba59d)
Signed-off-by: gatorsmile <[email protected]>
@gatorsmile
Copy link
Member Author

gatorsmile commented Oct 13, 2018

Thanks! Merged to master/2.4

@asfgit asfgit closed this in c9ba59d Oct 13, 2018
case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b)
case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c)
case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c)
// The following optimization is applicable only when the operands are nullable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: only when the operands are not nullable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

asfgit pushed a commit that referenced this pull request Oct 16, 2018
…ooleanSimplification

This PR is to backport #22702 to branch 2.3.

---

## What changes were proposed in this pull request?
```Scala
    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

Closes #22718 from gatorsmile/cherrypickSPARK-25714.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
asfgit pushed a commit that referenced this pull request Oct 16, 2018
…ooleanSimplification

This PR is to backport #22702 to branch 2.2.

---

## What changes were proposed in this pull request?
```Scala
    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

Closes #22719 from gatorsmile/cherrypickSpark-257142.2.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…cation

## What changes were proposed in this pull request?
```Scala
    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

Closes apache#22702 from gatorsmile/fixBooleanSimplify2.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
Willymontaz pushed a commit to criteo-forks/spark that referenced this pull request Sep 26, 2019
…ooleanSimplification

This PR is to backport apache#22702 to branch 2.2.

---

## What changes were proposed in this pull request?
```Scala
    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

Closes apache#22719 from gatorsmile/cherrypickSpark-257142.2.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Willymontaz pushed a commit to criteo-forks/spark that referenced this pull request Sep 27, 2019
…ooleanSimplification

This PR is to backport apache#22702 to branch 2.2.

---

## What changes were proposed in this pull request?
```Scala
    val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
    df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1")
    val df2 = spark.read.parquet("/tmp/test1")
    df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
```

Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release.

## How was this patch tested?
Added test cases

Closes apache#22719 from gatorsmile/cherrypickSpark-257142.2.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants