[SPARK-25388][Test][SQL] Detect incorrect nullable of DataType in the result #22375

kiszk · 2018-09-10T01:54:29Z

What changes were proposed in this pull request?

This PR makes the test framework raise an assertion failure when a function under test produces a result whose DataType has an incorrect nullable flag.

Consider the following example. Suppose a developer writes incorrect code that returns an unexpected result. This test should fail, because valueContainsNull=false while expr contains null. Without this PR, however, the test passes. With this PR, it fails as it should.

test("test TARGETFUNCTON") {
  val expr = TARGETMAPFUNCTON()
  // expr = UnsafeMap(3 -> 6, 7 -> null)
  // expr.dataType = MapType(IntegerType, IntegerType, valueContainsNull = false)

  val expected = Map(3 -> 6, 7 -> null)
  checkEvaluation(expr, expected)
}

In checkEvaluationWithUnsafeProjection, the results are compared as UnsafeRow. The given expected value is converted to UnsafeRow using the DataType of expr.

val expectedRow = UnsafeProjection.create(Array(expression.dataType, expression.dataType)).apply(lit)

In summary, expr is [0,1800000038,5000000038,18,2,0,700000003,2,0,6,18,2,0,700000003,2,0,6] both with and w/o this PR, while expected is converted to:

w/o this PR, [0,1800000038,5000000038,18,2,0,700000003,2,0,6,18,2,0,700000003,2,0,6]
with this PR, [0,1800000038,5000000038,18,2,0,700000003,2,2,6,18,2,0,700000003,2,2,6]

As a result, w/o this PR, the test unexpectedly passes.

This is because, w/o this PR, the projection code generated for expected follows the given dataType and never sets the null bit.

// tmpInput_2 is expected
for (int index_1 = 0; index_1 < numElements_1; index_1++) {
  mutableStateArray_1[1].write(index_1, tmpInput_2.getInt(index_1));
}

With this PR, the projection code generated for expected always calls isNullAt to check whether the null bit should be set.

// tmpInput_2 is expected
for (int index_1 = 0; index_1 < numElements_1; index_1++) {
  if (tmpInput_2.isNullAt(index_1)) {
    mutableStateArray_1[1].setNull4Bytes(index_1);
  } else {
    mutableStateArray_1[1].write(index_1, tmpInput_2.getInt(index_1));
  }
}
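
To make the effect concrete, here is a minimal Scala sketch of the comparison described above. It is not Spark code: `Slot`, `project`, and the `checkNull` flag are hypothetical names, and storing payload 0 for a null slot is an assumption chosen to match the dumps shown earlier.

```scala
object NullBitSketch {
  /** One encoded array slot: a null bit plus a 4-byte payload (hypothetical model). */
  final case class Slot(isNull: Boolean, payload: Int)

  /**
   * Hypothetical stand-in for the generated projection loop.
   * When `checkNull` is false the writer never looks at nulls and stores the
   * default payload 0, like the loop that only calls `write`.
   */
  def project(values: Seq[Option[Int]], checkNull: Boolean): Seq[Slot] =
    values.map {
      case None if checkNull => Slot(isNull = true, payload = 0)
      case v                 => Slot(isNull = false, payload = v.getOrElse(0))
    }

  // Map values of 3 -> 6, 7 -> null.
  val values: Seq[Option[Int]] = Seq(Some(6), None)

  // The expression result is projected with its declared (non-nullable) type.
  val actual: Seq[Slot] = project(values, checkNull = false)
  // Without the PR, `expected` goes through the same non-nullable path.
  val expectedBefore: Seq[Slot] = project(values, checkNull = false)
  // With the PR, `expected` is projected as nullable, so the null bit survives.
  val expectedAfter: Seq[Slot] = project(values, checkNull = true)
}
```

In this model `actual` equals `expectedBefore`, so the comparison cannot notice the null, while `actual` differs from `expectedAfter`, which is the mismatch that makes the test fail.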

How was this patch tested?

Existing UTs

maropu · 2018-09-10T02:14:25Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

Why did you remove the existing test unsafeRow != expectedRow?

Thank you, it was used for debugging.

SparkQA · 2018-09-10T02:19:22Z

Test build #95853 has finished for PR 22375 at commit 48e47b2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-10T07:05:02Z

Test build #95861 has finished for PR 22375 at commit 51aa9d5.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-09-10T16:35:35Z

retest this please

SparkQA · 2018-09-10T23:18:18Z

Test build #95886 has finished for PR 22375 at commit 51aa9d5.

This patch fails from timeout after a configured wait of `400m`.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-09-11T02:13:46Z

retest this please

SparkQA · 2018-09-11T06:11:23Z

Test build #95911 has finished for PR 22375 at commit 51aa9d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-09-11T13:49:43Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

Add some comments explaining it? Not so straightforward.

Sure, I will add some comments from the description

viirya · 2018-09-11T13:58:19Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

Can we add a test for this change?

Thank you for your comment. Which kind of test do you think is preferable?

A test that passes (we already have these, since all existing UTs pass)

A test that is expected to fail (in other words, add a function that generates an incorrect result)

I'd like to add a test that fails without this patch.

Since this patch lets us detect incorrect results, I added a test that passed w/o this patch and fails with this patch.

oh, yes. this is going to detect such failure.

SparkQA · 2018-09-13T11:40:41Z

Test build #96034 has finished for PR 22375 at commit 04a2988.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class MapIncorrectDataTypeExpression() extends LeafExpression with CodegenFallback

kiszk · 2018-09-13T12:43:14Z

cc @cloud-fan @mgaido91

mgaido91 · 2018-09-13T13:23:24Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelperSuite.scala

what happens if here you put Map(3 -> 7, 6 -> -1)?

Here is the output. The test fails because the mismatch is already detected in codegen-off mode.

"Incorrect evaluation (codegen off): mapincorrectdatatypeexpression(), actual: keys: [3,6], values: [7,null], expected: keys: [3,6], values: [7,-1]" did not contain "Incorrect evaluation in unsafe mode"
ScalaTestFailureLocation: org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelperSuite$$anonfun$4 at (ExpressionEvalHelperSuite.scala:44)
org.scalatest.exceptions.TestFailedException: "Incorrect evaluation (codegen off): mapincorrectdatatypeexpression(), actual: keys: [3,6], values: [7,null], expected: keys: [3,6], values: [7,-1]" did not contain "Incorrect evaluation in unsafe mode"
  at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
  ...

I see, so the example above passes in codegen off and fails with codegen on with this fix, while using Map(3 -> 7, 6 -> -1) passes codegen on and fails codegen off, am I right?

What I am thinking about (but I have not yet found a working implementation) is this: since the problem arises when we say we expect null in a non-nullable datatype, can we add such a check? I mean, instead of pretending the expected value is nullable, can't we add a check, when it is not nullable, to make sure that it does not contain null? I think it would be better, because we would be able to distinguish a failure caused by a bad test, i.e. a test written wrongly, from a UT failure caused by a bug in what we are testing. What do you think?

Here is a summary.

Map(3 -> 7, 6 -> null) passes in codegen off and fails in codegen on with this fix

Map(3 -> 7, 6 -> -1) fails in codegen off and fails in codegen on

Would it be possible to share examples of two cases that you think we would be able to distinguish?

yes, thanks for the summary, it states more clearly what I thought.

My point is that this fix works properly only when we test both codegen on and off, but it would fail to detect the error condition it claims to fix if only one of them (for any reason) is tested. So I am wondering if it is possible to perform a check on the expected value, instead of this fix. Something like:

assert(containsNull(expected) && isNullable(expression.dataType))

where containsNull and isNullable have to be defined properly. In this way we should fail properly independently from whether codegen is on or not. And we can also give a more clear hint in the error message about the problem being most likely a bad UT.

Even if we make the check recursive, I think that this case cannot be detected. This is because the mismatch occurs in a different recursive path.

Would it be possible to share a case in other places where we distinguished a wrong output from a badly written UT, as you proposed?

Yes, I said that the suggestion above is wrong and needs to be rewritten in a recursive way. Sorry for the bad suggestion, I just meant to show my idea. So it should be something like:

assert(!containsNullWhereNotNullable(expected, expression.dataType))

I may still not understand your motivation correctly. What is the motivation for introducing this assertion?

The motivations are the 2 mentioned above. Basically, I am proposing the same suggestion @cloud-fan has just commented here

With some hints from @ueshin, this PR implemented the check of null value with nullable bit in checkResult().
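
A minimal sketch of the recursive check discussed in this thread, assuming a simplified type model. `DType`, `IntT`, `ArrayT`, `MapT`, `StructT`, and `FieldT` are hypothetical stand-ins that only mirror the nullability flags carried by Spark's `ArrayType` (`containsNull`), `MapType` (`valueContainsNull`), and `StructField` (`nullable`). Struct rows are modeled as plain `Seq` values. This is an illustration of the idea, not the code merged in this PR.

```scala
// Hypothetical miniature type model (not Spark's DataType hierarchy).
sealed trait DType
case object IntT extends DType
final case class ArrayT(element: DType, containsNull: Boolean) extends DType
final case class MapT(key: DType, value: DType, valueContainsNull: Boolean) extends DType
final case class FieldT(dataType: DType, nullable: Boolean)
final case class StructT(fields: Seq[FieldT]) extends DType

object NullabilityCheck {
  /** True if `value` holds a null at a position that the type declares non-nullable. */
  def containsNullWhereNotNullable(value: Any, dataType: DType, nullable: Boolean): Boolean =
    (value, dataType) match {
      // A null is only a problem where the enclosing position is non-nullable.
      case (null, _) => !nullable
      case (seq: Seq[_], ArrayT(elem, containsNull)) =>
        seq.exists(x => containsNullWhereNotNullable(x, elem, containsNull))
      case (map: Map[_, _], MapT(keyT, valueT, valueContainsNull)) =>
        map.exists { case (k, v) =>
          // Map keys are treated as never nullable in this sketch.
          containsNullWhereNotNullable(k, keyT, nullable = false) ||
          containsNullWhereNotNullable(v, valueT, valueContainsNull)
        }
      case (row: Seq[_], StructT(fields)) =>
        row.zip(fields).exists { case (v, f) =>
          containsNullWhereNotNullable(v, f.dataType, f.nullable)
        }
      case _ => false
    }
}
```

With this shape, the example from the description, `Map(3 -> 6, 7 -> null)` checked against a map type with `valueContainsNull = false`, is flagged regardless of whether codegen is on or off.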

cloud-fan · 2018-09-21T10:27:35Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

I think a more straightforward approach is, validate the expected according to the nullability of the given expression.

SparkQA · 2018-09-28T08:49:58Z

Test build #96741 has finished for PR 22375 at commit f8d3aeb.

This patch fails to build.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2018-09-28T09:15:38Z

Test build #96742 has finished for PR 22375 at commit be28ab3.

This patch fails to build.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2018-09-28T09:45:00Z

Test build #96745 has finished for PR 22375 at commit 33e589d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-28T14:01:21Z

Test build #96746 has finished for PR 22375 at commit 9ef335d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-10-03T16:03:07Z

ping @cloud-fan @mgaido91

mgaido91 · 2018-10-04T09:48:58Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

+      exprNullable: Boolean): Boolean = {
    val dataType = UserDefinedType.sqlType(exprDataType)

+    assert(result != null || exprNullable)


Can we add a description which is more clear about the issue? Something like: The result is null for a non-nullable expression?

Sure, I will add the description

We should add a message to the assert, e.g. assert(result != null || exprNullable, "xxx")

mgaido91 · 2018-10-04T09:49:12Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

        val input = if (inputRow == EmptyRow) "" else s", input: $inputRow"

+        val dataType = expression.dataType
+        if (!checkResult(unsafeRow.get(0, dataType), expected, dataType, expression.nullable)) {


why did you add this?

This is because this statement checks consistency between expression and its nullable, as you proposed.

mmmh, I am not sure about this. Do we then still need the code below? Seems to me we are checking the same thing twice, please correct me if I am wrong.

We check different properties in these two if statements.

Line 231 checks consistency between value and nullable in expected

Line 245 checks bit-wise value between expected and expression

yes, I just meant that here we are checking the result and we are doing the same after too. Shouldn't we just add an assert for unsafeRow.get(0, dataType) != null || expression.nullable here instead?

1. sees only expected. 2. sees expected and expression. Thus, the two checks are different.

At 1, as we discussed, we need to check the consistency recursively. IIUC, unsafeRow.get(0, dataType) != null || expression.nullable does not perform checks recursively. Am I misunderstanding something?

does not perform checks recursively

good point, I was not considering it. Then, do we need the check at https://github.com/apache/spark/pull/22375/files/9ef335d6e43a6ef7d253d0ed3564f95bd0278f71#diff-41747ec3f56901eb7bfb95d2a217e94dL231? Isn't it performed in checkResult?

I think checkResult already validates expression according to the nullability of the given expression at 1. Thus, if the expected is not correct, 2. will detect an incorrect point.

cloud-fan · 2018-10-05T07:58:08Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

    val expected = UTF8String.fromString("abc")

-    if (!checkResult(actual.head, expected, expressions.head.dataType)) {
+    if (!checkResult(actual.head, expected, expressions.head.dataType, expressions.head.nullable)) {


It's a little weird to ask the caller to provide both expected and exprNullable, and then use exprNullable to validate expected. Can we set a default value for exprNullable in checkResult?

That is another option I considered. On the other hand, a default value risks overlooking a possible inconsistency between the value and nullable at the top level of expected.

Would we use the default value at all callers of checkResult?

maybe we should provide an overload of checkResult that takes Expression, which provides dataType and nullable
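
A rough sketch of what such an overload could look like, under stated assumptions. `Expr` is a hypothetical minimal stand-in for a Catalyst expression (with `dataType` reduced to a string), the equality check is simplified to `==`, and the assertion message borrows the wording suggested earlier in this review. It is not the signature that was merged.

```scala
object CheckResultOverloadSketch {
  // Hypothetical stand-in for an expression: only the two properties the check needs.
  final case class Expr(dataType: String, nullable: Boolean)

  // Explicit form: the caller passes the type and the nullable flag separately.
  def checkResult(result: Any, expected: Any, dataType: String, exprNullable: Boolean): Boolean = {
    assert(result != null || exprNullable, "The result is null for a non-nullable expression")
    result == expected // simplified comparison for the sketch
  }

  // Overload: derive both the type and the flag from the expression itself,
  // so callers cannot pass a flag that disagrees with the expression.
  def checkResult(result: Any, expected: Any, expr: Expr): Boolean =
    checkResult(result, expected, expr.dataType, expr.nullable)
}
```

The design point is that the overload removes the extra argument from call sites without introducing a default value that could hide a top-level inconsistency.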

cloud-fan · 2018-10-11T11:39:30Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

  /**
   * Check the equality between result of expression and expected value, it will handle
-   * Array[Byte], Spread[Double], MapData and Row.
+   * Array[Byte], Spread[Double], MapData and Row. Also check whether exprNullable is true


the comment doesn't match the method now

SparkQA · 2018-10-11T14:26:43Z

Test build #97248 has finished for PR 22375 at commit edc3d7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-11T21:12:36Z

Test build #97275 has finished for PR 22375 at commit 5f84e80.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-12T03:14:49Z

thanks, merging to master!

Closes apache#22375 from kiszk/SPARK-25388.
Authored-by: Kazuaki Ishizaki
Signed-off-by: Wenchen Fan

…es that can lead to null result

### What changes were proposed in this pull request?

Add document to `ExpressionEvalHelper`, and ask people to explore all the cases that can lead to null results (including null in struct fields, array elements and map values).

This PR also fixes `ComplexTypeSuite.GetArrayStructFields` to explore all the null cases.

### Why are the changes needed?

It happened several times that we hit correctness bugs caused by wrong expression nullability. When writing unit tests, we usually don't test the nullability flag directly, and it's too late to add such tests for all expressions.

In #22375, we extended the expression test framework, which checks the nullability flag when the expected result/field/element is null. This requires the test cases to explore all the cases that can lead to null results.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

I reverted 5d296ed locally, and `ComplexTypeSuite` can catch the bug.

Closes #29493 from cloud-fan/small.
Authored-by: Wenchen Fan
Signed-off-by: Takeshi Yamamuro

kiszk changed the title ~~[WIP][SPARK-25388][Test] Detect incorrect nullable of DataType in the result~~ [WIP][SPARK-25388][Test][SQL] Detect incorrect nullable of DataType in the result Sep 10, 2018

maropu reviewed Sep 10, 2018

viirya reviewed Sep 11, 2018

kiszk changed the title ~~[WIP][SPARK-25388][Test][SQL] Detect incorrect nullable of DataType in the result~~ [SPARK-25388][Test][SQL] Detect incorrect nullable of DataType in the result Sep 13, 2018

mgaido91 reviewed Sep 13, 2018

cloud-fan reviewed Sep 21, 2018

kiszk added 4 commits September 28, 2018 18:16

add consistency check with nullable into checkResult

11379e0

revert unexpected change

fc987aa

compare unsafeRow with expected using checkResult

884fd80

add newline

33e589d

kiszk force-pushed the SPARK-25388 branch from be28ab3 to 33e589d September 28, 2018 09:27

fix build failure

9ef335d

mgaido91 reviewed Oct 4, 2018

cloud-fan reviewed Oct 5, 2018

address review comments

edc3d7c

cloud-fan reviewed Oct 11, 2018

address review comments

5f84e80

asfgit closed this in c9d7d83 Oct 12, 2018

cloud-fan mentioned this pull request Aug 21, 2020

[SPARK-32669][SQL][TEST] Expression unit tests should explore all cases that can lead to null result #29493

Closed
