@bersprockets
Contributor

What changes were proposed in this pull request?

Backport of #36903

Change Inline.eval to return a row of null values rather than a null row in the case of a null input struct.

Why are the changes needed?

Consider the following query:

set spark.sql.codegen.wholeStage=false;
select inline(array(named_struct('a', 1, 'b', 2), null));

This query fails with a NullPointerException:

22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)

(In Spark 3.1.x, you don't need to set spark.sql.codegen.wholeStage to false to reproduce the error, since Spark 3.1.x has no codegen path for Inline.)

The following query fails regardless of the setting of spark.sql.codegen.wholeStage:

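// Run in spark-shell, where spark.implicits._ (needed for toDF) is already in scope.
// Build a one-row DataFrame with 99 integer columns, col1 through col99.
val dfWide = (Seq(1)
  .toDF("col0")
  .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*))

// Add an array column whose second element is a null struct.
val df = (dfWide
  .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as struct_array"))

// Inlining the array triggers the NullPointerException.
df.selectExpr("*", "inline(struct_array)").collect()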

It fails with:

22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown Source)

When Inline.eval returns a null row in the collection, GenerateExec hits a NullPointerException either when it joins the null row with the required child output or when it projects the null row.

This PR avoids producing the null row and produces a row of null values instead:

spark-sql> set spark.sql.codegen.wholeStage=false;
spark.sql.codegen.wholeStage	false
Time taken: 3.095 seconds, Fetched 1 row(s)
spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
1	2
NULL	NULL
Time taken: 1.214 seconds, Fetched 2 row(s)
spark-sql>
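
The difference between a null row and a row of null values can be sketched in plain Scala, without Spark. This is only a simplified model under stated assumptions: the `Row` alias and the `inlineOld`, `inlineNew`, and `isNullAt` helpers are hypothetical names made up for illustration, and they are not Spark's `InternalRow`, `Inline.eval`, or `GenerateExec` code.

```scala
// Hypothetical model: a row is an array of nullable column values.
// This is NOT Spark's InternalRow; it only illustrates the idea.
object InlineNullSketch {
  type Row = Array[Any]

  // Modeled old behavior: a null struct is passed through as a null row.
  def inlineOld(structs: Seq[Row]): Seq[Row] = structs

  // Modeled new behavior: a null struct becomes a row of `numFields` nulls.
  def inlineNew(structs: Seq[Row], numFields: Int): Seq[Row] =
    structs.map(s => if (s == null) Array.fill[Any](numFields)(null) else s)

  // Models a downstream consumer (such as a projection) that reads a column.
  def isNullAt(row: Row, i: Int): Boolean = row(i) == null

  def main(args: Array[String]): Unit = {
    val input: Seq[Row] = Seq(Array[Any](1, 2), null)

    // Every row produced by inlineNew can be inspected column by column.
    val fixed = inlineNew(input, numFields = 2)
    println(fixed.map(r => r.indices.map(i => isNullAt(r, i)).mkString(",")))

    // Reading a column of the null row from inlineOld throws, captured here by Try.
    val failed = scala.util.Try(isNullAt(inlineOld(input)(1), 0)).isFailure
    println(failed)
  }
}
```

In this model the consumer never needs a null check on the row itself, which mirrors why the change is made in the generator rather than in every caller.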

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

Closes apache#36903 from bersprockets/inline_eval_null_struct_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@github-actions github-actions bot added the SQL label Jun 21, 2022
@HyukjinKwon
Member

Merged to branch-3.1.

HyukjinKwon pushed a commit that referenced this pull request Jun 21, 2022
Closes #36949 from bersprockets/inline_eval_null_struct_issue_31.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@bersprockets bersprockets deleted the inline_eval_null_struct_issue_31 branch August 10, 2022 18:09