
@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Sep 18, 2019

What changes were proposed in this pull request?

This patch fixes an issue introduced by SPARK-21870: when generating code for a parameter type, the array case of javaType is not handled. There is at least one such case: Spark should generate the parameter type for BinaryType as byte[], but it generates [B instead, so the generated code fails to compile.

Below is the generated code that failed compilation (line 380):

/* 380 */   private void agg_doAggregate_count_0([B agg_expr_1_1, boolean agg_exprIsNull_1_1, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_1) throws java.io.IOException {
/* 381 */     // evaluate aggregate function for count
/* 382 */     boolean agg_isNull_26 = false;
/* 383 */     long agg_value_28 = -1L;
/* 384 */     if (!false && agg_exprIsNull_1_1) {
/* 385 */       long agg_value_31 = agg_unsafeRowAggBuffer_1.getLong(1);
/* 386 */       agg_isNull_26 = false;
/* 387 */       agg_value_28 = agg_value_31;
/* 388 */     } else {
/* 389 */       long agg_value_33 = agg_unsafeRowAggBuffer_1.getLong(1);
/* 390 */
/* 391 */       long agg_value_32 = -1L;
/* 392 */
/* 393 */       agg_value_32 = agg_value_33 + 1L;
/* 394 */       agg_isNull_26 = false;
/* 395 */       agg_value_28 = agg_value_32;
/* 396 */     }
/* 397 */     // update unsafe row buffer
/* 398 */     agg_unsafeRowAggBuffer_1.setLong(1, agg_value_28);
/* 399 */   }
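The root cause is how the JVM names array classes: for an array type, Class.getName returns the internal descriptor ("[B" for byte[]), which is not valid Java source when spliced in as a parameter type. A minimal self-contained sketch of the repair (illustrative, not the exact patch in this PR) builds the source-level name recursively:

```scala
object TypeNameDemo {
  // Class.getName yields JVM descriptors for array classes ("[B", "[[J", ...),
  // which do not compile when used as a type in generated Java source.
  // Recursively unwrapping the component type produces the source-level
  // name instead ("byte[]", "long[][]").
  def typeName(clazz: Class[_]): String =
    if (clazz.isArray) typeName(clazz.getComponentType) + "[]"
    else clazz.getName

  def main(args: Array[String]): Unit = {
    println(classOf[Array[Byte]].getName)   // prints "[B"
    println(typeName(classOf[Array[Byte]])) // prints "byte[]"
  }
}
```

Per the review discussion below, a helper along these lines belongs in CodeGenerator so all codegen paths share one source-level type-name function.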

There was no test for HashAggregateExec that specifically covered this, but the randomized test in ObjectHashAggregateSuite could hit it, which is why ObjectHashAggregateSuite is flaky.

Why are the changes needed?

Without the fix, generated code from HashAggregateExec may fail compilation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new unit test. Without the fix, the newly added test fails.

@HeartSaVioR
Contributor Author

cc. @maropu @cloud-fan @viirya

@HeartSaVioR
Contributor Author

HeartSaVioR commented Sep 18, 2019

Btw, I had to modify some code locally to capture the generated code for the failing case. We currently log generated code at INFO level, but for CI builds it might be better to log it at ERROR level when a test fails for this reason.
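A sketch of what that suggestion could look like (the compile hook and log callback here are hypothetical stand-ins, not Spark's actual codegen or logging API): keep generated source at INFO on success, but emit it at ERROR when compilation fails so the CI log carries enough context to diagnose the failure.

```scala
object CodegenLogging {
  // Hypothetical compile step: returns an error message on failure.
  // (Stands in for the real compilation of generated code inside Spark.)
  def compile(source: String): Either[String, Unit] =
    if (source.contains("[B ")) Left("invalid parameter type '[B'")
    else Right(())

  // Log the generated source at ERROR only when compilation fails, so a
  // CI build failure includes the code needed to reproduce and debug it.
  def compileWithLogging(source: String, logError: String => Unit): Boolean =
    compile(source) match {
      case Right(()) => true
      case Left(err) =>
        logError(s"codegen compilation failed: $err\n$source")
        false
    }
}
```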

import testImplicits._

test("SPARK-29140 HashAggregateExec aggregating binary type doesn't break codegen compilation") {
val withDistinct = countDistinct($"c1")
Member

Move to AggregationQuerySuite?

Contributor Author

Thanks, I didn't know that suite existed. Will move.

.groupBy($"id" % 10 as "group")
.agg(withDistinct)
.orderBy("group")
aggDf.collect().toSeq
Member

Please check the result.

""".stripMargin
}

private def typeNameForCodegen(clazz: Class[_]): String = {
Member

It might be better to move this helper function to CodeGenerator.

Member

and just call typeName?

Contributor Author

Nice suggestions in both comments! Will address.

@maropu
Member

maropu commented Sep 18, 2019

In the PR title, array types is more obvious than binary types?

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110908 has finished for PR 25830 at commit 9c76557.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

array types is more obvious than binary types?

array type is ArrayData, which doesn't have this issue.

@HeartSaVioR
Contributor Author

array types is more obvious than binary types?

array type is ArrayData, which doesn't have this issue.

Yes, exactly. I suspected ArrayType at first, but realized its javaType is not a Java array type. So for now, only BinaryType hits the issue, if I understand correctly. Maybe "array types in javaType" would be more obvious.

@HeartSaVioR HeartSaVioR changed the title [SPARK-29140][SQL] Handle BinaryType of parameter properly in HashAggregateExec [SPARK-29140][SQL] Handle parameters having "array" of javaType properly in HashAggregateExec Sep 18, 2019
@HeartSaVioR
Contributor Author

Updated; please take another round of review. Thanks!

val withDistinct = countDistinct($"c1")

val schema = new StructType().add("c1", BinaryType, nullable = true)
val schemaWithId = StructType(StructField("id", IntegerType, nullable = false) +: schema.fields)
Member

@viirya viirya Sep 18, 2019

nit: curious why you don't just have schema? schema is not used anywhere else.

Contributor Author

@HeartSaVioR HeartSaVioR Sep 18, 2019

Ah, I missed that. I copied the test code from ObjectHashAggregateSuite (since the test actually failed there) and tried to minimize the reproduction and clean it up, but missed this. Thanks for pointing it out!

@maropu
Member

maropu commented Sep 18, 2019

There wasn't any test for HashAggregateExec specifically testing this, but randomized test in ObjectHashAggregateSuite could encounter this and that's why ObjectHashAggregateSuite is flaky.

Oh... that's my fault. Anyway, nice catch!

@maropu
Member

maropu commented Sep 18, 2019

btw, agg_expr_1_1 is not used inside the aggregation function. Actually, can aggregation functions in HashAggregateExec only process fixed-length typed data and decimal values?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L1126
If so, I think we should filter these complex types out of the arguments of aggregate functions.

@viirya
Member

viirya commented Sep 18, 2019

btw, agg_expr_1_1 is not used inside the aggregation function. Actually, in HashAggregateExec, aggregation functions can only process fixed-length typed data and decimal values?

Oh, this is a good point. I think it is only stated that an aggregate function's buffer attributes must be of certain types. Aggregate functions can still have inputs of complex data types?

However, I don't know if there are existing aggregate functions in Spark that take complex inputs but use buffer attributes that are acceptable for HashAggregateExec.

I just think that, in theory, we might have such aggregate functions?

@HeartSaVioR
Contributor Author

Btw, I submitted patch #25835 as a follow-up to my own comment: #25830 (comment)

Please review if it makes sense. Thanks!

@HeartSaVioR
Contributor Author

And another possible improvement for the randomized test in ObjectHashAggregateSuite: how about logging the selected parameters at WARN/ERROR, or including them in the hint message on the assert? I actually had to modify the test to rerun the test code until it failed, since some information is provided in the test name but other details, like the schema, are not.
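A minimal sketch of the assert-hint idea (names and structure here are illustrative, not the actual ObjectHashAggregateSuite code, which picks its schemas and functions randomly per run): attach the chosen parameters to the assertion message so a flaky failure is reproducible from the CI log alone.

```scala
object RandomizedClue {
  // Illustrative stand-ins for parameters a randomized test might choose.
  final case class Params(seed: Long, schemaDdl: String, numRows: Int)

  // Render the chosen parameters as a single diagnostic string.
  def clue(p: Params): String =
    s"seed=${p.seed}, schema=[${p.schemaDdl}], numRows=${p.numRows}"

  // Fold the clue into any assertion failure so the failing configuration
  // can be reconstructed without rerunning the test locally.
  def checkWithClue(p: Params)(condition: Boolean): Unit =
    assert(condition, s"randomized test failed with ${clue(p)}")
}
```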

@maropu
Member

maropu commented Sep 19, 2019

Oh, this is good point. I think it is only said that aggregate function's buffer attributes can only be in certain types. Aggregate functions still can have inputs of complex data type?
However I don't know if there are existing aggregate functions in Spark take complex inputs but use buffer attributes that are acceptable for HashAggregateExec.
Just think in theory we might have such aggregate functions?

Might be so (IIUC we have no restriction about that; buffers should have the same type as the input data). So, as a safeguard, how about turning off the split mode instead of forcibly passing complex data into the split functions?

@SparkQA

SparkQA commented Sep 19, 2019

Test build #110938 has finished for PR 25830 at commit 7c66afc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2019

Test build #110934 has finished for PR 25830 at commit 5ffa7e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Just to determine the next action: would we want to include the newer discussion (which @viirya and @maropu are having) in the scope of this PR?

@viirya
Member

viirya commented Sep 20, 2019

As I said, I think it is possible for an aggregate function to access complex input data, like an array, while using a buffer attribute that is supported by HashAggregateExec.

If we just filter complex data types out, the split function for such an aggregate function won't work.

So currently this looks good to me.

I am not sure if we want to turn off split mode just because of an array argument, as @maropu suggested. cc @cloud-fan

@cloud-fan
Contributor

I think the added test demonstrates well that there are agg functions that take complex input and use simple buffers.

val emptyRows = spark.sparkContext.parallelize(Seq.empty[Row], 1)
val aggDf = spark.createDataFrame(emptyRows, schema)
.groupBy($"id" % 10 as "group")
.agg(withDistinct)
Contributor

nit: we can simply put countDistinct($"c1") here.

val aggDf = spark.createDataFrame(emptyRows, schema)
.groupBy($"id" % 10 as "group")
.agg(withDistinct)
.orderBy("group")
Member

Do we need .orderBy for this test?

Contributor Author

That can be removed. Will remove.

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111059 has finished for PR 25830 at commit ce2b17f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 20, 2019

retest this please

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111065 has finished for PR 25830 at commit ce2b17f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("SPARK-29122: hash-based aggregates for unfixed-length decimals in the interpreter mode") {
withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false",
SQLConf.CODEGEN_FACTORY_MODE.key -> CodegenObjectFactoryMode.NO_CODEGEN.toString) {
SQLConf.CODEGEN_FACTORY_MODE.key -> CodegenObjectFactoryMode.NO_CODEGEN.toString) {
Member

Actually, I think the previous indentation is correct...

Contributor Author

Yeah, the IDE automatically re-indented it while I was fixing conflicts. Will roll back.

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111089 has finished for PR 25830 at commit 28726da.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-29140][SQL] Handle parameters having "array" of javaType properly in HashAggregateExec [SPARK-29140][SQL] Handle parameters having "array" of javaType properly in splitAggregateExpressions Sep 20, 2019
@maropu
Member

maropu commented Sep 20, 2019

I modified the title a bit.

@SparkQA

SparkQA commented Sep 21, 2019

Test build #111096 has finished for PR 25830 at commit 4c00a2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu closed this in f7cc695 Sep 21, 2019
@maropu
Member

maropu commented Sep 21, 2019

Thanks, all! Merged to master.

@HeartSaVioR
Contributor Author

Thanks all for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-29140 branch September 21, 2019 08:07