[SPARK-29140][SQL] Handle parameters having "array" of javaType properly in splitAggregateExpressions #25830
Conversation
cc. @maropu @cloud-fan @viirya

Btw, I had to fix some code locally to get the generated code for the failing case. We are logging generated code via
```scala
import testImplicits._

test("SPARK-29140 HashAggregateExec aggregating binary type doesn't break codegen compilation") {
  val withDistinct = countDistinct($"c1")
```
Move to AggregationQuerySuite?
Thanks, I didn't know that suite existed. Will move.
```scala
  .groupBy($"id" % 10 as "group")
  .agg(withDistinct)
  .orderBy("group")
aggDf.collect().toSeq
```
plz check the result.
```scala
  """.stripMargin
}

private def typeNameForCodegen(clazz: Class[_]): String = {
```
It might be better to move this helper function to CodeGenerator.
and just call typeName?
Nice suggestions, both of them! Will address.
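For illustration, here is a hedged sketch of what such a helper could look like, written in plain Java (the real Spark helper is in Scala inside `CodeGenerator`; the class and method names below are hypothetical). The point is that `Class.getName()` returns the JVM descriptor `[B` for `byte[]`, which is not valid in generated Java source, so array types have to be rebuilt recursively from the component type:

```java
// Hypothetical sketch, not Spark's actual implementation.
public class TypeNames {
    static String typeNameForCodegen(Class<?> clazz) {
        if (clazz.isArray()) {
            // e.g. byte[] -> "byte[]", byte[][] -> "byte[][]"
            return typeNameForCodegen(clazz.getComponentType()) + "[]";
        }
        return clazz.getName();
    }

    public static void main(String[] args) {
        System.out.println(typeNameForCodegen(byte[].class));   // byte[]
        System.out.println(typeNameForCodegen(byte[][].class)); // byte[][]
        System.out.println(typeNameForCodegen(String.class));   // java.lang.String
    }
}
```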
In the PR title, wouldn't array types be more obvious than binary types?

Test build #110908 has finished for PR 25830 at commit

> array type is

Yes exactly. I suspected ArrayType at first, and realized its javaType is not an array. So for now, only BinaryType hits the issue, if I understand correctly. Maybe
Force-pushed e0a92a5 to 5ffa7e7.
Updated, please take another round of review. Thanks!
```scala
val withDistinct = countDistinct($"c1")

val schema = new StructType().add("c1", BinaryType, nullable = true)
val schemaWithId = StructType(StructField("id", IntegerType, nullable = false) +: schema.fields)
```
nit: curious why you don't just have schema? schema is not used anywhere else.
Ah, I missed that. I copied the test code from ObjectHashAggregateSuite (as the test actually failed there) and tried to minimize and clean up the repro, but missed this. Thanks for pointing it out!

Oh... that's my fault... anyway, nice catch!
Oh, this is a good point. I think it is only stated that an aggregate function's buffer attributes can be of certain types; aggregate functions can still have inputs of complex data types. However, I don't know if any existing aggregate functions in Spark take complex inputs while using buffer attributes that are acceptable for HashAggregateExec. In theory, though, we might have such aggregate functions?
Btw, submitted a patch #25835 following up my own comment: #25830 (comment). Please review if it makes sense. Thanks!
And another possible improvement on the randomized test in ObjectHashAggregateSuite: how about logging the selected parameters as WARN/ERROR, or including them in the hint message on assert? Actually I had to modify the test to rerun the test code until it fails, as some information is provided in the test name but other details, like the schema, are not.
It might be so (IIUC we have no restriction about that; buffers can have the same type as the input data). So, as a safeguard, how about turning off the split mode instead of forcibly passing complex data into split functions?
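One alternative floated above is to disable splitting as a safeguard rather than pass complex data into split functions. A minimal sketch of such a guard, in plain Java with hypothetical names (this is the suggested alternative, not the fix that was merged — the merged fix handles array type names properly instead):

```java
import java.util.Arrays;

// Hypothetical safeguard: before splitting aggregate expressions into
// separate methods, check that every parameter's Java class can be spelled
// in generated source via getName(). Array classes (e.g. byte[] for
// BinaryType) would render as "[B", so fall back to the non-split path.
public class SplitGuard {
    static boolean canSplitAggregateExpressions(Class<?>[] paramTypes) {
        return Arrays.stream(paramTypes).noneMatch(Class::isArray);
    }

    public static void main(String[] args) {
        System.out.println(canSplitAggregateExpressions(
            new Class<?>[]{long.class, double.class})); // true
        System.out.println(canSplitAggregateExpressions(
            new Class<?>[]{byte[].class}));             // false
    }
}
```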
Test build #110938 has finished for PR 25830 at commit

Test build #110934 has finished for PR 25830 at commit

As I said, I think it is possible that an aggregate function accesses complex input like an array but uses a buffer attribute that is supported by HashAggregateExec. If you just filter complex data types out, the split function for such an aggregate function won't work. So currently this looks good to me. I am not sure we want to turn off split mode just because of an array argument, as @maropu suggested. cc @cloud-fan

I think the added test demonstrates well that there are agg functions that take complex input and use simple buffers.
```scala
val emptyRows = spark.sparkContext.parallelize(Seq.empty[Row], 1)
val aggDf = spark.createDataFrame(emptyRows, schema)
  .groupBy($"id" % 10 as "group")
  .agg(withDistinct)
```
nit: we can simply put countDistinct($"c1") here.
```scala
val aggDf = spark.createDataFrame(emptyRows, schema)
  .groupBy($"id" % 10 as "group")
  .agg(withDistinct)
  .orderBy("group")
```
Do we need .orderBy for this test?
That can be removed. Will remove.
Test build #111059 has finished for PR 25830 at commit

retest this please

Test build #111065 has finished for PR 25830 at commit
Force-pushed ce2b17f to 28726da.
```scala
test("SPARK-29122: hash-based aggregates for unfixed-length decimals in the interpreter mode") {
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false",
    SQLConf.CODEGEN_FACTORY_MODE.key -> CodegenObjectFactoryMode.NO_CODEGEN.toString) {
```
Actually, I think the previous indentation is correct...
Yeah, the IDE automatically re-indented it while I was fixing conflicts. Will roll back.
Test build #111089 has finished for PR 25830 at commit

I modified the title a bit.

Test build #111096 has finished for PR 25830 at commit

Thanks, all! Merged to master.

Thanks all for reviewing and merging!
What changes were proposed in this pull request?
This patch fixes an issue introduced by SPARK-21870: when generating code for a parameter type, Spark doesn't consider array types in javaType. At least one such type exists: Spark should generate the parameter type for BinaryType as `byte[]`, but it generates `[B`, and the generated code fails compilation. Below is the generated code which failed compilation (Line 380):

There wasn't any test for HashAggregateExec specifically covering this, but the randomized test in ObjectHashAggregateSuite could encounter it, which is why ObjectHashAggregateSuite is flaky.
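The root cause can be reproduced with plain Java reflection: `Class.getName()` on an array class returns the JVM descriptor rather than a source-level type name, while `getCanonicalName()` returns the form that generated Java source needs. A minimal demonstration:

```java
public class BinaryTypeName {
    public static void main(String[] args) {
        // getName() yields the JVM descriptor for array classes.
        System.out.println(byte[].class.getName());          // [B
        // getCanonicalName() yields a name usable in source code.
        System.out.println(byte[].class.getCanonicalName()); // byte[]
        // A generated parameter declared as "[B value" fails to compile;
        // "byte[] value" is what the generated method needs.
    }
}
```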
Why are the changes needed?
Without the fix, generated code from HashAggregateExec may fail compilation.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a new UT. Without the fix, the newly added UT fails.