Commit b040a83
[SPARK-18186][SC-5406][BRANCH-2.1] Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
When evaluated in Spark SQL, Hive UDAFs do not support partial aggregation. This PR migrates `HiveUDAFFunction` to `TypedImperativeAggregate`, which already provides partial aggregation support for aggregate functions that may use arbitrary Java objects as aggregation states.
The following snippet shows the effect of this PR:
```scala
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax
sql(s"CREATE FUNCTION hive_max AS '${classOf[GenericUDAFMax].getName}'")
spark.range(100).createOrReplaceTempView("t")
// A query using both Spark SQL native `max` and Hive `max`
sql(s"SELECT max(id), hive_max(id) FROM t").explain()
```
Before this PR:
```
== Physical Plan ==
SortAggregate(key=[], functions=[max(id#1L), default.hive_max(default.hive_max, HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax@7475f57e), id#1L, false, 0, 0)])
+- Exchange SinglePartition
   +- *Range (0, 100, step=1, splits=Some(1))
```
After this PR:
```
== Physical Plan ==
SortAggregate(key=[], functions=[max(id#1L), default.hive_max(default.hive_max, HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax@5e18a6a7), id#1L, false, 0, 0)])
+- Exchange SinglePartition
   +- SortAggregate(key=[], functions=[partial_max(id#1L), partial_default.hive_max(default.hive_max, HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax@5e18a6a7), id#1L, false, 0, 0)])
      +- *Range (0, 100, step=1, splits=Some(1))
```
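To make the effect of the extra `partial_` stage concrete, here is a minimal sketch in plain Scala collections. It involves no Spark or Hive classes, and `partialMax` and `finalMax` are hypothetical helper names used only for illustration. The point is that each partition is reduced locally, so one value per partition crosses the exchange instead of every row.

```scala
// Illustrative only: plain Scala collections stand in for partitions.
object PartialAggSketch {
  // Partial stage: reduce one partition to at most one value.
  def partialMax(partition: Seq[Long]): Option[Long] =
    if (partition.isEmpty) None else Some(partition.max)

  // Final stage: merge the per-partition results after the shuffle.
  def finalMax(partials: Seq[Option[Long]]): Option[Long] =
    partials.flatten.reduceOption(_ max _)

  def main(args: Array[String]): Unit = {
    // Four hypothetical partitions holding the ids 0 until 100.
    val partitions: Seq[Seq[Long]] = (0L until 100L).grouped(25).toSeq
    val partials = partitions.map(partialMax)
    println(s"values shuffled: ${partials.size}, max: ${finalMax(partials)}")
  }
}
```

Without the partial stage, all 100 rows would have to reach the single final aggregate, which is what the "Before this PR" plan does for the Hive UDAF.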
The tricky part of the PR is mostly about updating and passing around the aggregation states of `HiveUDAFFunction`s, since the aggregation state of a Hive UDAF may appear in three different forms. Let's take the test `MockUDAF` added in this PR as an example. This UDAF computes the count of non-null values together with the count of nulls of a given column. Its aggregation state may appear in the following forms at different times:
1. A `MockUDAFBuffer`, which is a concrete subclass of `GenericUDAFEvaluator.AggregationBuffer`

   The form used by the Hive UDAF API. This form is required in the following scenarios:

   - Calling `GenericUDAFEvaluator.iterate()` to update an existing aggregation state with new input values.
   - Calling `GenericUDAFEvaluator.terminate()` to get the final aggregated value from an existing aggregation state.
   - Calling `GenericUDAFEvaluator.merge()` to merge other aggregation states into an existing aggregation state.

     The existing aggregation state to be updated must be in this form.

   Conversions:

   - To form 2:

     `GenericUDAFEvaluator.terminatePartial()`

   - To form 3:

     Convert to form 2 first, and then to 3.

2. An `Object[]` array containing two `java.lang.Long` values.

   The form used to interact with Hive's `ObjectInspector`s. This form is required in the following scenarios:

   - Calling `GenericUDAFEvaluator.terminatePartial()` to convert an existing aggregation state in form 1 to form 2.
   - Calling `GenericUDAFEvaluator.merge()` to merge other aggregation states into an existing aggregation state.

     The input aggregation state must be in this form.

   Conversions:

   - To form 1:

     No direct method. Create an empty `AggregationBuffer` and merge the form 2 state into it.

   - To form 3:

     `unwrapperFor()`/`unwrap()` methods of `HiveInspectors`

3. The byte array that holds data of an `UnsafeRow` with two `LongType` fields.

   The form used by Spark SQL to shuffle partial aggregation results. This form is required because `TypedImperativeAggregate` always asks its subclasses to serialize their aggregation states into a byte array.

   Conversions:

   - To form 1:

     Convert to form 2 first, and then to 1.

   - To form 2:

     `wrapperFor()`/`wrap()` methods of `HiveInspectors`
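
The conversions between the three forms can be modelled in a few lines of plain Scala. This is only a sketch under stated assumptions: it uses no Hive or Spark classes, `CountBuffer` is a hypothetical stand-in for `MockUDAFBuffer`, the method names merely mirror the `GenericUDAFEvaluator` calls listed above, and the 16-byte encoding is a simplification that is not the real `UnsafeRow` layout.

```scala
import java.nio.ByteBuffer

// A plain-Scala model of the three forms; no Hive or Spark classes are used.
object ThreeFormsSketch {
  // Form 1 stand-in: a mutable buffer playing the role of MockUDAFBuffer.
  final class CountBuffer(var nonNullCount: Long = 0L, var nullCount: Long = 0L)

  // Mirrors iterate(): update a form 1 state with one input value.
  def iterate(buf: CountBuffer, input: Any): Unit =
    if (input == null) buf.nullCount += 1 else buf.nonNullCount += 1

  // Mirrors terminatePartial(): form 1 -> form 2 (two boxed longs in an Object[]).
  def terminatePartial(buf: CountBuffer): Array[AnyRef] =
    Array[AnyRef](Long.box(buf.nonNullCount), Long.box(buf.nullCount))

  // Mirrors merge(): fold a form 2 state into an existing form 1 state.
  def merge(buf: CountBuffer, partial: Array[AnyRef]): Unit = {
    buf.nonNullCount += partial(0).asInstanceOf[java.lang.Long].longValue()
    buf.nullCount += partial(1).asInstanceOf[java.lang.Long].longValue()
  }

  // Form 2 -> form 1: no direct conversion, so merge into a fresh buffer.
  def toBuffer(partial: Array[AnyRef]): CountBuffer = {
    val buf = new CountBuffer()
    merge(buf, partial)
    buf
  }

  // Form 2 -> form 3 stand-in: 16 raw bytes (NOT the real UnsafeRow layout).
  def serialize(partial: Array[AnyRef]): Array[Byte] =
    ByteBuffer.allocate(16)
      .putLong(partial(0).asInstanceOf[java.lang.Long].longValue())
      .putLong(partial(1).asInstanceOf[java.lang.Long].longValue())
      .array()

  // Form 3 stand-in -> form 2.
  def deserialize(bytes: Array[Byte]): Array[AnyRef] = {
    val in = ByteBuffer.wrap(bytes)
    Array[AnyRef](Long.box(in.getLong()), Long.box(in.getLong()))
  }
}
```

A full round trip from form 1 to form 3 and back is then `toBuffer(deserialize(serialize(terminatePartial(buf))))`, which passes through form 2 in both directions, as the conversion notes above describe.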
Here are some micro-benchmark results from the most recent master and from this PR branch.
Master:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
w/o groupBy                                    339 / 372          3.1         323.2       1.0X
w/ groupBy                                     503 / 529          2.1         479.7       0.7X
```
This PR:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
w/o groupBy                                    116 / 126          9.0         110.8       1.0X
w/ groupBy                                     151 / 159          6.9         144.0       0.8X
```
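For reading the two tables side by side, the following sketch redoes the arithmetic in plain Scala. The helper names are hypothetical, and the row count `1 << 20` is taken from the benchmark code. Because the tables round best times to whole milliseconds, the recomputed Rate and Per Row values only approximately match the printed ones. Note that the Relative column compares rows within one run, not master against this PR.

```scala
object BenchmarkArithmetic {
  // Rows per iteration, as set by `val N = 1 << 20` in the benchmark code.
  val n: Long = 1L << 20

  // Millions of rows processed per second for a best time in milliseconds.
  def rateMillionsPerSec(bestMs: Double): Double = n / (bestMs / 1000.0) / 1e6

  // Nanoseconds spent per row for a best time in milliseconds.
  def perRowNs(bestMs: Double): Double = bestMs * 1e6 / n

  // How many times faster the new best time is than the old one.
  def speedup(beforeMs: Double, afterMs: Double): Double = beforeMs / afterMs

  def main(args: Array[String]): Unit = {
    println(f"w/o groupBy speedup: ${speedup(339, 116)}%.1f")
    println(f"w/ groupBy speedup:  ${speedup(503, 151)}%.1f")
  }
}
```

Using the best times above, this PR is roughly 2.9 times faster than master without `groupBy` and roughly 3.3 times faster with it.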
Benchmark code snippet:
```scala
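// Imports assumed for Spark 2.1; `$"id"` additionally needs `import sparkSession.implicits._`.
import scala.concurrent.duration._
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax
import org.apache.spark.util.Benchmark

test("Hive UDAF benchmark") {
  val N = 1 << 20
  sparkSession.sql(s"CREATE TEMPORARY FUNCTION hive_max AS '${classOf[GenericUDAFMax].getName}'")
  val benchmark = new Benchmark(
    name = "hive udaf vs spark af",
    valuesPerIteration = N,
    minNumIters = 5,
    warmupTime = 5.seconds,
    minTime = 5.seconds,
    outputPerIteration = true
  )
  // Global aggregation: a single group over all N rows.
  benchmark.addCase("w/o groupBy") { _ =>
    sparkSession.range(N).agg("id" -> "hive_max").collect()
  }
  // Grouped aggregation: ten groups keyed by id % 10.
  benchmark.addCase("w/ groupBy") { _ =>
    sparkSession.range(N).groupBy($"id" % 10).agg("id" -> "hive_max").collect()
  }
  benchmark.run()
  sparkSession.sql(s"DROP TEMPORARY FUNCTION IF EXISTS hive_max")
}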
```
New test suite `HiveUDAFSuite` is added.
Author: Cheng Lian <lian@databricks.com>
Closes apache#15703 from liancheng/partial-agg-hive-udaf.
Author: Cheng Lian <lian@databricks.com>
Closes apache#144 from yhuai/branch-2.1-hive-udaf.
1 parent eaf733d commit b040a83
2 files changed (+301 -50 lines) under sql/hive/src:
- main/scala/org/apache/spark/sql/hive
- test/scala/org/apache/spark/sql/hive/execution

Lines changed in the first file: 149 additions & 50 deletions