[SPARK-22643][SQL] ColumnarArray should be an immutable view#19842
cloud-fan wants to merge 2 commits into apache:master
Test build #84285 has finished for PR 19842 at commit
Jenkins, retest this please
Test build #84294 has finished for PR 19842 at commit
@cloud-fan TPCDS does not have nested data or arrays, so I think we have to redo the benchmarks. A simple micro benchmark that touches a few elements in the array should probably do it.
LGTM - pending benchmarks :)
-    resultArray.length = getArrayLength(rowId);
-    resultArray.offset = getArrayOffset(rowId);
-    return resultArray;
+    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
Would it be better to create the ColumnarArray for each rowId only once (e.g. by caching it)? I am curious whether we would see a performance overhead from creating a ColumnarArray each time we access elements of a multi-dimensional array (e.g. a[1][2] + a[1][3]).
I don't think that is a good idea. That would require us to keep an array of ColumnarArray around. That might mess with both GC and escape analysis. Let's just create a benchmark and check if we do not regress.
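To make the question about `a[1][2] + a[1][3]` concrete, here is a minimal sketch of how many view objects a nested access creates. It does not use Spark's classes: `IntArrayView`, `NestedArray`, and the `viewsCreated` counter are hypothetical stand-ins, and the sketch makes no claim about what Catalyst's generated code actually does.

```java
public class NestedViewSketch {
    // Counts view allocations so the two access styles can be compared.
    public static int viewsCreated = 0;

    /** Hypothetical immutable view over a slice of a flat int buffer. */
    static final class IntArrayView {
        private final int[] data;
        private final int offset;
        private final int length;

        IntArrayView(int[] data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
            viewsCreated++;
        }

        int numElements() { return length; }

        int getInt(int ordinal) { return data[offset + ordinal]; }
    }

    /** Hypothetical array of arrays: inner array i starts at offsets[i] and has lengths[i] elements. */
    static final class NestedArray {
        private final int[] data;
        private final int[] offsets;
        private final int[] lengths;

        NestedArray(int[] data, int[] offsets, int[] lengths) {
            this.data = data;
            this.offsets = offsets;
            this.lengths = lengths;
        }

        // Always returns a new view, mirroring the "new instance per call" approach.
        IntArrayView getArray(int ordinal) {
            return new IntArrayView(data, offsets[ordinal], lengths[ordinal]);
        }
    }

    // a = [[1, 2, 3, 4], [10, 20, 30, 40]]
    static NestedArray sample() {
        return new NestedArray(
            new int[] {1, 2, 3, 4, 10, 20, 30, 40},
            new int[] {0, 4},
            new int[] {4, 4});
    }

    /** a[1][2] + a[1][3], creating a fresh inner view for each term. */
    public static int sumNaive() {
        NestedArray a = sample();
        return a.getArray(1).getInt(2) + a.getArray(1).getInt(3);
    }

    /** Same expression with the a[1] view created once and shared by both terms. */
    public static int sumHoisted() {
        NestedArray a = sample();
        IntArrayView inner = a.getArray(1);
        return inner.getInt(2) + inner.getInt(3);
    }

    public static void main(String[] args) {
        viewsCreated = 0;
        int naive = sumNaive();
        System.out.println("naive:   " + naive + ", views created: " + viewsCreated);
        viewsCreated = 0;
        int hoisted = sumHoisted();
        System.out.println("hoisted: " + hoisted + ", views created: " + viewsCreated);
    }
}
```

Both methods return the same sum; the only difference is one extra short-lived view per repeated `a[1]`. Whether that extra allocation is visible in practice depends on the JIT (for example escape analysis), which is why a benchmark is the right way to settle it.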
  }

  /*
  Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
I reran all the benchmarks in this file and updated the results.
  ByteBuffer API                            1411 / 1418        232.2           4.3       0.1X
  DirectByteBuffer                           467 /  474        701.8           1.4       0.4X
  Unsafe Buffer                              178 /  185       1843.6           0.5       1.0X
  Column(on heap)                            178 /  184       1840.8           0.5       1.0X
Previously the on-heap column vector was much faster than the Java array, which is unreasonable, and I can't reproduce it now.
  String Read/Write:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------
  On Heap                                        332 /  338         49.3          20.3       1.0X
  Off Heap                                       466 /  467         35.2          28.4       0.7X
This is due to the data copy saved in https://github.com/apache/spark/pull/19815/files#diff-f43d67d60091eab39c1310330bf7a8ffR211
  On Heap Read Size Only                         415 /  422        394.7           2.5       1.0X
  Off Heap Read Size Only                        394 /  402        415.9           2.4       1.1X
  On Heap Read Elements                         2558 / 2593         64.0          15.6       0.2X
  Off Heap Read Elements                        3316 / 3317         49.4          20.2       0.1X
The result before this PR:
  On Heap Read Size Only                          83 /   92       1970.3           0.5       1.0X
  Off Heap Read Size Only                         98 /  110       1669.1           0.6       0.8X
  On Heap Read Elements                         3190 / 3203         51.4          19.5       0.0X
  Off Heap Read Elements                        3106 / 3146         52.8          19.0       0.0X
In the worst case, where we only get the array and read its size, reusing the object is noticeably faster. However, if we also access the array elements (which should be the most common case), the overhead is negligible.
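The comparison can be made explicit with a little arithmetic on the "Per Row(ns)" column of the two tables. The sketch below only restates the numbers quoted in this thread; the class and method names are made up for illustration, and the figures come from a single machine, so small differences may well be noise.

```java
public class PerRowOverhead {
    // Per-row times in nanoseconds, copied from the "Per Row(ns)" column above:
    // index 0 is the result before this PR, index 1 is the result after it.
    public static final double[] ON_HEAP_SIZE_ONLY  = {0.5, 2.5};
    public static final double[] OFF_HEAP_SIZE_ONLY = {0.6, 2.4};
    public static final double[] ON_HEAP_ELEMENTS   = {19.5, 15.6};
    public static final double[] OFF_HEAP_ELEMENTS  = {19.0, 20.2};

    /** Extra nanoseconds per row after the PR (negative means faster). */
    public static double deltaNs(double[] beforeAfter) {
        return beforeAfter[1] - beforeAfter[0];
    }

    /** Change expressed as a fraction of the pre-PR cost. */
    public static double relativeChange(double[] beforeAfter) {
        return deltaNs(beforeAfter) / beforeAfter[0];
    }

    private static void report(String name, double[] beforeAfter) {
        System.out.printf("%-24s %+5.1f ns/row (%+.0f%%)%n",
            name, deltaNs(beforeAfter), 100 * relativeChange(beforeAfter));
    }

    public static void main(String[] args) {
        report("On Heap Read Size Only", ON_HEAP_SIZE_ONLY);
        report("Off Heap Read Size Only", OFF_HEAP_SIZE_ONLY);
        report("On Heap Read Elements", ON_HEAP_ELEMENTS);
        report("Off Heap Read Elements", OFF_HEAP_ELEMENTS);
    }
}
```

Read this way, creating a new object adds roughly 2 ns per row in the size-only case, which is a large relative change only because the baseline is about half a nanosecond. Once elements are read, each row costs 15 to 20 ns, and the measured change is about +1.2 ns off heap and -3.9 ns on heap.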
Thank you for running the benchmark. I understand that reusing the object performs well.
I am curious whether the current Catalyst can generate such Java code for accessing nested array elements in SQL, e.g. selectExpr("a[1][1] + a[1][2] + a[1][3] + a[1][4] + a[1][5]").
Test build #84307 has finished for PR 19842 at commit
Since the benchmark shows negligible overhead for normal cases, I'm merging it to master, thanks!
## What changes were proposed in this pull request?

Similar to apache#19842, we should also make `ColumnarRow` an immutable view, and move forward to make `ColumnVector` public.

## How was this patch tested?

Existing tests. The performance concern should be the same as for apache#19842.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19898 from cloud-fan/row-id.
What changes were proposed in this pull request?
To make `ColumnVector` public, `ColumnarArray` needs to be public too, and we should not have mutable public fields in a public class. This PR proposes to make `ColumnarArray` an immutable view of the data, and to always create a new instance of `ColumnarArray` in `ColumnVector#getArray`.
How was this patch tested?
New benchmark in `ColumnarBatchBenchmark`.
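As an illustration of why mutable public fields are a problem for a public class, the following sketch contrasts a shared, reused holder with a freshly created immutable view. It is a simplified model, not Spark's code: `ReusedArray`, `ArrayView`, and the toy column data are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ImmutableViewSketch {
    /** Hypothetical stand-in for the old style: one shared holder with mutable public fields. */
    static final class ReusedArray {
        public int offset;
        public int length;
    }

    /** Hypothetical stand-in for the new style: all fields final, so a view never changes. */
    static final class ArrayView {
        private final int[] data;
        private final int offset;
        private final int length;

        ArrayView(int[] data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
        }

        int numElements() { return length; }

        int getInt(int ordinal) { return data[offset + ordinal]; }
    }

    // Toy column holding three array values: [1, 2], [3, 4, 5], [6].
    static final int[] DATA = {1, 2, 3, 4, 5, 6};
    static final int[] OFFSETS = {0, 2, 5};
    static final int[] LENGTHS = {2, 3, 1};

    private static final ReusedArray SHARED = new ReusedArray();

    // Old style: overwrite the shared holder and hand out the same object every time.
    static ReusedArray getArrayReused(int rowId) {
        SHARED.offset = OFFSETS[rowId];
        SHARED.length = LENGTHS[rowId];
        return SHARED;
    }

    // New style: every call returns an independent immutable view.
    static ArrayView getArrayFresh(int rowId) {
        return new ArrayView(DATA, OFFSETS[rowId], LENGTHS[rowId]);
    }

    /** Holds on to the result for every row, then reads the lengths back. */
    public static int[] lengthsHeldViaReuse() {
        List<ReusedArray> held = new ArrayList<>();
        for (int row = 0; row < LENGTHS.length; row++) {
            held.add(getArrayReused(row));
        }
        int[] out = new int[held.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = held.get(i).length; // every entry aliases SHARED
        }
        return out;
    }

    public static int[] lengthsHeldViaFresh() {
        List<ArrayView> held = new ArrayList<>();
        for (int row = 0; row < LENGTHS.length; row++) {
            held.add(getArrayFresh(row));
        }
        int[] out = new int[held.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = held.get(i).numElements();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lengthsHeldViaReuse())); // prints [1, 1, 1]
        System.out.println(Arrays.toString(lengthsHeldViaFresh())); // prints [2, 3, 1]
    }
}
```

With the reused holder, any caller that keeps a reference across rows silently sees the last row's values, and any caller can overwrite the public fields. The immutable view avoids both issues at the cost of one small allocation per call.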