[SPARK-20770][SQL] Improve ColumnStats #18002
Test build #76975 has finished for PR 18002 at commit

@hvanhovell would it be possible to take a look?

    if (!row.isNullAt(ordinal)) {
      count += 1
    } else {
      super.gatherNullStats

do we need the `super` keyword here?

    testDecimalColumnStats(createRow(null, null, 0))

    def createRow(values: Any*): GenericInternalRow = new GenericInternalRow(values.toArray)
    def createRow(values: Any*): Array[Any] = values.toArray

do we still need this method?

LGTM

    if (!row.isNullAt(ordinal)) {
      sizeInBytes += BINARY.actualSize(row, ordinal)
      val size = BINARY.actualSize(row, ordinal)
      gatherValueStats(size)

Nit: we may not need `gatherValueStats` here. Simply inline:

    sizeInBytes += BINARY.actualSize(row, ordinal)
    count += 1

    override def collectedStatistics: GenericInternalRow =
      new GenericInternalRow(Array[Any](null, null, nullCount, count, sizeInBytes))
    def gatherValueStats(size: Int): Unit = {

LGTM

Test build #77029 has finished for PR 18002 at commit

    if (upper == null || value.compareTo(upper) > 0) upper = value.clone()
    if (lower == null || value.compareTo(lower) < 0) lower = value.clone()
    sizeInBytes += STRING.actualSize(row, ordinal)
    val size = STRING.actualSize(row, ordinal)

not related, but `STRING.actualSize` should just take `UTF8String`

I may not understand your point. Do you want to use `row.getUTF8String(ordinal).numBytes() + 4` instead of calling `STRING.actualSize()` (i.e. method inlining)?

I mean we can just pass the `UTF8String` to `STRING.actualSize`

In `STRING.actualSize`, we call `row.getUTF8String(ordinal)`, so why don't we pass in the `UTF8String` directly?

Do you want to add a new method `STRING.actualSize(s: UTF8String)`? The current signature `actualSize(row: InternalRow, ordinal: Int)` cannot be changed since it is declared in the superclass.
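
For readers following this thread, the overload being discussed might look roughly like the sketch below. It is not the code from this PR: `Utf8Str`, `StringRow`, `SketchColumnType`, and `SketchString` are hypothetical stand-ins for `UTF8String`, `InternalRow`, the column type superclass, and `STRING`, and the `+ 4` is taken from the expression quoted in the comment above.

```scala
// Hypothetical stand-in for UTF8String: only exposes the byte length.
final class Utf8Str(bytes: Array[Byte]) {
  def numBytes: Int = bytes.length
}

// Hypothetical stand-in for InternalRow: only what this sketch needs.
trait StringRow {
  def getUTF8String(ordinal: Int): Utf8Str
}

// Hypothetical superclass that fixes the (row, ordinal) signature.
abstract class SketchColumnType {
  def actualSize(row: StringRow, ordinal: Int): Int
}

object SketchString extends SketchColumnType {
  // The inherited signature stays unchanged and delegates to the overload.
  override def actualSize(row: StringRow, ordinal: Int): Int =
    actualSize(row.getUTF8String(ordinal))

  // Added overload: callers that already hold the string skip the row lookup.
  def actualSize(s: Utf8Str): Int = s.numBytes + 4
}
```

With this shape, a caller that has already read the value for the `upper`/`lower` comparison can pass it straight to the overload instead of fetching it from the row a second time.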

    override def collectedStatistics: GenericInternalRow =
      new GenericInternalRow(Array[Any](null, null, nullCount, count, sizeInBytes))
    def gatherValueStats(size: Int): Unit = {

    if (lower == null || value.compareTo(lower) < 0) lower = value
    // TODO: this is not right for DecimalType with precision > 18
    sizeInBytes += 8
    val size = 8

can we just hardcode 8 in `gatherValueStats`?

Test build #77050 has finished for PR 18002 at commit

@cloud-fan and @viirya, thank you for good comments. I am looking forward to merging it into master.

+1, LGTM, too.

@cloud-fan could you please let me know if I have to do additional things for this PR?

thanks, merging to master!

Thank you very much

## What changes were proposed in this pull request?

This PR improves the implementation of `ColumnStats` by using the following approaches.

1. Declare subclasses of `ColumnStats` as `final`
2. Remove unnecessary calls of `row.isNullAt(ordinal)`
3. Remove the dependency on `GenericInternalRow`

For 1., this declaration encourages method inlining and other optimizations by the JIT compiler.

For 2., in `gatherStats()`, the previous code in subclasses of `ColumnStats` always called `row.isNullAt()` twice, while this PR calls `row.isNullAt()` only once.

For 3., `collectedStatistics()` returns `Array[Any]` instead of `GenericInternalRow`. This removes the dependency on an unnecessary package and reduces the number of `GenericInternalRow` allocations.

In addition, in the future, `gatherValueStats()`, which is specialized for each data type, can be called efficiently from generated code without using the generic data structure `InternalRow`.

## How was this patch tested?

Tested by the existing test suite.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#18002 from kiszk/SPARK-20770.
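
The three changes described above can be illustrated with a minimal, self-contained sketch. It is not the Spark source: `SketchRow`, `SeqRow`, `SketchColumnStats`, and `SketchIntColumnStats` are hypothetical names, and the 4 bytes charged per null and per `Int` value are assumptions made for this example.

```scala
// Hypothetical stand-in for InternalRow: only what this sketch needs.
trait SketchRow {
  def isNullAt(ordinal: Int): Boolean
  def getInt(ordinal: Int): Int
}

// Minimal row backed by boxed values; null marks a SQL NULL.
final class SeqRow(values: Seq[Any]) extends SketchRow {
  def isNullAt(ordinal: Int): Boolean = values(ordinal) == null
  def getInt(ordinal: Int): Int = values(ordinal).asInstanceOf[Int]
}

abstract class SketchColumnStats {
  protected var count = 0
  protected var nullCount = 0
  protected var sizeInBytes = 0L

  def gatherStats(row: SketchRow, ordinal: Int): Unit

  // Shared bookkeeping for a null value (4 bytes is an assumption).
  protected def gatherNullStats(): Unit = {
    nullCount += 1
    sizeInBytes += 4
    count += 1
  }

  // Point 3: a plain array instead of a row object.
  def collectedStatistics: Array[Any]
}

// Point 1: the concrete subclass is final.
final class SketchIntColumnStats extends SketchColumnStats {
  private var upper = Int.MinValue
  private var lower = Int.MaxValue

  // Point 2: isNullAt is evaluated exactly once per call.
  override def gatherStats(row: SketchRow, ordinal: Int): Unit = {
    if (!row.isNullAt(ordinal)) {
      gatherValueStats(row.getInt(ordinal))
    } else {
      gatherNullStats()
    }
  }

  // Type-specialized entry point that needs no row at all.
  def gatherValueStats(value: Int): Unit = {
    if (value > upper) upper = value
    if (value < lower) lower = value
    sizeInBytes += 4
    count += 1
  }

  override def collectedStatistics: Array[Any] =
    Array[Any](lower, upper, nullCount, count, sizeInBytes)
}
```

Because `gatherValueStats` takes a primitive `Int`, code that already holds the value can call it directly without wrapping the value in a row, which is the future use the description hints at.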