[SPARK-20770][SQL] Improve ColumnStats by kiszk · Pull Request #18002 · apache/spark

kiszk · 2017-05-16T15:58:41Z

What changes were proposed in this pull request?

This PR improves the implementation of ColumnStats by using the following approaches.

1. Declare subclasses of ColumnStats as final
2. Remove unnecessary call of row.isNullAt(ordinal)
3. Remove the dependency on GenericInternalRow

For 1., declaring the subclasses final encourages method inlining and other optimizations by the JIT compiler.
For 2., the previous gatherStats() code in subclasses of ColumnStats always called row.isNullAt() twice; with this PR it is called only once.
For 3., collectedStatistics() returns Array[Any] instead of GenericInternalRow. This removes a dependency on an unnecessary package and reduces the number of GenericInternalRow allocations.
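
The three points can be illustrated without Spark on the classpath. The following is a minimal sketch, not the code from this PR: `SimpleRow`, `CountingRow`, and `IntStatsSketch` are hypothetical stand-ins for `InternalRow` and a concrete `ColumnStats` subclass, and the accounting in `gatherNullStats` (4 bytes per null, nulls counted as rows) is an assumption made for the example.

```scala
object ColumnStatsSketch {
  // Hypothetical stand-in for Spark's InternalRow; only what the sketch needs.
  trait SimpleRow {
    def isNullAt(ordinal: Int): Boolean
    def getInt(ordinal: Int): Int
  }

  // Row that counts isNullAt calls so the "only once" behavior can be observed.
  final class CountingRow(values: Array[Any]) extends SimpleRow {
    var nullChecks = 0
    def isNullAt(ordinal: Int): Boolean = { nullChecks += 1; values(ordinal) == null }
    def getInt(ordinal: Int): Int = values(ordinal).asInstanceOf[Int]
  }

  // Point 1: the class is final, which makes its methods easier for the JIT to inline.
  final class IntStatsSketch {
    private var upper = Int.MinValue
    private var lower = Int.MaxValue
    private var nullCount = 0
    private var count = 0
    private var sizeInBytes = 0L

    // Point 2: a single isNullAt call selects either the value path or the null path.
    def gatherStats(row: SimpleRow, ordinal: Int): Unit = {
      if (!row.isNullAt(ordinal)) {
        val value = row.getInt(ordinal)
        if (value > upper) upper = value
        if (value < lower) lower = value
        sizeInBytes += 4
        count += 1
      } else {
        gatherNullStats()
      }
    }

    // Assumption for this sketch: a null costs 4 bytes and still counts as a row.
    private def gatherNullStats(): Unit = {
      nullCount += 1
      sizeInBytes += 4
      count += 1
    }

    // Point 3: a plain array is returned instead of a GenericInternalRow wrapper.
    def collectedStatistics: Array[Any] =
      Array[Any](lower, upper, nullCount, count, sizeInBytes)
  }

  // Gathers one non-null and one null value, then reports the null-check count and the stats.
  def demo(): (Int, Seq[Any]) = {
    val stats = new IntStatsSketch
    val row = new CountingRow(Array[Any](7, null))
    stats.gatherStats(row, 0)
    stats.gatherStats(row, 1)
    (row.nullChecks, stats.collectedStatistics.toSeq)
  }
}
```

In `demo()`, two calls to `gatherStats` trigger exactly two `isNullAt` calls, one per row position.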

In addition, gatherValueStats(), which is specialized for each data type, could in the future be called efficiently from generated code without going through the generic data structure InternalRow.
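
A rough sketch of what such a direct call might look like is given here. It is an interpretation of the stated intent, not code from this PR: `IntValueStats` and `scan` are hypothetical names, and the `while` loop merely stands in for generated code that already holds primitive values.

```scala
object SpecializedStatsSketch {
  // Hypothetical type-specialized statistics holder for Int columns.
  final class IntValueStats {
    var upper = Int.MinValue
    var lower = Int.MaxValue
    var count = 0
    var sizeInBytes = 0L

    // Takes a primitive Int, so the caller needs neither a row object nor boxing.
    def gatherValueStats(value: Int): Unit = {
      if (value > upper) upper = value
      if (value < lower) lower = value
      sizeInBytes += 4
      count += 1
    }
  }

  // Stand-in for generated code: feeds primitives straight into the stats object.
  def scan(values: Array[Int]): IntValueStats = {
    val stats = new IntValueStats
    var i = 0
    while (i < values.length) {
      stats.gatherValueStats(values(i))
      i += 1
    }
    stats
  }
}
```

For example, `SpecializedStatsSketch.scan(Array(3, -1, 10))` gathers statistics for three values without constructing any row.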

How was this patch tested?

Tested by the existing test suites.

SparkQA · 2017-05-16T18:17:39Z

Test build #76975 has finished for PR 18002 at commit 11057f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-05-16T18:39:45Z

@hvanhovell would it be possible to take a look?
cc @cloud-fan

cloud-fan · 2017-05-17T08:03:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala

+    if (!row.isNullAt(ordinal)) {
+      count += 1
+    } else {
+      super.gatherNullStats


do we need the super keyword here?

Good catch. done.

cloud-fan · 2017-05-17T08:06:17Z

sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnStatsSuite.scala

  testDecimalColumnStats(createRow(null, null, 0))

-  def createRow(values: Any*): GenericInternalRow = new GenericInternalRow(values.toArray)
+  def createRow(values: Any*): Array[Any] = values.toArray


do we still need this method?

I see. Eliminated.

cloud-fan · 2017-05-17T08:06:34Z

LGTM

viirya · 2017-05-17T08:26:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala

    if (!row.isNullAt(ordinal)) {
-      sizeInBytes += BINARY.actualSize(row, ordinal)
+      val size = BINARY.actualSize(row, ordinal)
+      gatherValueStats(size)


Nit: we may not need gatherValueStats here. Simply inline:

sizeInBytes += BINARY.actualSize(row, ordinal)
count += 1

Thanks, done.

viirya · 2017-05-17T08:31:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala


-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](null, null, nullCount, count, sizeInBytes))
+  def gatherValueStats(size: Int): Unit = {


Nit: we can inline this too.

viirya · 2017-05-17T08:33:39Z

LGTM

SparkQA · 2017-05-17T22:04:58Z

Test build #77029 has finished for PR 18002 at commit 3e3ffde.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-18T03:08:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala

-      if (upper == null || value.compareTo(upper) > 0) upper = value.clone()
-      if (lower == null || value.compareTo(lower) < 0) lower = value.clone()
-      sizeInBytes += STRING.actualSize(row, ordinal)
+      val size = STRING.actualSize(row, ordinal)


not related, but STRING.actualSize should just take UTF8String

I may not understand your point.
Do you want to use row.getUTF8String(ordinal).numBytes() + 4 instead of calling STRING.actualSize()? (i.e. method inlining).

I mean we can just pass the UTF8String to STRING.actualSize

In STRING.actualSize, we call row.getUTF8String(ordinal), so why don't we pass in the UTF8String directly?

Do you want to add a new method STRING.actualSize(s: UTF8String)? The current signature actualSize(row: InternalRow, ordinal: Int) cannot be changed since it is declared in the superclass.

ah i see, nvm
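
The constraint discussed in this thread can be sketched as follows. This is an illustration only, not Spark's code: `StringRow`, `ColumnTypeSketch`, and `StringTypeSketch` are hypothetical, `String` stands in for `UTF8String`, and the 4-byte length prefix is taken from the `numBytes() + 4` expression quoted above. The row-based signature stays fixed by the superclass, so passing the value directly would require an additional overload.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object ActualSizeSketch {
  // Hypothetical stand-in for InternalRow, with UTF8String replaced by String.
  final class StringRow(values: Array[String]) {
    def getString(ordinal: Int): String = values(ordinal)
  }

  // The row-based signature is declared once, in the superclass.
  abstract class ColumnTypeSketch {
    def actualSize(row: StringRow, ordinal: Int): Int
  }

  object StringTypeSketch extends ColumnTypeSketch {
    // Subclasses have to keep the inherited signature unchanged.
    override def actualSize(row: StringRow, ordinal: Int): Int =
      actualSize(row.getString(ordinal))

    // Additional overload taking the value directly:
    // UTF-8 byte length plus a 4-byte length prefix.
    def actualSize(s: String): Int = s.getBytes(UTF_8).length + 4
  }
}
```

With this shape, a caller that already holds the string can call `StringTypeSketch.actualSize(s)` and skip the second row lookup, while existing row-based callers are unaffected.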

cloud-fan · 2017-05-18T03:09:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala


-  override def collectedStatistics: GenericInternalRow =
-    new GenericInternalRow(Array[Any](null, null, nullCount, count, sizeInBytes))
+  def gatherValueStats(size: Int): Unit = {


is this method used?

Sure, removed.

cloud-fan · 2017-05-18T03:10:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala

-      if (lower == null || value.compareTo(lower) < 0) lower = value
      // TODO: this is not right for DecimalType with precision > 18
-      sizeInBytes += 8
+      val size = 8


can we just hardcode 8 in gatherValueStats?

Thanks, done

SparkQA · 2017-05-18T17:34:07Z

Test build #77050 has finished for PR 18002 at commit 66fefb6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-05-19T14:29:19Z

@cloud-fan and @viirya, thank you for the good comments. I am looking forward to merging it into master.

dongjoon-hyun · 2017-05-19T18:03:46Z

+1, LGTM, too.

kiszk · 2017-05-22T00:02:36Z

@cloud-fan could you please let me know if I have to do additional things for this PR?

cloud-fan · 2017-05-22T08:24:24Z

thanks, merging to master!

kiszk · 2017-05-22T10:07:43Z

Thank you very much

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes apache#18002 from kiszk/SPARK-20770.

initial commit

11057f4

cloud-fan reviewed May 17, 2017

viirya reviewed May 17, 2017

address review comment

3e3ffde

cloud-fan reviewed May 18, 2017

address review comments

66fefb6

asfgit closed this in 833c8d4 May 22, 2017
