Skip to content

[SPARK-20770][SQL] Improve ColumnStats#18002

Closed
kiszk wants to merge 3 commits intoapache:masterfrom
kiszk:SPARK-20770
Closed

[SPARK-20770][SQL] Improve ColumnStats#18002
kiszk wants to merge 3 commits intoapache:masterfrom
kiszk:SPARK-20770

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented May 16, 2017

What changes were proposed in this pull request?

This PR improves the implementation of ColumnStats by using the following appoaches.

  1. Declare subclasses of ColumnStats as final
  2. Remove unnecessary call of row.isNullAt(ordinal)
  3. Remove the dependency on GenericInternalRow

For 1., this declaration encourages method inlining and other optimizations of JIT compiler
For 2., in gatherStats(), while previous code in subclasses of ColumnStats always calls row.isNullAt() twice, the PR just calls row.isNullAt() only once.
For 3., collectedStatistics() returns Array[Any] instead of GenericInternalRow. This removes the dependency of unnecessary package and reduces the number of allocations of GenericInternalRow.

In addition to that, in the future, gatherValueStats(), which is specialized for each data type, can be effectively called from the generated code without using generic data structure InternalRow.

How was this patch tested?

Tested by existing test suite

@SparkQA
Copy link

SparkQA commented May 16, 2017

Test build #76975 has finished for PR 18002 at commit 11057f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented May 16, 2017

@hvanhovell would it be possible to take a look?
cc @cloud-fan

if (!row.isNullAt(ordinal)) {
count += 1
} else {
super.gatherNullStats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the super keyword here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. done.

testDecimalColumnStats(createRow(null, null, 0))

def createRow(values: Any*): GenericInternalRow = new GenericInternalRow(values.toArray)
def createRow(values: Any*): Array[Any] = values.toArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Eliminated.

@cloud-fan
Copy link
Contributor

LGTM

if (!row.isNullAt(ordinal)) {
sizeInBytes += BINARY.actualSize(row, ordinal)
val size = BINARY.actualSize(row, ordinal)
gatherValueStats(size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we may not need gatherValueStats here. Simply inline:

sizeInBytes += BINARY.actualSize(row, ordinal)
count += 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.


override def collectedStatistics: GenericInternalRow =
new GenericInternalRow(Array[Any](null, null, nullCount, count, sizeInBytes))
def gatherValueStats(size: Int): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can inline this too.

@viirya
Copy link
Member

viirya commented May 17, 2017

LGTM

@SparkQA
Copy link

SparkQA commented May 17, 2017

Test build #77029 has finished for PR 18002 at commit 3e3ffde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (upper == null || value.compareTo(upper) > 0) upper = value.clone()
if (lower == null || value.compareTo(lower) < 0) lower = value.clone()
sizeInBytes += STRING.actualSize(row, ordinal)
val size = STRING.actualSize(row, ordinal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related, but STRING.actualSize should just take UTF8String

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may not understand your point.
Do you want to use row.getUTF8String(ordinal).numBytes() + 4 instead of calling STRING.actualSize()? (i.e. method inlining).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we can just pass the UTF8String to STRING.actualSize

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In STRING.actualSize, we call row.getUTF8String(ordinal), so why not we pass in the UTF8String directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add the new method STRING.actualSize(s: UTF8String)? The current signature actualSize(row: InternalRow, ordinal: Int) cannot be changed since it is declared at the super class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see, nvm


override def collectedStatistics: GenericInternalRow =
new GenericInternalRow(Array[Any](null, null, nullCount, count, sizeInBytes))
def gatherValueStats(size: Int): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this method used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, removed.

if (lower == null || value.compareTo(lower) < 0) lower = value
// TODO: this is not right for DecimalType with precision > 18
sizeInBytes += 8
val size = 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just hardcode 8 in gatherValueStats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done

@SparkQA
Copy link

SparkQA commented May 18, 2017

Test build #77050 has finished for PR 18002 at commit 66fefb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented May 19, 2017

@cloud-fan and @viirya, thank you for good comments. I am looking forward to merging it into master.

@dongjoon-hyun
Copy link
Member

+1, LGTM, too.

@kiszk
Copy link
Member Author

kiszk commented May 22, 2017

@cloud-fan could you please let me know if I have to do additional things for this PR?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 833c8d4 May 22, 2017
@kiszk
Copy link
Member Author

kiszk commented May 22, 2017

Thank you very much

lycplus pushed a commit to lycplus/spark that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?

This PR improves the implementation of `ColumnStats` by using the following appoaches.

1. Declare subclasses of `ColumnStats` as `final`
2. Remove unnecessary call of `row.isNullAt(ordinal)`
3. Remove the dependency on `GenericInternalRow`

For 1., this declaration encourages method inlining and other optimizations of JIT compiler
For 2., in `gatherStats()`, while previous code in subclasses of `ColumnStats` always calls `row.isNullAt()` twice, the PR just calls `row.isNullAt()` only once.
For 3., `collectedStatistics()` returns `Array[Any]` instead of `GenericInternalRow`. This removes the dependency of unnecessary package and reduces the number of allocations of `GenericInternalRow`.

In addition to that, in the future, `gatherValueStats()`, which is specialized for each data type, can be effectively called from the generated code without using generic data structure `InternalRow`.

## How was this patch tested?

Tested by existing test suite

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#18002 from kiszk/SPARK-20770.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants