[SPARK-20783][SQL] Create ColumnVector to abstract existing compressed column (batch method) #18704

kiszk · 2017-07-21T14:24:03Z

What changes were proposed in this pull request?

This PR uses ColumnVector to abstract data compressed by CompressibleColumnAccessor, following the batch method. When ColumnAccessor.decompress is called, the ColumnVector holds the uncompressed data. This batch decompression does not go through InternalRow, which reduces the number of memory accesses.

As a first step of this implementation, this JIRA supports primitive data types. Another PR will support arrays and other data types.

This implementation decompresses data in batch into an uncompressed column batch, as @rxin suggested. Another implementation uses the adapter approach that @cloud-fan suggested.
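To make the batch idea concrete, here is a minimal sketch of decoding a run-length-encoded int column into a flat array in one pass, with no per-row object in between. It is not the code in this PR: the class `RunLengthBatchSketch`, the method `decodeInts`, and the (value, run length) input layout are all assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not a class touched by this PR.
final class RunLengthBatchSketch {
    // Expands (value, runLength) pairs into one flat int[] in a single pass.
    static int[] decodeInts(int[] values, int[] runLengths) {
        if (values.length != runLengths.length) {
            throw new IllegalArgumentException("values and runLengths must have the same length");
        }
        // First pass: compute the total number of rows so the output is allocated once.
        int total = 0;
        for (int len : runLengths) {
            if (len < 0) {
                throw new IllegalArgumentException("negative run length");
            }
            total += len;
        }
        // Second pass: write each run directly into the output array.
        int[] out = new int[total];
        int pos = 0;
        for (int i = 0; i < values.length; i++) {
            Arrays.fill(out, pos, pos + runLengths[i], values[i]);
            pos += runLengths[i];
        }
        return out;
    }
}
```

A caller would then read rows by index from the returned array, which is the access pattern a column vector offers.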

How was this patch tested?

Added test suites

SparkQA · 2017-07-21T14:34:22Z

Test build #79837 has finished for PR 18704 at commit c09f05f.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-21T14:49:17Z

Test build #79838 has finished for PR 18704 at commit ec368d8.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class CachedBatchColumnVector extends ReadOnlyColumnVector

SparkQA · 2017-07-21T16:29:33Z

Test build #79839 has finished for PR 18704 at commit d6e8fef.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-07-21T17:08:54Z

@rxin Could you please review this PR? This is the batch approach that you suggested.
Regarding the test failure, the issue is only in the test suite.

SparkQA · 2017-07-21T19:18:58Z

Test build #79843 has finished for PR 18704 at commit bd0c334.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-07-23T02:49:28Z

ping @rxin

kiszk · 2017-07-26T16:04:49Z

ping @rxin

kiszk · 2017-07-31T18:14:49Z

@rxin Could you please review this PR?

SparkQA · 2017-08-22T05:49:14Z

Test build #80958 has finished for PR 18704 at commit a24a971.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-22T06:14:12Z

Test build #80961 has finished for PR 18704 at commit 6367a4c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-22T07:04:49Z

Test build #80964 has finished for PR 18704 at commit 9c8960b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-22T08:33:34Z

retest this please

SparkQA · 2017-08-22T10:20:25Z

Test build #80978 has finished for PR 18704 at commit 9c8960b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-22T13:36:12Z

Test build #80982 has finished for PR 18704 at commit 8f542b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-22T14:12:38Z

@cloud-fan I updated this implementation by using ColumnVector, as we discussed. I would appreciate it if you could discuss two implementations (on-demand approach) with @rxin.
cc @ueshin

cloud-fan · 2017-08-24T15:09:50Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

now we can move them to WritableColumnVector

Got it. Rebased in my local version.

cloud-fan · 2017-08-24T15:13:11Z

.../src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala

can we delay the decompression and set the dictionary to ColumnVector?

Sure, I will do that.
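One way to picture the suggestion above is a column that keeps the dictionary ids as they are and resolves a value only when a row is read. The sketch below assumes an int column; `DictionaryBackedIntColumnSketch` and its methods are hypothetical names, not Spark APIs.

```java
// Hypothetical illustration only; not a class touched by this PR.
final class DictionaryBackedIntColumnSketch {
    private final int[] dictionary; // distinct values, indexed by dictionary id
    private final int[] ids;        // one dictionary id per row, left encoded

    DictionaryBackedIntColumnSketch(int[] dictionary, int[] ids) {
        this.dictionary = dictionary;
        this.ids = ids;
    }

    // Decoding is deferred to read time: nothing is expanded up front.
    int getInt(int rowId) {
        return dictionary[ids[rowId]];
    }

    int numRows() {
        return ids.length;
    }
}
```

The trade-off is one extra array lookup per read in exchange for skipping the up-front expansion of the whole column.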

SparkQA · 2017-08-24T19:11:07Z

Test build #81093 has finished for PR 18704 at commit fb0d4e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class ColumnDictionary implements Dictionary

kiszk · 2017-08-25T00:10:13Z

@cloud-fan could you please review this again?

kiszk · 2017-08-31T03:03:13Z

ping @cloud-fan

SparkQA · 2017-08-31T22:06:55Z

Test build #81295 has finished for PR 18704 at commit 097fc05.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class ColumnDictionary implements Dictionary

kiszk · 2017-09-01T02:52:41Z

@cloud-fan Resolved conflict, could you please review?

kiszk · 2017-09-06T03:32:31Z

ping @cloud-fan

kiszk · 2017-09-11T06:30:02Z

ping @cloud-fan

cloud-fan · 2017-09-12T14:24:17Z

sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java

is it possible to avoid boxing here? e.g. we can have a lot of primitive array members.

Yeah, I removed boxing.
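As an illustration of the "primitive array members" idea, a dictionary can hold one primitive array per supported type instead of a single array of boxed objects. This is only a sketch: `PrimitiveDictionarySketch` and its method names are hypothetical and are not claimed to match ColumnDictionary.

```java
// Hypothetical illustration only; not the ColumnDictionary class from this PR.
final class PrimitiveDictionarySketch {
    private final int[] intValues;   // set when the column holds ints, otherwise null
    private final long[] longValues; // set when the column holds longs, otherwise null

    PrimitiveDictionarySketch(int[] intValues) {
        this.intValues = intValues;
        this.longValues = null;
    }

    PrimitiveDictionarySketch(long[] longValues) {
        this.intValues = null;
        this.longValues = longValues;
    }

    // Typed accessors return primitives, so no Integer or Long is allocated per lookup.
    int decodeToInt(int id) {
        return intValues[id];
    }

    long decodeToLong(int id) {
        return longValues[id];
    }
}
```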

cloud-fan · 2017-09-12T14:26:06Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java

This description is a little vague, as the input data is byte[]. Can we say more about this? e.g. endianness.

@ueshin The comment at line 145 may be mistaken. It says: Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
It should be: Sets values from [src + srcIndex, src + srcIndex + count) to [rowId, rowId + count)

What do you think?
If we need to update them, should we do it in this PR, or is it better to create another PR?

Let's update them in this PR. BTW, WritableColumnVector may be exposed to end users so that they can build columnar batches for the data source v2 columnar scan, so the documentation is very important.
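The direction of the copy and the byte order are easier to see in code. The sketch below copies ints from a byte array into rows [rowId, rowId + count) of a destination array. It assumes 4-byte little-endian values purely for illustration; the thread does not state which byte order the real method expects. `LittleEndianIntCopySketch` and `putIntsLittleEndian` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the WritableColumnVector API.
final class LittleEndianIntCopySketch {
    // Reads `count` ints from bytes [srcIndex, srcIndex + count * 4) of src
    // and writes them to dst[rowId, rowId + count).
    // Assumption: each int is stored as 4 bytes in little-endian order.
    static void putIntsLittleEndian(int[] dst, int rowId, int count, byte[] src, int srcIndex) {
        ByteBuffer.wrap(src, srcIndex, count * 4)
            .order(ByteOrder.LITTLE_ENDIAN) // fixes the byte order before the int view is created
            .asIntBuffer()
            .get(dst, rowId, count);
    }
}
```

Note that the source range is measured in bytes while the destination range is measured in rows, which is the kind of detail the method comment needs to spell out.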

cloud-fan · 2017-09-12T14:29:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala

typo? ordinal?

good catch, done

cloud-fan · 2017-09-12T14:30:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala

If we need to throw exception at last, why not do it at the beginning?

thanks, fixed.

cloud-fan · 2017-09-12T14:34:04Z

.../src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala

hmmm, is there any way to reduce the code duplication? maybe codegen?

Removed code duplication by using a function object. How about this?
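A rough sketch of the function-object idea: the decode loop is written once, and the type-specific read and write step is passed in. The actual change is in Scala and is not shown in this thread; `DecodeLoopSketch`, `ValueCopier`, and `decodeAll` are hypothetical names used only for this example.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the code from compressionSchemes.scala.
final class DecodeLoopSketch {
    // The per-type step: read one value from the buffer and store it at rowId.
    interface ValueCopier {
        void copy(ByteBuffer in, int rowId);
    }

    // The shared loop, written once for every primitive type.
    static void decodeAll(ByteBuffer in, int numRows, ValueCopier copier) {
        for (int rowId = 0; rowId < numRows; rowId++) {
            copier.copy(in, rowId);
        }
    }
}
```

For an int column the caller could pass `(b, r) -> ints[r] = b.getInt()`, and for a long column `(b, r) -> longs[r] = b.getLong()`, so only the lambda differs between types.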

cloud-fan · 2017-10-03T15:03:02Z

...est/scala/org/apache/spark/sql/execution/columnar/compression/PassThroughEncodingSuite.scala

nit: indentation is wrong here.

cloud-fan · 2017-10-03T15:03:37Z

.../test/scala/org/apache/spark/sql/execution/columnar/compression/RunLengthEncodingSuite.scala

ditto

cloud-fan · 2017-10-03T15:05:14Z

retest this please

cloud-fan · 2017-10-03T15:05:27Z

LGTM, pending jenkins

SparkQA · 2017-10-03T15:14:01Z

Test build #82420 has finished for PR 18704 at commit 549b10f.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-10-03T15:21:37Z

I will rebase this within the next few hours.

SparkQA · 2017-10-03T20:23:10Z

Test build #82426 has finished for PR 18704 at commit c16230d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class PassThroughSuite extends SparkFunSuite

kiszk · 2017-10-04T00:35:31Z

@cloud-fan I merged with the latest master and addressed your comment about indentation.

cloud-fan · 2017-10-04T07:06:43Z

thanks, merging to master!

kiszk mentioned this pull request Jul 21, 2017

[SPARK-20783][SQL] Create CachedBatchColumnVector to abstract existing compressed column #18468

Closed

kiszk mentioned this pull request Jul 27, 2017

[SPARK-20822][SQL] Generate code to directly get value from ColumnVector for table cache #18747

Closed

kiszk changed the title ~~[SPARK-20783][SQL] Create CachedBatchColumnVector to abstract existing compressed column (batch method)~~ [SPARK-20783][SQL] Create ColumnVector to abstract existing compressed column (batch method) Aug 22, 2017

cloud-fan reviewed Aug 24, 2017

kiszk force-pushed the SPARK-20783a branch from 8f542b0 to fb0d4e5 Compare August 24, 2017 16:33

kiszk force-pushed the SPARK-20783a branch from fb0d4e5 to 097fc05 Compare August 31, 2017 19:31

cloud-fan reviewed Sep 12, 2017

cloud-fan reviewed Oct 3, 2017

.../test/scala/org/apache/spark/sql/execution/columnar/compression/RunLengthEncodingSuite.scala Outdated

cloud-fan Oct 3, 2017

ditto

kiszk added 17 commits October 4, 2017 00:30

initial commit for batch implementation as @rxin suggested

1c75f1d

add missing files

73e5aee

fix scala style error

591a358

fix test failure of DictionaryEncodingSuite

eb879ac

add new APIs for adding values from a byte array

10951d2

Use ColumnVector for ColumnAccessor

5efbd2e

fix scala type error

ff1ca23

fix test failure of RunLengthEncoding

9c4e1e0

rebase with master

55abc6f

Delay decompress for DictionaryEncoding

133375d

address review comments

4ef81e5

remove unused import

9fa1b28

revert unexpected style change

reduce code duplication

4cb6823

update comments

fc12bd8

address review comment

9dd58d3

fix scala style error

e9606df

address review comment

c16230d

kiszk force-pushed the SPARK-20783a branch from 549b10f to c16230d Compare October 3, 2017 17:38

asfgit closed this in 64df08b Oct 4, 2017

kiszk mentioned this pull request Oct 18, 2017

[SPARK-20783][SQL][Follow-up] Create ColumnVector to abstract existing compressed column #19508

Closed
