
Conversation

@kiszk
Member

@kiszk kiszk commented May 19, 2017

What changes were proposed in this pull request?

This PR adds compression/decompression of column data to ColumnVector.
While the current CachedBatch can compress column data by using multiple compression schemes, ColumnVector cannot. Compression is mandatory for the table cache.

Initially, this PR enables the following schemes. Support for additional compression schemes will be added in another JIRA.

  1. RunLengthEncoding for boolean/byte/short/int/long
  2. BooleanBitSet for boolean
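
As a rough illustration of the second scheme, BooleanBitSet packs each boolean into a single bit. This is a minimal sketch of the idea, not the PR's actual implementation; all names here are illustrative.

```java
// Hedged sketch: pack booleans into a long-backed bitset and read them back.
// This mirrors the idea behind a BooleanBitSet scheme; names are illustrative.
public class BitSetSketch {
    // Pack each boolean into one bit of a long[] (64 booleans per word).
    static long[] pack(boolean[] values) {
        long[] words = new long[(values.length + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            if (values[i]) words[i / 64] |= 1L << (i % 64);
        }
        return words;
    }

    // Read back the i-th boolean from the packed words.
    static boolean get(long[] words, int i) {
        return (words[i / 64] & (1L << (i % 64))) != 0;
    }

    public static void main(String[] args) {
        boolean[] in = {true, false, true, true};
        long[] packed = pack(in);
        for (int i = 0; i < in.length; i++) {
            if (get(packed, i) != in[i]) throw new AssertionError("mismatch at " + i);
        }
        System.out.println("4 booleans packed into " + packed.length + " word(s)"); // 1 word
    }
}
```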

At a high level, when ColumnVector.compress() is called, data is compressed from an array of a primitive data type into a byte array in ColumnVector. When ColumnVector.decompress() is called, data is decompressed from the byte array in ColumnVector back into an array of the primitive data type. ArrayBuffer is used for accessing data during compression and decompression.

This PR adds and changes the following APIs:

ArrayBuffer

  • This new class is similar to java.nio.ByteBuffer. ArrayBuffer can wrap an array of any primitive data type, such as Array[Int] or Array[Long]. The class manages the current position to be accessed.

ColumnType.get(buffer: ArrayBuffer): jvmType, ColumnType.put(buffer: ArrayBuffer)

  • These APIs get a primitive value from, or put a primitive value into, the current position of the given ArrayBuffer.
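
A minimal sketch of the idea, here specialized to an int-backed buffer in Java; the PR's actual ArrayBuffer is a Scala class wrapping Array[_] generically, and the class and method names below are only illustrative.

```java
// Hedged sketch of a position-managed wrapper over a primitive array,
// analogous to the PR's ArrayBuffer. get()/put() access the element at the
// current position and then advance it, like ColumnType.get/put above.
public class IntArrayBuffer {
    private final int[] array;
    private int position = 0;

    IntArrayBuffer(int[] array) { this.array = array; }

    int get() { return array[position++]; }            // read at position, advance
    void put(int value) { array[position++] = value; } // write at position, advance
    void rewind() { position = 0; }                    // reset for re-reading

    public static void main(String[] args) {
        IntArrayBuffer buf = new IntArrayBuffer(new int[3]);
        buf.put(10); buf.put(20); buf.put(30);
        buf.rewind();
        System.out.println(buf.get()); // 10
    }
}
```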

Encoder.gatherCompressibilityStats(in: ArrayBuffer)

  • This API calculates the uncompressed and compressed sizes for a given compression scheme.
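
To sketch what such a stats pass might look like for run-length encoding: walk the input once, count runs, and derive both sizes. This is an assumption about the general shape, not the PR's actual Encoder.gatherCompressibilityStats, and the (value, runLength) layout below is hypothetical.

```java
// Hedged sketch: compute uncompressed vs. RLE-compressed size for int data,
// assuming each run is stored as (value: 4 bytes, runLength: 4 bytes).
public class RleStats {
    // Returns {uncompressedBytes, compressedBytes}.
    static long[] stats(int[] values) {
        long runs = values.length == 0 ? 0 : 1;
        for (int i = 1; i < values.length; i++) {
            if (values[i] != values[i - 1]) runs++; // a new run starts here
        }
        return new long[] {4L * values.length, 8L * runs};
    }

    public static void main(String[] args) {
        long[] s = stats(new int[] {7, 7, 7, 7, 3, 3});
        System.out.println("uncompressed=" + s[0] + " compressed=" + s[1]); // 24 vs 16
    }
}
```

An encoder would compare the two sizes and fall back to no compression when RLE would not actually shrink the data.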

Encoder.compress(from: ArrayBuffer, to: ArrayBuffer): Unit

  • This API compresses the data in from and stores the compressed data into to. to must wrap a byte array large enough to hold the compressed data.

Decoder.decompress(values: ArrayBuffer): Unit

  • This API decompresses the data passed to the Decoder constructor and stores the uncompressed data into values. values must wrap an array large enough to hold the uncompressed data.
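
The compress/decompress pair can be sketched as a RunLengthEncoding round trip. This is a simplified stand-in: the PR's Encoder/Decoder operate on ArrayBuffer-wrapped arrays, while here plain int arrays take their place and runs are stored as (value, runLength) pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the RunLengthEncoding round trip on ints.
public class RleRoundTrip {
    // Compress: emit (value, runLength) for each run of equal values.
    static int[] compress(int[] from) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < from.length) {
            int value = from[i], run = 1;
            while (i + run < from.length && from[i + run] == value) run++;
            out.add(value);
            out.add(run);
            i += run;
        }
        int[] result = new int[out.size()];
        for (int j = 0; j < result.length; j++) result[j] = out.get(j);
        return result;
    }

    // Decompress: expand each (value, runLength) pair back into values.
    static int[] decompress(int[] compressed, int originalLength) {
        int[] values = new int[originalLength];
        int pos = 0;
        for (int j = 0; j < compressed.length; j += 2) {
            for (int k = 0; k < compressed[j + 1]; k++) values[pos++] = compressed[j];
        }
        return values;
    }

    public static void main(String[] args) {
        int[] in = {5, 5, 5, 2, 2, 9};
        int[] packed = compress(in);                  // {5,3, 2,2, 9,1}
        int[] out = decompress(packed, in.length);
        System.out.println(Arrays.equals(in, out));   // true
    }
}
```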

How was this patch tested?

Added new test suites

@SparkQA

SparkQA commented May 19, 2017

Test build #77091 has finished for PR 18033 at commit 6d5497e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayBuffer(array: Array[_])
  • class ColumnVectorCompressionBuilder[T <: AtomicType](dataType: T)

@kiszk kiszk changed the title Add compression/decompression of column data to ColumnVector [SPARK-20807][SQL] Add compression/decompression of column data to ColumnVector May 19, 2017
@SparkQA

SparkQA commented May 19, 2017

Test build #77092 has finished for PR 18033 at commit 193a71b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented May 19, 2017

@hvanhovell Could you please take a look? cc @sameeragarwal

@kiszk
Member Author

kiszk commented May 23, 2017

@hvanhovell Would it be possible to review this, or could you let us know who would be appropriate to review it?
cc @sameeragarwal

@kiszk
Member Author

kiszk commented May 29, 2017

ping @hvanhovell

@kiszk
Member Author

kiszk commented Jun 3, 2017

ping @hvanhovell @sameeragarwal

@kiszk
Member Author

kiszk commented Jun 8, 2017

ping @hvanhovell @sameeragarwal

@kiszk kiszk closed this Oct 4, 2017