
Conversation

@kiszk
Member

@kiszk kiszk commented May 19, 2017

What changes were proposed in this pull request?

This PR adds compression/decompression of column data to ColumnVector.
While the current CachedBatch can compress column data by using multiple compression schemes, ColumnVector cannot. Compression is mandatory for the table cache.

Initially, this PR enables the following schemes. Support for additional compression schemes will be added in another JIRA.

  1. RunLengthEncoding for boolean/byte/short/int/long
  2. BooleanBitSet for boolean
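
As a rough illustration of the second scheme, BooleanBitSet packs each boolean into a single bit. This is a minimal sketch of the idea, not the PR's actual implementation; all names here are illustrative.

```java
// Hedged sketch: pack booleans into a long-backed bitset and read them back.
// This mirrors the idea behind a BooleanBitSet scheme; names are illustrative.
public class BitSetSketch {
    // Pack each boolean into one bit of a long[] (64 booleans per word).
    static long[] pack(boolean[] values) {
        long[] words = new long[(values.length + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            if (values[i]) words[i / 64] |= 1L << (i % 64);
        }
        return words;
    }

    // Read back the i-th boolean from the packed words.
    static boolean get(long[] words, int i) {
        return (words[i / 64] & (1L << (i % 64))) != 0;
    }

    public static void main(String[] args) {
        boolean[] in = {true, false, true, true};
        long[] packed = pack(in);
        for (int i = 0; i < in.length; i++) {
            if (get(packed, i) != in[i]) throw new AssertionError("mismatch at " + i);
        }
        System.out.println("4 booleans packed into " + packed.length + " word(s)"); // 1 word
    }
}
```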

At a high level, when ColumnVector.compress() is called, data is compressed from an array of a primitive data type into a byte array in ColumnVector. When ColumnVector.decompress() is called, data is decompressed from the byte array in ColumnVector back into an array of the primitive data type. ArrayBuffer is used for accessing data during compression and decompression.

This PR adds and changes the following APIs:

ArrayBuffer

  • This new class is similar to java.nio.ByteBuffer. ArrayBuffer can wrap an array of any primitive data type, such as Array[Int] or Array[Long]. The class manages the current position to be accessed.

ColumnType.get(buffer: ArrayBuffer): jvmType, ColumnType.put(buffer: ArrayBuffer)

  • These APIs get a primitive value from, or put a primitive value into, the current position of the given ArrayBuffer.
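
A minimal sketch of the idea, here specialized to an int-backed buffer in Java; the PR's actual ArrayBuffer is a Scala class wrapping Array[_] generically, and the class and method names below are only illustrative.

```java
// Hedged sketch of a position-managed wrapper over a primitive array,
// analogous to the PR's ArrayBuffer. get()/put() access the element at the
// current position and then advance it, like ColumnType.get/put above.
public class IntArrayBuffer {
    private final int[] array;
    private int position = 0;

    IntArrayBuffer(int[] array) { this.array = array; }

    int get() { return array[position++]; }            // read at position, advance
    void put(int value) { array[position++] = value; } // write at position, advance
    void rewind() { position = 0; }                    // reset for re-reading

    public static void main(String[] args) {
        IntArrayBuffer buf = new IntArrayBuffer(new int[3]);
        buf.put(10); buf.put(20); buf.put(30);
        buf.rewind();
        System.out.println(buf.get()); // 10
    }
}
```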

Encoder.gatherCompressibilityStats(in: ArrayBuffer)

  • This API calculates the uncompressed and compressed sizes for a given compression scheme.
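
To sketch what such a stats pass might look like for run-length encoding: walk the input once, count runs, and derive both sizes. This is an assumption about the general shape, not the PR's actual Encoder.gatherCompressibilityStats, and the (value, runLength) layout below is hypothetical.

```java
// Hedged sketch: compute uncompressed vs. RLE-compressed size for int data,
// assuming each run is stored as (value: 4 bytes, runLength: 4 bytes).
public class RleStats {
    // Returns {uncompressedBytes, compressedBytes}.
    static long[] stats(int[] values) {
        long runs = values.length == 0 ? 0 : 1;
        for (int i = 1; i < values.length; i++) {
            if (values[i] != values[i - 1]) runs++; // a new run starts here
        }
        return new long[] {4L * values.length, 8L * runs};
    }

    public static void main(String[] args) {
        long[] s = stats(new int[] {7, 7, 7, 7, 3, 3});
        System.out.println("uncompressed=" + s[0] + " compressed=" + s[1]); // 24 vs 16
    }
}
```

An encoder would compare the two sizes and fall back to no compression when RLE would not actually shrink the data.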

Encoder.compress(from: ArrayBuffer, to: ArrayBuffer): Unit

  • This API compresses the data in from and stores the compressed data into to. to must wrap a byte array large enough to hold the compressed data.

Decoder.decompress(values: ArrayBuffer): Unit

  • This API decompresses the data passed to the Decoder constructor and stores the uncompressed data into values. values must wrap an array large enough to hold the uncompressed data.
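
The compress/decompress pair can be sketched as a RunLengthEncoding round trip. This is a simplified stand-in: the PR's Encoder/Decoder operate on ArrayBuffer-wrapped arrays, while here plain int arrays take their place and runs are stored as (value, runLength) pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the RunLengthEncoding round trip on ints.
public class RleRoundTrip {
    // Compress: emit (value, runLength) for each run of equal values.
    static int[] compress(int[] from) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < from.length) {
            int value = from[i], run = 1;
            while (i + run < from.length && from[i + run] == value) run++;
            out.add(value);
            out.add(run);
            i += run;
        }
        int[] result = new int[out.size()];
        for (int j = 0; j < result.length; j++) result[j] = out.get(j);
        return result;
    }

    // Decompress: expand each (value, runLength) pair back into values.
    static int[] decompress(int[] compressed, int originalLength) {
        int[] values = new int[originalLength];
        int pos = 0;
        for (int j = 0; j < compressed.length; j += 2) {
            for (int k = 0; k < compressed[j + 1]; k++) values[pos++] = compressed[j];
        }
        return values;
    }

    public static void main(String[] args) {
        int[] in = {5, 5, 5, 2, 2, 9};
        int[] packed = compress(in);                  // {5,3, 2,2, 9,1}
        int[] out = decompress(packed, in.length);
        System.out.println(Arrays.equals(in, out));   // true
    }
}
```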

How was this patch tested?

Added new test suites

@SparkQA

SparkQA commented May 19, 2017

Test build #77091 has finished for PR 18033 at commit 6d5497e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayBuffer(array: Array[_])
  • class ColumnVectorCompressionBuilder[T <: AtomicType](dataType: T)

@kiszk kiszk changed the title Add compression/decompression of column data to ColumnVector [SPARK-20807][SQL] Add compression/decompression of column data to ColumnVector May 19, 2017
@SparkQA

SparkQA commented May 19, 2017

Test build #77092 has finished for PR 18033 at commit 193a71b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member Author

kiszk commented May 19, 2017

@hvanhovell Could you please take a look? cc @sameeragarwal

@kiszk
Member Author

kiszk commented May 23, 2017

@hvanhovell Would it be possible to review this, or could you let us know who would be appropriate to review it?
cc @sameeragarwal

@kiszk
Member Author

kiszk commented May 29, 2017

ping @hvanhovell

@kiszk
Member Author

kiszk commented Jun 3, 2017

ping @hvanhovell @sameeragarwal

@kiszk
Member Author

kiszk commented Jun 8, 2017

ping @hvanhovell @sameeragarwal

@kiszk kiszk closed this Oct 4, 2017