PARQUET-160: avoid wasting 64K per empty buffer. #98

julienledem · 2015-01-06T23:38:07Z

This buffer initializes itself to a default size when instantiated.
This leads to a lot of unused small buffers when there are a lot of empty columns.

julienledem · 2015-01-08T19:54:11Z

parquet-column/src/main/java/parquet/column/impl/ColumnWriterV2.java

    this.path = path;
    this.pageWriter = pageWriter;
    resetStatistics();
-    this.repetitionLevelColumn = new RunLengthBitPackingHybridEncoder(getWidthFromMaxInt(path.getMaxRepetitionLevel()), initialSizePerCol);
-    this.definitionLevelColumn = new RunLengthBitPackingHybridEncoder(getWidthFromMaxInt(path.getMaxDefinitionLevel()), initialSizePerCol);
-    this.dataColumn = parquetProps.getValuesWriter(path, initialSizePerCol);


We should tweak the initial size here.
Levels should get a tiny initial size (100 bytes?) in case the values are always null or always defined.
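A minimal sketch of how such a small initial size could be derived, assuming the buffer grows by doubling its slab size. The class name, method name, and parameters below are hypothetical and are not taken from this PR's diff; the idea is only that the first slab is sized so that a handful of doublings would reach the target capacity, with a floor so it never gets uselessly small.

```java
/**
 * Illustrative only: picks a small first slab so that empty or near-empty
 * columns do not each hold a large, mostly unused buffer.
 * All names here are hypothetical, not the PR's actual API.
 */
final class InitialSlabSizeSketch {

  private InitialSlabSizeSketch() {
  }

  /**
   * @param minSlabSize    floor for the first slab, in bytes (for example ~100 for levels)
   * @param targetCapacity size the buffer is expected to reach when full, in bytes
   * @param targetNumSlabs number of doublings allowed before reaching targetCapacity
   * @return the size of the first slab, in bytes
   */
  static int initialSlabSize(int minSlabSize, int targetCapacity, int targetNumSlabs) {
    if (minSlabSize <= 0 || targetCapacity <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    if (targetNumSlabs < 0 || targetNumSlabs > 30) {
      throw new IllegalArgumentException("targetNumSlabs must be in [0, 30]");
    }
    // targetCapacity / 2^targetNumSlabs, never smaller than the floor
    return Math.max(minSlabSize, targetCapacity >> targetNumSlabs);
  }
}
```

With a 64-byte floor, a 1 MiB target, and 10 doublings, `initialSlabSize(64, 1024 * 1024, 10)` gives 1024 bytes; with a 100-byte floor and a 4 KiB target it falls back to the 100-byte floor.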

julienledem · 2015-01-08T19:57:06Z

The initial size here should be tweaked as well to something smaller:
https://github.com/julienledem/incubator-parquet-mr/blob/1df4a71d5a8f6e7c0adae142ce16bfccd34de999/parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordWriter.java#L111

isnotinvain · 2015-01-29T01:28:50Z

parquet-encoding/src/main/java/parquet/bytes/CapacityByteArrayOutputStream.java

+    int nextSlabSize;
+    if (size == 0) {
+      nextSlabSize = initialSize;
+    } else if (size > pageSize / 5) {


Should 5 be configurable too?

We could also make CapacityByteArrayOutputStream abstract, or have it take a slab size calculator as an argument, so that we can plug in different behaviors here. What do you think?
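A rough sketch of what a pluggable slab size calculator might look like. The interface and class names are hypothetical. Only the two conditions (`size == 0` and `size > pageSize / 5`) come from the diff above; the values returned in each branch, and the configurable divisor standing in for the hard-coded 5, are assumptions made for illustration.

```java
/** Hypothetical strategy interface: decides how big the next slab should be. */
interface SlabSizeCalculator {
  /** @param bytesAllocated total bytes allocated so far across all slabs */
  int nextSlabSize(int bytesAllocated);
}

/**
 * Hypothetical page-aware strategy: start small, double while the buffer is
 * small, then grow linearly in steps of pageSize / divisor.
 */
final class PageAwareSlabSizeCalculator implements SlabSizeCalculator {
  private final int initialSize;
  private final int maxSlabSize;

  PageAwareSlabSizeCalculator(int initialSize, int pageSize, int divisor) {
    if (initialSize <= 0 || pageSize <= 0 || divisor <= 0) {
      throw new IllegalArgumentException("all arguments must be positive");
    }
    if (pageSize / divisor == 0) {
      throw new IllegalArgumentException("pageSize must be at least divisor");
    }
    this.initialSize = initialSize;
    this.maxSlabSize = pageSize / divisor; // the diff hard-codes divisor = 5
  }

  @Override
  public int nextSlabSize(int bytesAllocated) {
    if (bytesAllocated == 0) {
      return initialSize;       // first slab: configured initial size
    } else if (bytesAllocated > maxSlabSize) {
      return maxSlabSize;       // assumed: linear growth near the page size
    } else {
      return bytesAllocated;    // assumed: double the total allocation
    }
  }
}
```

A buffer class could accept a `SlabSizeCalculator` in its constructor and call `nextSlabSize` whenever it runs out of room, which would keep the growth policy out of the stream itself.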

isnotinvain · 2015-01-29T01:40:08Z

Do you want to tweak the initial size here as well?
Do you think we should try this out internally and see if it's an improvement first?

isnotinvain · 2015-02-05T03:06:11Z

@julienledem ping!

isnotinvain · 2015-02-21T04:44:10Z

Sent a PR against this PR here: julienledem#2

isnotinvain · 2015-02-28T22:13:37Z

@tsdeng OK, this PR is now ready for review. It has both @julienledem's changes and mine.

isnotinvain · 2015-02-28T22:17:32Z

parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java

@@ -40,6 +42,7 @@

 class ColumnChunkPageWriteStore implements PageWriteStore {
  private static final Log LOG = Log.getLog(ColumnChunkPageWriteStore.class);
+  private static final int COLUMN_CHUNK_WRITER_MAX_SIZE_HINT = 64 * 1024;


Remove this; it's not used.

isnotinvain · 2015-03-05T00:25:17Z

+1, let's merge when the tests are green.

isnotinvain · 2015-03-05T00:41:29Z

I'm running these tests here:
https://github.com/isnotinvain/incubator-parquet-mr/pull/2

in case we have to wait a long time in the Apache Travis CI queue.

isnotinvain · 2015-03-05T01:25:48Z

Tests passed! Merging now...

This buffer initializes itself to a default size when instantiated. This leads to a lot of unused small buffers when there are a lot of empty columns.

Author: Alex Levenson <[email protected]>
Author: julien <[email protected]>
Author: Julien Le Dem <[email protected]>

Closes apache#98 from julienledem/avoid_wasting_64K_per_empty_buffer and squashes the following commits:

b0200dd [julien] add license
a1b278e [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
5304ee1 [julien] remove unused constant
81e399f [julien] Merge branch 'avoid_wasting_64K_per_empty_buffer' of github.com:julienledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
ccf677d [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
37148d6 [Julien Le Dem] Merge pull request #2 from isnotinvain/PR-98
b9abab0 [Alex Levenson] Address Julien's comment
965af7f [Alex Levenson] one more typo
9939d8d [Alex Levenson] fix typos in comments
61c0100 [Alex Levenson] Make initial slab size heuristic into a helper method, apply in DictionaryValuesWriter as well
a257ee4 [Alex Levenson] Improve IndexOutOfBoundsException message
64d6c7f [Alex Levenson] update comments
8b54667 [Alex Levenson] Don't use CapacityByteArrayOutputStream for writing page chunks
6a20e8b [Alex Levenson] Remove initialSlabSize decision from InternalParquetRecordReader, use a simpler heuristic in the column writers instead
3a0f8e4 [Alex Levenson] Use simpler settings for column chunk writer
b2736a1 [Alex Levenson] Some cleanup in CapacityByteArrayOutputStream
1df4a71 [julien] refactor CapacityByteArray to be aware of page size
95c8fb6 [julien] avoid wasting 64K per empty buffer.

julienledem added 2 commits January 6, 2015 15:34

avoid wasting 64K per empty buffer.

95c8fb6

refactor CapacityByteArray to be aware of page size

1df4a71

julienledem reviewed Jan 8, 2015

isnotinvain reviewed Jan 29, 2015

isnotinvain added 4 commits February 20, 2015 16:37

Some cleanup in CapacityByteArrayOutputStream

b2736a1

Use simpler settings for column chunk writer

3a0f8e4

Remove initialSlabSize decision from InternalParquetRecordReader, use a simpler heuristic in the column writers instead

6a20e8b

Don't use CapacityByteArrayOutputStream for writing page chunks

8b54667

isnotinvain and others added 7 commits February 20, 2015 20:46

update comments

64d6c7f

Improve IndexOutOfBoundsException message

a257ee4

Make initial slab size heuristic into a helper method, apply in DictionaryValuesWriter as well

61c0100

fix typos in comments

9939d8d

one more typo

965af7f

Address Julien's comment

b9abab0

Merge pull request #2 from isnotinvain/PR-98

37148d6

Updates to PR-98

isnotinvain reviewed Feb 28, 2015

julienledem added 5 commits March 4, 2015 13:53

Merge branch 'master' into avoid_wasting_64K_per_empty_buffer

ccf677d

Merge branch 'avoid_wasting_64K_per_empty_buffer' of github.com:julienledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer

81e399f

remove unused constant

5304ee1

Merge branch 'master' into avoid_wasting_64K_per_empty_buffer

a1b278e

Conflicts: parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java parquet-hadoop/src/test/java/parquet/hadoop/TestColumnChunkPageWriteStore.java

add license

b0200dd

asfgit closed this in d084ad2 Mar 5, 2015
