-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-160: avoid wasting 64K per empty buffer. #98
PARQUET-160: avoid wasting 64K per empty buffer. #98
Conversation
this.path = path; | ||
this.pageWriter = pageWriter; | ||
resetStatistics(); | ||
this.repetitionLevelColumn = new RunLengthBitPackingHybridEncoder(getWidthFromMaxInt(path.getMaxRepetitionLevel()), initialSizePerCol); | ||
this.definitionLevelColumn = new RunLengthBitPackingHybridEncoder(getWidthFromMaxInt(path.getMaxDefinitionLevel()), initialSizePerCol); | ||
this.dataColumn = parquetProps.getValuesWriter(path, initialSizePerCol); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should tweak the initialSize here.
levels should get a tiny initial size (100 bytes?) in case they are always null or always defined.
The initial size here should be tweaked as well to something smaller: |
int nextSlabSize; | ||
if (size == 0) { | ||
nextSlabSize = initialSize; | ||
} else if (size > pageSize / 5) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should 5 be configurable too?
we could also make CapacityByteArrayOutputStream abstract or take as an argument a slab size calculator etc. so that we can plug in different behaviors here. what do you think?
Do you want to tweak the initial size here as well? |
@julienledem ping! |
… a simpler heuristic in the column writers instead
Sent a PR against this PR here: julienledem#2 |
…onaryValuesWriter as well
Updates to PR-98
@tsdeng ok, this PR is now ready to review, it's got both @julienledem's changes and mine as well. |
@@ -40,6 +42,7 @@ | |||
|
|||
class ColumnChunkPageWriteStore implements PageWriteStore { | |||
private static final Log LOG = Log.getLog(ColumnChunkPageWriteStore.class); | |||
private static final int COLUMN_CHUNK_WRITER_MAX_SIZE_HINT = 64 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rm this, not used.
…nledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
Conflicts: parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java parquet-hadoop/src/test/java/parquet/hadoop/TestColumnChunkPageWriteStore.java
+1, lets merge when the tests are green |
I'm running these tests here: in case we have to wait a long time for the travis CI apache queue. |
Tests passed! merging now... |
This buffer initializes itself to a default size when instantiated. This leads to a lot of unused small buffers when there are a lot of empty columns. Author: Alex Levenson <[email protected]> Author: julien <[email protected]> Author: Julien Le Dem <[email protected]> Closes apache#98 from julienledem/avoid_wasting_64K_per_empty_buffer and squashes the following commits: b0200dd [julien] add license a1b278e [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer 5304ee1 [julien] remove unused constant 81e399f [julien] Merge branch 'avoid_wasting_64K_per_empty_buffer' of github.com:julienledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer ccf677d [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer 37148d6 [Julien Le Dem] Merge pull request #2 from isnotinvain/PR-98 b9abab0 [Alex Levenson] Address Julien's comment 965af7f [Alex Levenson] one more typo 9939d8d [Alex Levenson] fix typos in comments 61c0100 [Alex Levenson] Make initial slab size heuristic into a helper method, apply in DictionaryValuesWriter as well a257ee4 [Alex Levenson] Improve IndexOutOfBoundsException message 64d6c7f [Alex Levenson] update comments 8b54667 [Alex Levenson] Don't use CapacityByteArrayOutputStream for writing page chunks 6a20e8b [Alex Levenson] Remove initialSlabSize decision from InternalParquetRecordReader, use a simpler heuristic in the column writers instead 3a0f8e4 [Alex Levenson] Use simpler settings for column chunk writer b2736a1 [Alex Levenson] Some cleanup in CapacityByteArrayOutputStream 1df4a71 [julien] refactor CapacityByteArray to be aware of page size 95c8fb6 [julien] avoid wasting 64K per empty buffer.
This buffer initializes itself to a default size when instantiated.
This leads to a lot of unused small buffers when there are a lot of empty columns.