Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-160: avoid wasting 64K per empty buffer. #98

Closed

Conversation

julienledem
Copy link
Member

This buffer initializes itself to a default size when instantiated.
This leads to a lot of unused small buffers when there are a lot of empty columns.

this.path = path;
this.pageWriter = pageWriter;
resetStatistics();
this.repetitionLevelColumn = new RunLengthBitPackingHybridEncoder(getWidthFromMaxInt(path.getMaxRepetitionLevel()), initialSizePerCol);
this.definitionLevelColumn = new RunLengthBitPackingHybridEncoder(getWidthFromMaxInt(path.getMaxDefinitionLevel()), initialSizePerCol);
this.dataColumn = parquetProps.getValuesWriter(path, initialSizePerCol);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should tweak the initialSize here.
levels should get a tiny initial size (100 bytes?) in case they are always null or always defined.

@julienledem
Copy link
Member Author

int nextSlabSize;
if (size == 0) {
nextSlabSize = initialSize;
} else if (size > pageSize / 5) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should 5 be configurable too?

we could also make CapacityByteArrayOutputStream abstract or take as an argument a slab size calculator etc. so that we can plug in different behaviors here. what do you think?

@isnotinvain
Copy link
Contributor

Do you want to tweak the initial size here as well?
Do you think we should try this out internally and see if it's an improvement first?

@isnotinvain
Copy link
Contributor

@julienledem ping!

@isnotinvain
Copy link
Contributor

Sent a PR against this PR here: julienledem#2

@isnotinvain
Copy link
Contributor

@tsdeng ok, this PR is now ready to review, it's got both @julienledem's changes and mine as well.

@@ -40,6 +42,7 @@

class ColumnChunkPageWriteStore implements PageWriteStore {
private static final Log LOG = Log.getLog(ColumnChunkPageWriteStore.class);
private static final int COLUMN_CHUNK_WRITER_MAX_SIZE_HINT = 64 * 1024;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm this, not used.

…nledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageWriteStore.java
	parquet-hadoop/src/test/java/parquet/hadoop/TestColumnChunkPageWriteStore.java
@isnotinvain
Copy link
Contributor

+1, lets merge when the tests are green

@isnotinvain
Copy link
Contributor

I'm running these tests here:
https://github.com/isnotinvain/incubator-parquet-mr/pull/2

in case we have to wait a long time for the travis CI apache queue.

@isnotinvain
Copy link
Contributor

Tests passed! merging now...

@asfgit asfgit closed this in d084ad2 Mar 5, 2015
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
This buffer initializes itself to a default size when instantiated.
This leads to a lot of unused small buffers when there are a lot of empty columns.

Author: Alex Levenson <[email protected]>
Author: julien <[email protected]>
Author: Julien Le Dem <[email protected]>

Closes apache#98 from julienledem/avoid_wasting_64K_per_empty_buffer and squashes the following commits:

b0200dd [julien] add license
a1b278e [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
5304ee1 [julien] remove unused constant
81e399f [julien] Merge branch 'avoid_wasting_64K_per_empty_buffer' of github.com:julienledem/incubator-parquet-mr into avoid_wasting_64K_per_empty_buffer
ccf677d [julien] Merge branch 'master' into avoid_wasting_64K_per_empty_buffer
37148d6 [Julien Le Dem] Merge pull request #2 from isnotinvain/PR-98
b9abab0 [Alex Levenson] Address Julien's comment
965af7f [Alex Levenson] one more typo
9939d8d [Alex Levenson] fix typos in comments
61c0100 [Alex Levenson] Make initial slab size heuristic into a helper method, apply in DictionaryValuesWriter as well
a257ee4 [Alex Levenson] Improve IndexOutOfBoundsException message
64d6c7f [Alex Levenson] update comments
8b54667 [Alex Levenson] Don't use CapacityByteArrayOutputStream for writing page chunks
6a20e8b [Alex Levenson] Remove initialSlabSize decision from InternalParquetRecordReader, use a simpler heuristic in the column writers instead
3a0f8e4 [Alex Levenson] Use simpler settings for column chunk writer
b2736a1 [Alex Levenson] Some cleanup in CapacityByteArrayOutputStream
1df4a71 [julien] refactor CapacityByteArray to be aware of page size
95c8fb6 [julien] avoid wasting 64K per empty buffer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants