PARQUET-1328: Add Bloom filter reader and writer #587
Conversation
Hi @rdblue, @majetideepak, do you have time to take a look at this as well?
```java
private byte[] bitset;

// An integer array buffer over the underlying bitset to help set bits.
private IntBuffer intBuffer;
```
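For context, the `IntBuffer` here is a view over the same `byte[] bitset`, so bits can be set through 32-bit word operations without keeping a second copy of the data. A minimal sketch of this pattern (the class and method names are illustrative, not the PR's actual code; only the `bitset`/`intBuffer` fields come from the snippet above):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Illustrative sketch: a byte[] bitset with an IntBuffer view for word-wise bit access.
class BitsetSketch {
    private final byte[] bitset;
    private final IntBuffer intBuffer;

    BitsetSketch(int numBytes) {
        bitset = new byte[numBytes];
        // asIntBuffer() creates a view over the same array; no extra copy is allocated.
        intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
    }

    void setBit(int bitIndex) {
        int word = bitIndex >>> 5;        // which 32-bit word
        int mask = 1 << (bitIndex & 31);  // which bit inside that word
        intBuffer.put(word, intBuffer.get(word) | mask);
    }

    boolean getBit(int bitIndex) {
        int word = bitIndex >>> 5;
        int mask = 1 << (bitIndex & 31);
        return (intBuffer.get(word) & mask) != 0;
    }
}
```

Because the view shares the backing array, any bit set through `intBuffer` is immediately visible in `bitset`, which is what gets serialized.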
Jim's comments: https://github.com/apache/parquet-mr/pull/521#discussion_r228761288

Just got some time for this. JMeter does not seem suitable for showing the memory layout of the Java heap, so I wrote an example application and used jmap to dump the heap. The code snippet is:

```java
byte[] bitset = new byte[1024 * 1024];
IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
ByteBuffer byteBuffer = ByteBuffer.allocate(1024 * 1024);
byte[] bitset2 = new byte[1024 * 1024];
```

The Eden space in the heap dumps after each statement is:
```
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 3691032 (3.5200424194335938MB)
   free     = 62369256 (59.479957580566406MB)
   5.587368919735863% used
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 3691032 (3.5200424194335938MB)
   free     = 62369256 (59.479957580566406MB)
   5.587368919735863% used
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 4739624 (4.520057678222656MB)
   free     = 61320664 (58.479942321777344MB)
   7.17469472733755% used
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 5788216 (5.520072937011719MB)
   free     = 60272072 (57.47992706298828MB)
   8.762020534939236% used
```
The Eden space does not increase when we call the wrap API of ByteBuffer.
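This matches the documented `ByteBuffer.wrap` contract: the returned buffer is backed by the given array, so no new storage is allocated and writes through the `IntBuffer` view are visible in the original `byte[]`. A small check of that sharing (variable names are just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class WrapSharing {
    public static void main(String[] args) {
        byte[] bitset = new byte[8];
        // wrap() does not copy: the buffer uses bitset as its backing array.
        IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();

        // Write one int through the view.
        intBuffer.put(0, 0x01020304);

        // ByteBuffer.wrap defaults to big-endian, so the bytes land as 01 02 03 04.
        System.out.println(bitset[0] + " " + bitset[1] + " " + bitset[2] + " " + bitset[3]);
        // prints: 1 2 3 4
    }
}
```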
@jbapple-cloudera, I kept the last comment that was not addressed last time here. All other comments should be addressed; they can also be found at https://github.com/apache/parquet-mr/pull/521.
gszadovszky
left a comment
I've only looked through this change. Sorry if I ask something that was already answered in the former PR; I did not have time to read through everything.
Resolved review threads (outdated):
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java
Thanks @gszadovszky. These comments are very useful; I will rebase the PR onto master and update it according to your comments ASAP.
Force-pushed from 58050ff to ab4a546.
The build has failed because you are not allowed to use SNAPSHOT dependencies on the master branch. You will not be able to merge this change to master before the required changes are in a parquet-format release and this PR depends on that new release. The current bloom-filter feature branch has no changes yet. If you want, I can push the actual master content there so you can re-target this rebased PR to the feature branch.
Thanks very much. Please help merge master into the feature branch so that the last commit passes the build, since it targets the feature branch. Now I understand.
The branch
gszadovszky
left a comment
LGTM.
Since it is a PR to a feature branch, I'll push soon.
```java
if (columnNames.length == expectedNDVs.length) {
  for (int i = 0; i < columnNames.length; i++) {
    kv.put(columnNames[i], Long.getLong(expectedNDVs[i]));
```
@chenjunjiedada
I have a small question about BLOOM_FILTER_EXPECTED_NDV.
Why do we get the system property after setting parquet.bloom.filter.expected.ndv? Wouldn't it be better to just parse the string with Long.parseLong()?
@garawalid, usually the compute engine, such as Hive, Spark, etc., sets the properties for Parquet, so we may not be able to get the string directly other than from the configuration.
@chenjunjiedada Thanks for the clarification.
In fact, when I pass parquet.bloom.filter.column.names and parquet.bloom.filter.expected.ndv this way:

```java
// Configuration conf = new Configuration();
conf.set("parquet.bloom.filter.column.names", "content,line");
conf.set("parquet.bloom.filter.expected.ndv", "1000,200");
```

I got kv as follows: "line" -> null and "content" -> null.
You can reproduce this behavior by adding that configuration to testReadWrite() in TestInputOutputFormat.java.
Do you think that's normal? In my case, I replaced Long.getLong(expectedNDVs[i]) with Long.parseLong(expectedNDVs[i]) to build the kv HashMap.
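The null values follow from the standard library semantics: Long.getLong(String) treats its argument as the NAME of a system property to look up (so Long.getLong("1000") is null unless a property literally called "1000" is set), while Long.parseLong(String) parses the string itself as a number. A quick illustration of the difference:

```java
public class NdvParsing {
    public static void main(String[] args) {
        String ndv = "1000";

        // Long.getLong looks up a system property named "1000" -> null by default.
        Long fromProperty = Long.getLong(ndv);
        // Long.parseLong parses the string as a numeric value -> 1000.
        long parsed = Long.parseLong(ndv);

        System.out.println(fromProperty + " " + parsed);
        // prints: null 1000
    }
}
```

This is why the kv map ends up with null NDV values when the configuration string is fed to Long.getLong.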
@garawalid, I can get the configuration when putting the settings in
@chenjunjiedada Configuration:

```java
conf.set("parquet.bloom.filter.column.names", "content,line");
conf.set("parquet.bloom.filter.expected.ndv", "1000,200");
```
@chenjunjiedada
* PARQUET-1328: Add Bloom filter reader and writer (apache#587)
* PARQUET-1516: Store Bloom filters near to footer (apache#608)
* PARQUET-1391: Integrate Bloom filter logic (apache#619)
* PARQUET-1660: align Bloom filter implementation with format (apache#686)
The original pull request was based on master. This one is created against the bloom-filter branch.