PARQUET-1328: Add Bloom filter reader and writer #587
Conversation
Hi @rdblue, @majetideepak, do you have time to take a look at this as well?
```java
private byte[] bitset;

// An integer array buffer over the underlying bitset to help set bits.
private IntBuffer intBuffer;
```
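For context, the `IntBuffer` here is a view over the same `byte[] bitset`, so bits can be set through 32-bit word operations without keeping a second copy of the data. A minimal sketch of this pattern (the class and method names are illustrative, not the PR's actual code; only the `bitset`/`intBuffer` fields come from the snippet above):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Illustrative sketch: a byte[] bitset with an IntBuffer view for word-wise bit access.
class BitsetSketch {
    private final byte[] bitset;
    private final IntBuffer intBuffer;

    BitsetSketch(int numBytes) {
        bitset = new byte[numBytes];
        // asIntBuffer() creates a view over the same array; no extra copy is allocated.
        intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
    }

    void setBit(int bitIndex) {
        int word = bitIndex >>> 5;        // which 32-bit word
        int mask = 1 << (bitIndex & 31);  // which bit inside that word
        intBuffer.put(word, intBuffer.get(word) | mask);
    }

    boolean getBit(int bitIndex) {
        int word = bitIndex >>> 5;
        int mask = 1 << (bitIndex & 31);
        return (intBuffer.get(word) & mask) != 0;
    }
}
```

Because the view shares the backing array, any bit set through `intBuffer` is immediately visible in `bitset`, which is what gets serialized.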
Jim's comments: https://github.com/apache/parquet-mr/pull/521#discussion_r228761288

Just got some time for this. JMeter does not seem suitable for showing the memory layout of the Java heap, so I wrote an example application and used jmap to dump the heap. The code snippet is:

```java
byte[] bitset = new byte[1024 * 1024];
IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
ByteBuffer byteBuffer = ByteBuffer.allocate(1024 * 1024);
byte[] bitset2 = new byte[1024 * 1024];
```

The Eden space in the heap dumps after each statement is:
```
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 3691032 (3.5200424194335938MB)
   free     = 62369256 (59.479957580566406MB)
   5.587368919735863% used
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 3691032 (3.5200424194335938MB)
   free     = 62369256 (59.479957580566406MB)
   5.587368919735863% used
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 4739624 (4.520057678222656MB)
   free     = 61320664 (58.479942321777344MB)
   7.17469472733755% used
Eden Space:
   capacity = 66060288 (63.0MB)
   used     = 5788216 (5.520072937011719MB)
   free     = 60272072 (57.47992706298828MB)
   8.762020534939236% used
```
The Eden space does not increase when we call the wrap API of ByteBuffer.
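This matches the documented `ByteBuffer.wrap` contract: the returned buffer is backed by the given array, so no new storage is allocated and writes through the `IntBuffer` view are visible in the original `byte[]`. A small check of that sharing (variable names are just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class WrapSharing {
    public static void main(String[] args) {
        byte[] bitset = new byte[8];
        // wrap() does not copy: the buffer uses bitset as its backing array.
        IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();

        // Write one int through the view.
        intBuffer.put(0, 0x01020304);

        // ByteBuffer.wrap defaults to big-endian, so the bytes land as 01 02 03 04.
        System.out.println(bitset[0] + " " + bitset[1] + " " + bitset[2] + " " + bitset[3]);
        // prints: 1 2 3 4
    }
}
```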
@jbapple-cloudera, I kept the last comment that was not addressed last time here. All other comments should be addressed; they can also be found at https://github.com/apache/parquet-mr/pull/521.
gszadovszky
left a comment
I've only looked through this change. Sorry if I ask something that was already answered in the former PR; I did not have time to read through everything.
Resolved review threads (outdated):
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java
Thanks @gszadovszky. These comments are very useful; I will rebase the PR onto master and update it according to your comments ASAP.
Force-pushed from 58050ff to ab4a546.
The build has failed because you are not allowed to use SNAPSHOT dependencies on the master branch. You will not be able to merge this change to master before the required changes are in a parquet-format release and this PR depends on that new release. The current bloom-filter feature branch has no changes yet. If you want, I can push the actual master content there so you can re-target this rebased PR to the feature branch.
Thanks very much. Please help merge master into the feature branch so that the last commit passes the build, since it targets the feature branch. Now I understand.
The branch
gszadovszky
left a comment
LGTM.
Since it is a PR to a feature branch, I'll push soon.
```java
if (columnNames.length == expectedNDVs.length) {
  for (int i = 0; i < columnNames.length; i++) {
    kv.put(columnNames[i], Long.getLong(expectedNDVs[i]));
```
@chenjunjiedada
I have a small question about BLOOM_FILTER_EXPECTED_NDV.
Why do we get the system property after setting parquet.bloom.filter.expected.ndv? Wouldn't it be better to just parse the string with Long.parseLong()?
@garawalid, usually the compute engine, such as Hive, Spark, etc., sets the properties for Parquet, so we may not be able to get the string directly other than from the configuration.
@chenjunjiedada Thanks for the clarification.
In fact, when I pass parquet.bloom.filter.column.names and parquet.bloom.filter.expected.ndv this way:

```java
// Configuration conf = new Configuration();
conf.set("parquet.bloom.filter.column.names", "content,line");
conf.set("parquet.bloom.filter.expected.ndv", "1000,200");
```

I got kv as follows: "line" -> null and "content" -> null.
You can reproduce this behavior by adding that configuration to testReadWrite() in TestInputOutputFormat.java.
Do you think that's normal? In my case, I replaced Long.getLong(expectedNDVs[i]) with Long.parseLong(expectedNDVs[i]) to build the kv HashMap.
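The null values follow from the standard library semantics: Long.getLong(String) treats its argument as the NAME of a system property to look up (so Long.getLong("1000") is null unless a property literally called "1000" is set), while Long.parseLong(String) parses the string itself as a number. A quick illustration of the difference:

```java
public class NdvParsing {
    public static void main(String[] args) {
        String ndv = "1000";

        // Long.getLong looks up a system property named "1000" -> null by default.
        Long fromProperty = Long.getLong(ndv);
        // Long.parseLong parses the string as a numeric value -> 1000.
        long parsed = Long.parseLong(ndv);

        System.out.println(fromProperty + " " + parsed);
        // prints: null 1000
    }
}
```

This is why the kv map ends up with null NDV values when the configuration string is fed to Long.getLong.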
@garawalid, I can get the configuration when putting the settings in
@chenjunjiedada Configuration:

```java
conf.set("parquet.bloom.filter.column.names", "content,line");
conf.set("parquet.bloom.filter.expected.ndv", "1000,200");
```
@chenjunjiedada
* PARQUET-1328: Add Bloom filter reader and writer (apache#587)
* PARQUET-1516: Store Bloom filters near to footer (apache#608)
* PARQUET-1391: Integrate Bloom filter logic (apache#619)
* PARQUET-1660: align Bloom filter implementation with format (apache#686)
The original pull request was based on master. This one is created against the bloom-filter branch.