Conversation

@chenjunjiedada
Contributor

The original pull request was based on master. This one is created against the bloom-filter branch.

@chenjunjiedada
Contributor Author

Hi @rdblue, @majetideepak, do you have time to take a look at this as well?

private byte[] bitset;

// An IntBuffer view of the underlying bitset to help with setting bits.
private IntBuffer intBuffer;
Contributor Author

Jim's comments: https://github.com/apache/parquet-mr/pull/521#discussion_r228761288

I finally got some time for this. JMeter does not seem suited to showing the memory layout of the Java heap, so I wrote an example application and used jmap to dump the heap. The code snippet is:

        byte[] bitset = new byte[1024 * 1024];
        IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
        ByteBuffer byteBuffer = ByteBuffer.allocate(1024 * 1024);
        byte[] bitset2 = new byte[1024 * 1024];

The Eden space in the heap dumps after each statement is:

Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 4739624 (4.520057678222656MB)
free = 61320664 (58.479942321777344MB)
7.17469472733755% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 5788216 (5.520072937011719MB)
free = 60272072 (57.47992706298828MB)
8.762020534939236% used

The Eden space does not increase when we call ByteBuffer.wrap, since wrap creates a view over the existing array rather than copying it, while ByteBuffer.allocate and the second new byte[] each add about 1 MB.
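To illustrate why wrap does not allocate a new backing buffer, here is a small standalone sketch (the class and method names are mine, for illustration only): a write through the IntBuffer view returned by wrap is visible in the original byte array, showing the two share storage.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class WrapDemo {
    // Writes through an IntBuffer view obtained via ByteBuffer.wrap and
    // returns the original array, showing that wrap shares storage
    // instead of copying, which matches the jmap observation above.
    static byte[] writeThroughView() {
        byte[] bitset = new byte[8];
        IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
        intBuffer.put(0, 1); // stores int 1 into bytes 0..3 (big-endian by default)
        return bitset;
    }

    public static void main(String[] args) {
        byte[] bitset = writeThroughView();
        System.out.println(bitset[3]); // prints 1: the view wrote into the wrapped array
    }
}
```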

@chenjunjiedada
Contributor Author

@jbapple-cloudera, I kept the one comment from last time that had not been addressed here. All other comments should be addressed; they can also be found in https://github.com/apache/parquet-mr/pull/521.

@gszadovszky gszadovszky left a comment

I've only looked through this change. Sorry if I ask something that was already answered in the former PR; I did not have time to read through everything.

@chenjunjiedada
Contributor Author

chenjunjiedada commented Jan 15, 2019

Thanks @gszadovszky. These comments are very useful. I will rebase the PR onto master and update it according to your comments ASAP.

@chenjunjiedada chenjunjiedada changed the base branch from bloom-filter to master January 16, 2019 13:31
@gszadovszky
Contributor

gszadovszky commented Jan 16, 2019

The build has failed because SNAPSHOT dependencies are not allowed on the master branch. You will not be able to merge this change to master until the required changes are in a parquet-format release and this PR depends on that new release.

The current bloom-filter feature branch has no changes yet. If you want, I can push the actual master content there so you can re-target this rebased PR to the feature branch.

@chenjunjiedada
Contributor Author

Thanks very much. Please help merge master into the feature branch.

So the last commit passed the build because it targets the feature branch. Now I understand.

@gszadovszky
Contributor

gszadovszky commented Jan 16, 2019

The branch bloom-filter is now at the current master. You may re-target this PR to bloom-filter.

@chenjunjiedada chenjunjiedada changed the base branch from master to bloom-filter January 16, 2019 15:51
@gszadovszky gszadovszky left a comment

LGTM.
Since it is a PR to a feature branch, I'll push soon.

@gszadovszky gszadovszky merged commit d473d17 into apache:bloom-filter Jan 21, 2019

if (columnNames.length == expectedNDVs.length) {
for (int i = 0; i < columnNames.length; i++) {
kv.put(columnNames[i], Long.getLong(expectedNDVs[i]));
@chenjunjiedada
I have a small question about BLOOM_FILTER_EXPECTED_NDV.
Why do we read a system property after setting parquet.bloom.filter.expected.ndv? Wouldn't it be better to just parse the string with Long.parseLong()?

Contributor Author

@garawalid, usually the compute engine, such as Hive or Spark, sets the properties for Parquet, so we may not be able to get the string directly other than from the configuration.

@garawalid garawalid Jan 2, 2020
@chenjunjiedada Thanks for the clarification.
In fact, when I pass parquet.bloom.filter.column.names and parquet.bloom.filter.expected.ndv this way:

    // Configuration conf = new Configuration();
    conf.set("parquet.bloom.filter.column.names", "content,line");
    conf.set("parquet.bloom.filter.expected.ndv","1000,200");

I got kv as follows: "line" -> null and "content" -> null.

You can reproduce this behavior by adding that configuration to testReadWrite() in TestInputOutputFormat.java.

Do you think that's normal? In my case, I changed Long.getLong(expectedNDVs[i]) to Long.parseLong(expectedNDVs[i]) to build the kv HashMap.

@chenjunjiedada
Contributor Author

@garawalid, I can get the configuration when I put the settings in runMapReduceJob. Where did you put the settings?

@garawalid
Contributor

@chenjunjiedada
Strange! I added the configuration here: https://github.com/apache/parquet-mr/blob/1fc273329cd6d54085de81ecccfe99be181f24be/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/example/TestInputOutputFormat.java#L234-L240

Configuration:

    conf.set("parquet.bloom.filter.column.names", "content,line");
    conf.set("parquet.bloom.filter.expected.ndv","1000,200");

@garawalid
Contributor

@chenjunjiedada
I noticed that expectedNDVs[i] holds the string value as expected ("1000" in our case). The problem comes when we convert it with Long.getLong(), which looks up a system property named "1000" rather than parsing the number, so it returns null here.
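The difference between the two methods can be demonstrated with a minimal sketch (the class name is hypothetical): Long.getLong treats its argument as the name of a system property, while Long.parseLong parses the string itself.

```java
public class NdvParsing {
    public static void main(String[] args) {
        String ndv = "1000";
        // Long.getLong looks for a system property named "1000"
        // (i.e. something set via -D1000=...); none exists, so it returns null.
        Long viaGetLong = Long.getLong(ndv);
        // Long.parseLong parses the string itself as a decimal number.
        long viaParseLong = Long.parseLong(ndv);
        System.out.println(viaGetLong + " vs " + viaParseLong); // prints: null vs 1000
    }
}
```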

chenjunjiedada added a commit to chenjunjiedada/parquet-mr that referenced this pull request Jan 7, 2020
gszadovszky pushed a commit that referenced this pull request Feb 26, 2020
* PARQUET-1328: Add Bloom filter reader and writer (#587)
* PARQUET-1516: Store Bloom filters near to footer (#608)
* PARQUET-1391: Integrate Bloom filter logic (#619)
* PARQUET-1660: align Bloom filter implementation with format (#686)
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Mar 1, 2020
* PARQUET-1328: Add Bloom filter reader and writer (apache#587)
* PARQUET-1516: Store Bloom filters near to footer (apache#608)
* PARQUET-1391: Integrate Bloom filter logic (apache#619)
* PARQUET-1660: align Bloom filter implementation with format (apache#686)