Conversation

@chenjunjiedada
Contributor

This is an early PR for the Bloom filter reader and writer, intended to surface potential concerns with parquet-format and the Bloom filter spec.

@chenjunjiedada
Contributor Author

Hi @rdblue and @jbapple-cloudera

Do you want me to split this into separate reader and writer parts?

package org.apache.parquet.cli.util;

import com.google.common.base.Objects;
import com.google.common.base.MoreObjects;

Please keep changes like this in a different patch. Every patch should have a single purpose.

private final ByteBufferAllocator allocator;
private final ValuesWriterFactory valuesWriterFactory;
private final boolean enableBloomFilter;
private final HashMap<String, Long> bloomFilterInfo;

Please be more specific: what info?

/**
* Set Bloom filter info for columns.
*
* @param names the columns to be enable for Bloom filter

nit: "enabled"

* Set Bloom filter info for columns.
*
* @param names the columns to be enable for Bloom filter
* @param sizes the sizes corresponding to columns

How do you measure "size"? Do you mean the number of distinct values?

* @param sizes the sizes corresponding to columns
* @return this builder for method chaining
*/
public Builder withBloomFilterInfo(String names, String sizes) {

Why not List<Column> where class Column { String name; long countDistinct; } or maybe List<String> names, List<Long> sizes?
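For illustration, a minimal sketch of the structured-argument alternative suggested here; the Column class comes from the comment above, while the method body and the use of the bloomFilterInfo map are assumptions, not code from this patch:

import java.util.List;

// Hypothetical value class from the suggestion above.
class Column {
  final String name;
  final long countDistinct;

  Column(String name, long countDistinct) {
    this.name = name;
    this.countDistinct = countDistinct;
  }
}

// A builder method taking structured arguments instead of two parallel strings.
public Builder withBloomFilterInfo(List<Column> columns) {
  for (Column column : columns) {
    // Assumes the HashMap<String, Long> bloomFilterInfo field shown earlier.
    bloomFilterInfo.put(column.name, column.countDistinct);
  }
  return this;
}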

Contributor Author

If we want to use a List here, we have to parse the string into a List earlier, which requires a copy from the string array to a List.

import java.nio.IntBuffer;

/**
* A Bloom filter is a compact structure to indicate whether an item is not in a set or probably

Nit: we made a number of grammatical fixes to the parquet-cpp prose in the BF PR. Can you copy those to this patch, please?

(complete, thank you!)

// The underlying byte array for Bloom filter bitset.
private byte[] bitset;

// A integer array buffer of underlying bitset to help setting bits.

Why not just use an array of ints and no byte array at all?

Contributor Author

If we use an int array, we have to pay for one more copy when writing to the OutputStream. I haven't found a way to convert an int array to a byte array without a copy in Java.
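For reference, a minimal standalone sketch of the zero-copy idiom described here (not the patch's code; it is the same ByteBuffer.wrap(...).asIntBuffer() approach shown in the jmap experiment later in this thread):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class BitsetViewSketch {
  public static void main(String[] args) throws IOException {
    // One byte[] backs the bitset; the IntBuffer is only a view over it,
    // so bits can be set 32 at a time without keeping a second array.
    byte[] bitset = new byte[32];
    IntBuffer intView = ByteBuffer.wrap(bitset).asIntBuffer();

    intView.put(0, 0x1); // big-endian by default, so this sets bitset[3] to 1

    // The same byte[] is written out directly, with no int[] -> byte[] copy.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(bitset);
    System.out.println(bitset[3]); // prints 1
  }
}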

private byte[] bitset;

// A integer array buffer of underlying bitset to help setting bits.
private IntBuffer intBuffer;

Can you use JMeter to check if this is taking up 2x the memory because you use both bitset and intBuffer?

* instruction.
*/

public class BloomFilter {

This class deserves some tests.

* @param hashStrategy The hash strategy of Bloom filter.
* @param algorithm The algorithm of Bloom filter.
*/
private BloomFilter(int numBytes, HashStrategy hashStrategy, Algorithm algorithm) {

It looks like not just the prose, but also the code of this patch did not take into account all the feedback on the C++ code: apache/parquet-cpp#432

I'm going to pause my review for a moment so as not to overwhelm with repeated feedback and give you time to incorporate that feedback. This line made me notice the difference because there was a long discussion on the other patch about how to incorporate Algorithm algorithm into the constructor.

Contributor Author

Thanks @jbapple-cloudera for reviewing.

Sorry for the confusion this patch causes. It is just an early patch to show how a Bloom filter is integrated with ParquetFileReader and ParquetFileWriter, as you mentioned in the vote email. With this we can commit the parquet-format patch first, since without the parquet-format patch the reader and writer implementation cannot pass the build.

That works for me.

@jbapple-cloudera

@cjjnjust, now that parquet-format has the format change, are you ready to edit and proceed, or is there something else that should be done first?

@chenjunjiedada
Contributor Author

@jbapple-cloudera I'm refreshing the reader/writer today and will upload it a bit later.

@chenjunjiedada
Contributor Author

@jbapple-cloudera Looks like I have to rebase this first if I want to keep using this PR. Since we have a new separate branch, we could also open a new PR against that branch. Which one do you prefer?

I saw the Parquet sync notes say to start a vote on the design. I'm not sure whether that means we need to wait on the vote I started, but I think we can proceed with this in parallel as well.

import java.nio.IntBuffer;

/**
* A Bloom filter is a compact structure to indicate whether an item is not in a set or probably

(complete, thank you!)

* in a set. The Bloom filter usually consists of a bit set that represents a elements set,
* a hash strategy and a Bloom filter algorithm.
*/
public abstract class BloomFilter {

Why abstract class, rather than interface?

public abstract class BloomFilter {
// Bloom filter Hash strategy.
public enum HashStrategy {
MURMUR3_X64_128,

Suggested change
MURMUR3_X64_128,
MURMUR3_X64_128(0)
public final int encoding;

ordinal(), called below to serialize, is not guaranteed to be stable or equal to the ordering in parquet-cpp.

Contributor Author

(0) is not valid if the Bloom filter is changed to an interface.

OK, then in that case I understand why it should be an abstract class.

We need some way to ensure that the enum values are the same in Java and C++. That's what I meant when I said "ordinal(), called below to serialize, is not guaranteed to be stable or equal to the ordering in parquet-cpp."
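A sketch of what pinning the wire value explicitly could look like, following the suggested change above; the constructor is the only addition beyond what the suggestion already shows:

public enum HashStrategy {
  MURMUR3_X64_128(0);

  // The serialized value is pinned here instead of relying on ordinal(),
  // so adding or reordering constants cannot change what gets written,
  // and the value can be checked against parquet-cpp by inspection.
  public final int encoding;

  HashStrategy(int encoding) {
    this.encoding = encoding;
  }
}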

}

/**
* Write the Bloom filter to an output stream. It writes the Bloom filter header includes the

Suggested change
* Write the Bloom filter to an output stream. It writes the Bloom filter header includes the
* Write the Bloom filter to an output stream. It writes the Bloom filter header including the


/**
* Write the Bloom filter to an output stream. It writes the Bloom filter header includes the
* bitset's length in size of byte, the hash strategy, the algorithm, and the bitset.

Suggested change
* bitset's length in size of byte, the hash strategy, the algorithm, and the bitset.
* bitset's length in bytes, the hash strategy, the algorithm, and the bitset.

* @param bloomFilterDistinctNumbers the expected distinct number of values corresponding to columns
* @return this builder for method chaining
*/
public Builder withBloomFilterInfo(String bloomFilterColumnNames, String bloomFilterDistinctNumbers) {

This has no callers. Why is it needed? Can you add a test for it?
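For illustration only, a test could look roughly like this; the ParquetProperties entry point, the comma-separated argument format, and the getBloomFilterInfo() accessor are assumptions rather than code from this patch:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestWithBloomFilterInfo {
  @Test
  public void testWithBloomFilterInfo() {
    // Hypothetical usage: two parallel comma-separated strings for column
    // names and expected distinct counts, read back through a hypothetical
    // getBloomFilterInfo() accessor.
    ParquetProperties props = ParquetProperties.builder()
        .withBloomFilterInfo("id,name", "1000,500")
        .build();

    assertEquals(Long.valueOf(1000L), props.getBloomFilterInfo().get("id"));
    assertEquals(Long.valueOf(500L), props.getBloomFilterInfo().get("name"));
  }
}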

}

abstract ColumnWriterBase createColumnWriter(ColumnDescriptor path, PageWriter pageWriter, ParquetProperties props);
ColumnWriteStoreBase(

This block could use some comments.

) {
this(path, pageWriter, props);

// Current not support nested column.

Suggested change
// Current not support nested column.
// Bloom filters don't support nested columns yet; see PARQUET-????.

public void testConstructor () throws IOException {
BloomFilter bloomFilter1 = new BlockSplitBloomFilter(0);
assertEquals(bloomFilter1.getBitsetSize(), BlockSplitBloomFilter.MINIMUM_BLOOM_FILTER_BYTES);
BloomFilter bloomFilter2 = new BlockSplitBloomFilter(256 * 1024 * 1024);

Suggested change
BloomFilter bloomFilter2 = new BlockSplitBloomFilter(256 * 1024 * 1024);
BloomFilter bloomFilter2 = new BlockSplitBloomFilter(BlockSplitBloomFilter.MAXIMUM_BLOOM_FILTER_BYTES + 1);

// The underlying byte array for Bloom filter bitset.
private byte[] bitset;

// A integer array buffer of underlying bitset to help setting bits.

Can you use JMeter to check if this is taking up 2x the memory because you use both bitset and intBuffer?

If so, and if removing one of bitset and intBuffer makes writing the output a little slower, I still think only one should be present.

Contributor Author

The internal heap buffer pointer of the ByteBuffer points to the address of bitset; according to the API definition it doesn't allocate a new buffer. I can do a JMeter check later.

Great! Please do ping this when it is complete.

Contributor Author

I just got some time for this. JMeter doesn't seem suited to showing the memory layout of the Java heap, so I wrote an example application and used jmap to dump the heap. The code snippet looks like:

        byte[] bitset = new byte[1024 * 1024];                       // 1 MB backing array
        IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer(); // int view over the same array
        ByteBuffer byteBuffer = ByteBuffer.allocate(1024 * 1024);    // separate 1 MB allocation
        byte[] bitset2 = new byte[1024 * 1024];                      // another separate 1 MB allocation

The Eden space in the heap dumps after each statement is:

Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 4739624 (4.520057678222656MB)
free = 61320664 (58.479942321777344MB)
7.17469472733755% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 5788216 (5.520072937011719MB)
free = 60272072 (57.47992706298828MB)
8.762020534939236% used

The Eden space does not increase when we call the wrap API of ByteBuffer.


/**
* Constructor of Bloom filter. It uses murmur3_x64_128 as its default hash
* function and block-based algorithm as its default algorithm.

You removed the comment above about the block algorithm being the default; now please follow through and remove it here and in the parameter lists. BlockSplitBloomFilter supports exactly one BF algorithm, so there is little utility in letting users specify one only to find it is not respected.

return getEnableDictionary(getConfiguration(jobContext));
}

public static String getBloomFilterColumnNames(Configuration conf) {

Why separate getBloomFilterColumnNames from getBloomFilterExpectedNDV, and why use Strings? Why not just directly return a map and have a single method, getBloomFilterColumnExpectedNDVs?
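For illustration, a single map-returning accessor along the suggested lines could look like this; the property keys and the comma-separated encoding are assumptions, not part of this patch:

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

// Sketch of the suggested getBloomFilterColumnExpectedNDVs; property names are hypothetical.
public static Map<String, Long> getBloomFilterColumnExpectedNDVs(Configuration conf) {
  Map<String, Long> ndvs = new LinkedHashMap<>();
  String names = conf.get("parquet.bloom.filter.column.names", "");
  String counts = conf.get("parquet.bloom.filter.expected.ndv", "");
  if (names.isEmpty() || counts.isEmpty()) {
    return ndvs;
  }
  String[] nameArray = names.split(",");
  String[] countArray = counts.split(",");
  for (int i = 0; i < Math.min(nameArray.length, countArray.length); i++) {
    ndvs.put(nameArray[i].trim(), Long.parseLong(countArray[i].trim()));
  }
  return ndvs;
}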

}

abstract ColumnWriterBase createColumnWriter(ColumnDescriptor path, PageWriter pageWriter, ParquetProperties props);
// The Bloom filter is written to a specified bitset instead of pages. So it needs a separated write store abstract.

Suggested change
// The Bloom filter is written to a specified bitset instead of pages. So it needs a separated write store abstract.
// The Bloom filter is written to a specified bitset instead of pages, so it needs a separate write store abstract.

@chenjunjiedada chenjunjiedada changed the base branch from master to bloom-filter December 1, 2018 11:06
@chenjunjiedada chenjunjiedada changed the base branch from bloom-filter to master December 1, 2018 11:08
@chenjunjiedada
Contributor Author

Hi @jbapple-cloudera
I have updated the two remaining comments, could you please have a look? If there are no more comments, I will change the PR base to the bloom-filter branch to follow the feature-branch flow. That needs some cherry-picking and might lose some review comments. Is that OK?

@chenjunjiedada
Contributor Author

Hi @jbapple-cloudera, I replied and updated the code for all comments in this PR, and also created a new PR based on the bloom-filter branch. You can comment on that one if you wish.

@gszadovszky
Contributor

It seems the work was moved to the feature branch (#587). I would suggest closing this one if no longer used.

@chenjunjiedada
Contributor Author

@gszadovszky, just closed.
