-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1328: Add Bloom filter reader and writer #521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @rdblue and @jbapple-cloudera Do you want me to separate this into just reader and writer parts? |
| package org.apache.parquet.cli.util; | ||
|
|
||
| import com.google.common.base.Objects; | ||
| import com.google.common.base.MoreObjects; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep changes like this in a different patch. Every patch should have a single purpose.
| private final ByteBufferAllocator allocator; | ||
| private final ValuesWriterFactory valuesWriterFactory; | ||
| private final boolean enableBloomFilter; | ||
| private final HashMap<String, Long> bloomFilterInfo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please be more specific: what info?
| /** | ||
| * Set Bloom filter info for columns. | ||
| * | ||
| * @param names the columns to be enable for Bloom filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "enabled"
| * Set Bloom filter info for columns. | ||
| * | ||
| * @param names the columns to be enable for Bloom filter | ||
| * @param sizes the sizes corresponding to columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you measure "size"? Do you mean the number of distinct values?
| * @param sizes the sizes corresponding to columns | ||
| * @return this builder for method chaining | ||
| */ | ||
| public Builder withBloomFilterInfo(String names, String sizes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not List<Column> where class Column { String name; long countDistinct; } or maybe List<String> names, List<Long> sizes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to use List here, we have to parse the string to List early. It needs a copy from string array to List.
| import java.nio.IntBuffer; | ||
|
|
||
| /** | ||
| * A Bloom filter is a compact structure to indicate whether an item is not in a set or probably |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we made a number of grammatical fixes to the parquet-cpp prose in the BF PR. Can you copy those to this patch, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(complete, thank you!)
| // The underlying byte array for Bloom filter bitset. | ||
| private byte[] bitset; | ||
|
|
||
| // A integer array buffer of underlying bitset to help setting bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use an array of ints and don't use a byte array at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use the int array, we have to afford one more copy when writing to OutputStream. I haven't find a way to convert int array to byte array without copy in java.
| private byte[] bitset; | ||
|
|
||
| // A integer array buffer of underlying bitset to help setting bits. | ||
| private IntBuffer intBuffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use JMeter to check if this is taking up 2x the memory because you use both bitset and intBuffer?
| * instruction. | ||
| */ | ||
|
|
||
| public class BloomFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class deserves some tests.
| * @param hashStrategy The hash strategy of Bloom filter. | ||
| * @param algorithm The algorithm of Bloom filter. | ||
| */ | ||
| private BloomFilter(int numBytes, HashStrategy hashStrategy, Algorithm algorithm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like not just the prose, but also the code of this patch did not take into account all the feedback on the C++ code: apache/parquet-cpp#432
I'm going to pause my review for a moment so as not to overwhelm with repeated feedback and give you time to incorporate that feedback. This line made me notice the difference because there was a long discussion on the other patch about how to incorporate Algorithm algorithm into the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jbapple-cloudera for reviewing.
This patch makes some confusion here. It is just a early patch to show how a Bloom filter is integrated with ParquetFileReader and ParuqetFileWriter as you mentioned in vote email. With this we can commit parquet-format patch firstly. Since without parquet-format patch, the reader and writer implementation cannot pass the build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me.
|
@cjjnjust , now that parquet-format has the format change, are you ready to edit and proceed, or is there something else that should be done first? |
|
@jbapple-cloudera I 'm refreshing reader/writer today and will upload a bit later. |
3c878c8 to
1a0875b
Compare
|
@jbapple-cloudera Looks like I have to rebase this firstly if I want to use PR. Since we have a new separated branch, we can also utilize that branch to open a new one. Which one do you prefer? I saw the parquet sync note says to start a vote on design. Not sure that means we need to wait on the vote I started yet? But I think we can proceed this in parallel as well. |
| import java.nio.IntBuffer; | ||
|
|
||
| /** | ||
| * A Bloom filter is a compact structure to indicate whether an item is not in a set or probably |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(complete, thank you!)
| * in a set. The Bloom filter usually consists of a bit set that represents a elements set, | ||
| * a hash strategy and a Bloom filter algorithm. | ||
| */ | ||
| public abstract class BloomFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why abstract class, rather than interface?
| public abstract class BloomFilter { | ||
| // Bloom filter Hash strategy. | ||
| public enum HashStrategy { | ||
| MURMUR3_X64_128, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| MURMUR3_X64_128, | |
| MURMUR3_X64_128(0) | |
| public final int encoding; |
ordinal() caled below to serialize, is not guaranteed to be stable or equal to the ordering in parquet-cpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(0) is not valid if Bloom filter change to interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then in that case I understand why it should be an abstract class.
We need some way to ensure that the enum values are the same in Java and C++. That's what I meant when I said "ordinal() caled below to serialize, is not guaranteed to be stable or equal to the ordering in parquet-cpp."
| } | ||
|
|
||
| /** | ||
| * Write the Bloom filter to an output stream. It writes the Bloom filter header includes the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Write the Bloom filter to an output stream. It writes the Bloom filter header includes the | |
| * Write the Bloom filter to an output stream. It writes the Bloom filter header including the |
|
|
||
| /** | ||
| * Write the Bloom filter to an output stream. It writes the Bloom filter header includes the | ||
| * bitset's length in size of byte, the hash strategy, the algorithm, and the bitset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * bitset's length in size of byte, the hash strategy, the algorithm, and the bitset. | |
| * bitset's length in bytes, the hash strategy, the algorithm, and the bitset. |
| * @param bloomFilterDistinctNumbers the expected distinct number of values corresponding to columns | ||
| * @return this builder for method chaining | ||
| */ | ||
| public Builder withBloomFilterInfo(String bloomFilterColumnNames, String bloomFilterDistinctNumbers) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has no callers. Why is it needed? Can you add a test for it?
| } | ||
|
|
||
| abstract ColumnWriterBase createColumnWriter(ColumnDescriptor path, PageWriter pageWriter, ParquetProperties props); | ||
| ColumnWriteStoreBase( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block could use some comments.
| ) { | ||
| this(path, pageWriter, props); | ||
|
|
||
| // Current not support nested column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Current not support nested column. | |
| // Bloom filters don't support nested columns yet; see PARQUET-????. |
| public void testConstructor () throws IOException { | ||
| BloomFilter bloomFilter1 = new BlockSplitBloomFilter(0); | ||
| assertEquals(bloomFilter1.getBitsetSize(), BlockSplitBloomFilter.MINIMUM_BLOOM_FILTER_BYTES); | ||
| BloomFilter bloomFilter2 = new BlockSplitBloomFilter(256 * 1024 * 1024); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| BloomFilter bloomFilter2 = new BlockSplitBloomFilter(256 * 1024 * 1024); | |
| BloomFilter bloomFilter2 = new BlockSplitBloomFilter(BlockSplitBloomFilter, MAXIMUM_BLOOM_FILTER_BYTES + 1); |
| // The underlying byte array for Bloom filter bitset. | ||
| private byte[] bitset; | ||
|
|
||
| // A integer array buffer of underlying bitset to help setting bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use JMeter to check if this is taking up 2x the memory because you use both bitset and intBuffer?
If so, and if removing one of bitset and intBuffer makes writing the output a little slower, I still think only one should be present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The internal heap buffer pointer of ByteBuffer points to address of bitset, it doesn't allocate buffer according to API definition. I can do JMeter check later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Please do ping this when it is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just get some time for this. JMeter seems not suit for showing the memory layout for java heap. I wrote a example application and use jmap to dump the heap. The code snippet is like:
byte [] bitset = new byte[1024*1024];
IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
ByteBuffer byteBuffer = ByteBuffer.allocate(1024*1024);
byte[] bitset2 = new byte[1024*1024];The Eden space in heap dumps after every sentence are:
Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used
Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used
Eden Space:
capacity = 66060288 (63.0MB)
used = 4739624 (4.520057678222656MB)
free = 61320664 (58.479942321777344MB)
7.17469472733755% used
Eden Space:
capacity = 66060288 (63.0MB)
used = 5788216 (5.520072937011719MB)
free = 60272072 (57.47992706298828MB)
8.762020534939236% used
The Eden space does not increase when we call the wrap API of ByteBuffer.
| // The underlying byte array for Bloom filter bitset. | ||
| private byte[] bitset; | ||
|
|
||
| // A integer array buffer of underlying bitset to help setting bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Please do ping this when it is complete.
|
|
||
| /** | ||
| * Constructor of Bloom filter. It uses murmur3_x64_128 as its default hash | ||
| * function and block-based algorithm as its default algorithm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You removed the comment above about the block algorithm being the default, now please follow-through and remove it here and in the parameter lists - BlockSplitBloomFilter supports exactly one BF algorithm, so it's not of high utility to allow users to try and specify one only to find it is not respected.
| return getEnableDictionary(getConfiguration(jobContext)); | ||
| } | ||
|
|
||
| public static String getBloomFilterColumnNames(Configuration conf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why separate getBloomFilterColumnNames from getBloomFilterExpectedNDV, and why use Strings? Why not just directly return a map and have a single method, getBloomFilterColumnExpectedNDVs?
| } | ||
|
|
||
| abstract ColumnWriterBase createColumnWriter(ColumnDescriptor path, PageWriter pageWriter, ParquetProperties props); | ||
| // The Bloom filter is written to a specified bitset instead of pages. So it needs a separated write store abstract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // The Bloom filter is written to a specified bitset instead of pages. So it needs a separated write store abstract. | |
| // The Bloom filter is written to a specified bitset instead of pages, so it needs a separate write store abstract. |
|
Hi @jbapple-cloudera |
|
Hi @jbapple-cloudera, I replied and updated code for all comments in this PR and also created a new PR which is based on the bloom-filter branch. You can comment on that one if you wish. |
|
It seems the work was moved to the feature branch (#587). I would suggest closing this one if no longer used. |
|
@gszadovszky , just closed. |
This is an early PR for Bloom filter reader and writer, which is used to discovery some potential concerns for parquet-format and Bloom filter spec.