PARQUET-1328: Add Bloom filter reader and writer #521

chenjunjiedada · 2018-09-08T15:22:19Z

This is an early PR for the Bloom filter reader and writer, intended to uncover potential concerns with parquet-format and the Bloom filter spec.

chenjunjiedada · 2018-10-03T07:13:01Z

Hi @rdblue and @jbapple-cloudera

Do you want me to separate this into just reader and writer parts?

jbapple-cloudera · 2018-10-06T22:50:57Z

parquet-cli/src/main/java/org/apache/parquet/cli/util/Expressions.java

 package org.apache.parquet.cli.util;

-import com.google.common.base.Objects;
+import com.google.common.base.MoreObjects;


Please keep changes like this in a different patch. Every patch should have a single purpose.

jbapple-cloudera · 2018-10-06T22:53:20Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

  private final ByteBufferAllocator allocator;
  private final ValuesWriterFactory valuesWriterFactory;
+  private final boolean enableBloomFilter;
+  private final HashMap<String, Long> bloomFilterInfo;


Please be more specific: what info?

jbapple-cloudera · 2018-10-06T22:54:49Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

+    /**
+     * Set Bloom filter info for columns.
+     *
+     * @param names the columns to be enable for Bloom filter


nit: "enabled"

jbapple-cloudera · 2018-10-06T22:55:16Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

+     * Set Bloom filter info for columns.
+     *
+     * @param names the columns to be enable for Bloom filter
+     * @param sizes the sizes corresponding to columns


How do you measure "size"? Do you mean the number of distinct values?

jbapple-cloudera · 2018-10-06T22:58:01Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

+     * @param sizes the sizes corresponding to columns
+     * @return this builder for method chaining
+     */
+    public Builder withBloomFilterInfo(String names, String sizes) {


Why not List<Column> where class Column { String name; long countDistinct; } or maybe List<String> names, List<Long> sizes?

If we want to use a List here, we have to parse the string into a List early, which requires a copy from the string array to the List.
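The typed alternative suggested above might look like the following minimal sketch. `BloomFilterColumn` and `toExpectedNdvMap` are hypothetical names used only for illustration; they are not part of this patch or of parquet-mr.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value class pairing a column name with its expected number of
// distinct values. Not part of the patch.
final class BloomFilterColumn {
  final String name;
  final long countDistinct;

  BloomFilterColumn(String name, long countDistinct) {
    if (countDistinct <= 0) {
      throw new IllegalArgumentException("countDistinct must be positive: " + countDistinct);
    }
    this.name = name;
    this.countDistinct = countDistinct;
  }

  // Collapses a list of columns into a name -> expected-NDV map, rejecting duplicates.
  static Map<String, Long> toExpectedNdvMap(List<BloomFilterColumn> columns) {
    Map<String, Long> result = new LinkedHashMap<>();
    for (BloomFilterColumn column : columns) {
      if (result.put(column.name, column.countDistinct) != null) {
        throw new IllegalArgumentException("duplicate column: " + column.name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<BloomFilterColumn> columns = Arrays.asList(
        new BloomFilterColumn("user_id", 1_000_000L),
        new BloomFilterColumn("country", 200L));
    System.out.println(toExpectedNdvMap(columns));
  }
}
```

A typed parameter like this moves validation to construction time, at the cost of the early parse and copy mentioned in the reply.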

jbapple-cloudera · 2018-10-06T23:29:42Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+import java.nio.IntBuffer;
+
+/**
+ * A Bloom filter is a compact structure to indicate whether an item is not in a set or probably


Nit: we made a number of grammatical fixes to the parquet-cpp prose in the BF PR. Can you copy those to this patch, please?

(complete, thank you!)

jbapple-cloudera · 2018-10-07T00:36:34Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+  // The underlying byte array for Bloom filter bitset.
+  private byte[] bitset;
+
+  // A integer array buffer of underlying bitset to help setting bits.


Why not just use an array of ints and don't use a byte array at all?

If we use the int array, we have to pay for one more copy when writing to the OutputStream. I haven't found a way to convert an int array to a byte array without a copy in Java.
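The extra copy described here can be sketched as follows: with an `int[]` as the only storage, writing to an `OutputStream` goes through a temporary `byte[]`. The class and method names are hypothetical, and little-endian byte order is an assumption made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only.
final class IntArrayBitsetWriter {
  // Serializes an int[] bitset. OutputStream only accepts bytes, so the ints
  // are first copied into a temporary byte[] of the same total size.
  static void write(int[] words, OutputStream out) throws IOException {
    ByteBuffer copy = ByteBuffer.allocate(words.length * Integer.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN); // assumed byte order
    copy.asIntBuffer().put(words);       // this is the extra copy
    out.write(copy.array());
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    write(new int[] {1, 0x01020304}, out);
    System.out.println(out.size() + " bytes written");
  }
}
```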

jbapple-cloudera · 2018-10-07T00:36:40Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+  private byte[] bitset;
+
+  // A integer array buffer of underlying bitset to help setting bits.
+  private IntBuffer intBuffer;


Can you use JMeter to check if this is taking up 2x the memory because you use both bitset and intBuffer?

jbapple-cloudera · 2018-10-07T00:37:35Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+ * instruction.
+ */
+
+public class BloomFilter {


This class deserves some tests.

jbapple-cloudera · 2018-10-07T00:51:44Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+   * @param hashStrategy The hash strategy of Bloom filter.
+   * @param algorithm The algorithm of Bloom filter.
+   */
+  private BloomFilter(int numBytes, HashStrategy hashStrategy, Algorithm algorithm) {


It looks like not just the prose, but also the code of this patch did not take into account all the feedback on the C++ code: apache/parquet-cpp#432

I'm going to pause my review for a moment so as not to overwhelm with repeated feedback and give you time to incorporate that feedback. This line made me notice the difference because there was a long discussion on the other patch about how to incorporate Algorithm algorithm into the constructor.

Thanks @jbapple-cloudera for reviewing.

This patch has caused some confusion. It is just an early patch to show how a Bloom filter is integrated with ParquetFileReader and ParquetFileWriter, as you mentioned in the vote email. With this we can commit the parquet-format patch first, since without the parquet-format patch the reader and writer implementation cannot pass the build.

That works for me.

jbapple-cloudera · 2018-10-20T15:08:46Z

@cjjnjust , now that parquet-format has the format change, are you ready to edit and proceed, or is there something else that should be done first?

chenjunjiedada · 2018-10-20T16:25:23Z

@jbapple-cloudera I'm refreshing the reader/writer today and will upload it a bit later.

chenjunjiedada · 2018-10-20T17:42:14Z

@jbapple-cloudera It looks like I have to rebase this first if I want to keep using this PR. Since we have a new separate branch, we could also use that branch to open a new one. Which do you prefer?

I saw the parquet sync notes say to start a vote on the design. I'm not sure whether that means we need to wait on the vote I already started, but I think we can proceed with this in parallel as well.

jbapple-cloudera · 2018-10-24T13:26:16Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+import java.nio.IntBuffer;
+
+/**
+ * A Bloom filter is a compact structure to indicate whether an item is not in a set or probably


(complete, thank you!)

jbapple-cloudera · 2018-10-24T13:29:50Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+ * in a set. The Bloom filter usually consists of a bit set that represents a elements set,
+ * a hash strategy and a Bloom filter algorithm.
+ */
+public abstract class BloomFilter {


Why abstract class, rather than interface?

jbapple-cloudera · 2018-10-24T13:34:01Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+public abstract class BloomFilter {
+  // Bloom filter Hash strategy.
+  public enum HashStrategy {
+    MURMUR3_X64_128,


Suggested change

MURMUR3_X64_128,

MURMUR3_X64_128(0)

public final int encoding;

ordinal(), called below to serialize, is not guaranteed to be stable or equal to the ordering in parquet-cpp.

(0) is not valid if Bloom filter changes to an interface.

OK, then in that case I understand why it should be an abstract class.

We need some way to ensure that the enum values are the same in Java and C++. That's what I meant when I said "ordinal(), called below to serialize, is not guaranteed to be stable or equal to the ordering in parquet-cpp."
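A minimal sketch of the suggestion: give each constant an explicit serialized value and look constants up by that value, so reordering or inserting constants cannot change what is written to the file. The `fromEncoding` helper is a hypothetical addition, and the value `0` is taken from the suggested change rather than from the format spec.

```java
// Enum whose serialized value is fixed explicitly instead of relying on ordinal().
enum HashStrategy {
  MURMUR3_X64_128(0);

  // Value written to the file; must match what other implementations expect.
  public final int encoding;

  HashStrategy(int encoding) {
    this.encoding = encoding;
  }

  // Hypothetical helper: maps a serialized value back to the enum constant.
  static HashStrategy fromEncoding(int encoding) {
    for (HashStrategy strategy : values()) {
      if (strategy.encoding == encoding) {
        return strategy;
      }
    }
    throw new IllegalArgumentException("Unknown hash strategy encoding: " + encoding);
  }

  public static void main(String[] args) {
    System.out.println(fromEncoding(MURMUR3_X64_128.encoding));
  }
}
```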

jbapple-cloudera · 2018-10-24T13:34:42Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+  }
+
+  /**
+   * Write the Bloom filter to an output stream. It writes the Bloom filter header includes the


Suggested change

* Write the Bloom filter to an output stream. It writes the Bloom filter header includes the

* Write the Bloom filter to an output stream. It writes the Bloom filter header including the

jbapple-cloudera · 2018-10-24T13:35:00Z

parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java

+
+  /**
+   * Write the Bloom filter to an output stream. It writes the Bloom filter header includes the
+   * bitset's length in size of byte, the hash strategy, the algorithm, and the bitset.


Suggested change

* bitset's length in size of byte, the hash strategy, the algorithm, and the bitset.

* bitset's length in bytes, the hash strategy, the algorithm, and the bitset.

jbapple-cloudera · 2018-10-24T16:56:03Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

+     * @param bloomFilterDistinctNumbers the expected distinct number of values corresponding to columns
+     * @return this builder for method chaining
+     */
+    public Builder withBloomFilterInfo(String bloomFilterColumnNames, String bloomFilterDistinctNumbers) {


This has no callers. Why is it needed? Can you add a test for it?

jbapple-cloudera · 2018-10-24T17:28:03Z

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java

  }

-  abstract ColumnWriterBase createColumnWriter(ColumnDescriptor path, PageWriter pageWriter, ParquetProperties props);
+  ColumnWriteStoreBase(


This block could use some comments.

jbapple-cloudera · 2018-10-24T17:31:22Z

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java

+  ) {
+    this(path, pageWriter, props);
+
+    // Current not support nested column.


Suggested change

// Current not support nested column.

// Bloom filters don't support nested columns yet; see PARQUET-????.

jbapple-cloudera · 2018-10-28T19:18:09Z

...mn/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java

+  public void testConstructor () throws IOException {
+    BloomFilter bloomFilter1 = new BlockSplitBloomFilter(0);
+    assertEquals(bloomFilter1.getBitsetSize(), BlockSplitBloomFilter.MINIMUM_BLOOM_FILTER_BYTES);
+    BloomFilter bloomFilter2 = new BlockSplitBloomFilter(256 * 1024 * 1024);


Suggested change

BloomFilter bloomFilter2 = new BlockSplitBloomFilter(256 * 1024 * 1024);

BloomFilter bloomFilter2 = new BlockSplitBloomFilter(BlockSplitBloomFilter.MAXIMUM_BLOOM_FILTER_BYTES + 1);
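The behavior this test exercises can be sketched as a clamp of the requested size into a supported range. This assumes oversized requests are lowered to the maximum in the same way undersized ones are raised to the minimum, which the visible test lines do not confirm. The class name and both bounds below are placeholders, not the values used by `BlockSplitBloomFilter`.

```java
// Hypothetical sizing helper, for illustration only.
final class BitsetSizing {
  // Placeholder bounds; the real constants live in BlockSplitBloomFilter.
  static final int MINIMUM_BYTES = 32;
  static final int MAXIMUM_BYTES = 128 * 1024 * 1024;

  // Clamps a requested bitset size into [MINIMUM_BYTES, MAXIMUM_BYTES].
  static int clamp(int requestedBytes) {
    return Math.max(MINIMUM_BYTES, Math.min(MAXIMUM_BYTES, requestedBytes));
  }

  public static void main(String[] args) {
    System.out.println(clamp(0));
    System.out.println(clamp(MAXIMUM_BYTES + 1));
  }
}
```

Testing against `MAXIMUM + 1` rather than a literal keeps the test valid if the maximum is later changed.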

jbapple-cloudera · 2018-10-28T19:24:07Z

...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java

+  // The underlying byte array for Bloom filter bitset.
+  private byte[] bitset;
+
+  // A integer array buffer of underlying bitset to help setting bits.


Can you use JMeter to check if this is taking up 2x the memory because you use both bitset and intBuffer?

If so, and if removing one of bitset and intBuffer makes writing the output a little slower, I still think only one should be present.

The internal heap buffer of the ByteBuffer points to the address of bitset; according to the API definition, wrap does not allocate a new buffer. I can do the JMeter check later.

Great! Please do ping this when it is complete.

I finally got some time for this. JMeter does not seem suited to showing the memory layout of the Java heap, so I wrote an example application and used jmap to dump the heap. The code snippet is:

byte[] bitset = new byte[1024*1024];
IntBuffer intBuffer = ByteBuffer.wrap(bitset).asIntBuffer();
ByteBuffer byteBuffer = ByteBuffer.allocate(1024*1024);
byte[] bitset2 = new byte[1024*1024];

The Eden space figures from the heap dumps taken after each statement are:

Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 3691032 (3.5200424194335938MB)
free = 62369256 (59.479957580566406MB)
5.587368919735863% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 4739624 (4.520057678222656MB)
free = 61320664 (58.479942321777344MB)
7.17469472733755% used

Eden Space:
capacity = 66060288 (63.0MB)
used = 5788216 (5.520072937011719MB)
free = 60272072 (57.47992706298828MB)
8.762020534939236% used

The Eden space does not increase when we call the wrap API of ByteBuffer.
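The same conclusion can be checked in code: an `IntBuffer` obtained through `ByteBuffer.wrap` is a view over the original array, so writes through the view show up in the `byte[]`. This is a small sketch with a hypothetical class name; the little-endian order is an assumption chosen to make the byte positions easy to follow.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical demo class, for illustration only.
final class WrappedBitsetView {
  // Returns an int view over the given byte array without copying it.
  static IntBuffer intView(byte[] bitset) {
    return ByteBuffer.wrap(bitset).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer();
  }

  public static void main(String[] args) {
    byte[] bitset = new byte[32];
    IntBuffer words = intView(bitset);
    words.put(0, 1 << 9);          // set bit 9 of the first 32-bit word
    System.out.println(bitset[1]); // the write is visible in the byte array
  }
}
```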

jbapple-cloudera · 2018-10-31T04:05:15Z

...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java

+  // The underlying byte array for Bloom filter bitset.
+  private byte[] bitset;
+
+  // A integer array buffer of underlying bitset to help setting bits.


Great! Please do ping this when it is complete.

jbapple-cloudera · 2018-10-31T04:07:32Z

...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java

+
+  /**
+   * Constructor of Bloom filter. It uses murmur3_x64_128 as its default hash
+   * function and block-based algorithm as its default algorithm.


You removed the comment above about the block algorithm being the default, now please follow-through and remove it here and in the parameter lists - BlockSplitBloomFilter supports exactly one BF algorithm, so it's not of high utility to allow users to try and specify one only to find it is not respected.

jbapple-cloudera · 2018-10-31T04:09:59Z

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java

    return getEnableDictionary(getConfiguration(jobContext));
  }

+  public static String getBloomFilterColumnNames(Configuration conf) {


Why separate getBloomFilterColumnNames from getBloomFilterExpectedNDV, and why use Strings? Why not just directly return a map and have a single method, getBloomFilterColumnExpectedNDVs?
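A single map-returning accessor along these lines might look like the sketch below. Plain strings stand in for the values that would be read from the Hadoop `Configuration`, the comma-separated format is an assumption, and `BloomFilterSettings` is a hypothetical class name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class BloomFilterSettings {
  // Parses two comma-separated settings, column names and expected NDVs,
  // into one column -> expected-NDV map.
  static Map<String, Long> getBloomFilterColumnExpectedNDVs(String columnNames, String expectedNdvs) {
    Map<String, Long> result = new LinkedHashMap<>();
    if (columnNames == null || columnNames.trim().isEmpty()) {
      return result; // no Bloom filter columns configured
    }
    String[] names = columnNames.split(",");
    String[] ndvs = expectedNdvs == null ? new String[0] : expectedNdvs.split(",");
    if (names.length != ndvs.length) {
      throw new IllegalArgumentException(
          "Got " + names.length + " column names but " + ndvs.length + " NDV values");
    }
    for (int i = 0; i < names.length; i++) {
      result.put(names[i].trim(), Long.parseLong(ndvs[i].trim()));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(getBloomFilterColumnExpectedNDVs("user_id, country", "1000000, 200"));
  }
}
```

Returning one map means callers cannot end up with a name list and a size list of different lengths.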

jbapple-cloudera · 2018-10-31T04:11:07Z

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java

  }

-  abstract ColumnWriterBase createColumnWriter(ColumnDescriptor path, PageWriter pageWriter, ParquetProperties props);
+  // The Bloom filter is written to a specified bitset instead of pages. So it needs a separated write store abstract.


Suggested change

// The Bloom filter is written to a specified bitset instead of pages. So it needs a separated write store abstract.

// The Bloom filter is written to a specified bitset instead of pages, so it needs a separate write store abstract.

chenjunjiedada · 2018-12-01T12:43:41Z

Hi @jbapple-cloudera
I have addressed the two remaining comments; could you please have a look? If there are no more comments, I will change the PR base to the bloom-filter branch to follow the feature-branch flow. That needs some cherry-picking and might lose some review comments. Is that OK?

chenjunjiedada · 2019-01-10T15:42:30Z

Hi @jbapple-cloudera, I replied and updated code for all comments in this PR and also created a new PR which is based on the bloom-filter branch. You can comment on that one if you wish.

gszadovszky · 2019-01-14T08:42:29Z

It seems the work was moved to the feature branch (#587). I would suggest closing this one if no longer used.

chenjunjiedada · 2019-01-14T08:47:58Z

@gszadovszky , just closed.

PARQUET-1328: Add Bloom filter reader and writer

81c3063

jbapple-cloudera suggested changes Oct 7, 2018

Align to parquet-cpp side code and address comments

1a0875b

chenjunjiedada force-pushed the PARQUET-1328 branch from 3c878c8 to 1a0875b October 20, 2018 17:27

chenjunjiedada and others added 3 commits October 21, 2018 21:29

Rebase to latest master

e3991ee

Merge branch 'master' into PARQUET-1328

05aac07

Fix conflicts after rebase and merge

b8a0f5c

jbapple-cloudera suggested changes Oct 28, 2018

address comments

1b646a9

jbapple-cloudera suggested changes Oct 31, 2018

address comments and fix enum issue

f03d875

chenjunjiedada mentioned this pull request Nov 7, 2018

PARQUET-1342:Add bloom filter utility class #425

Closed

chenjunjiedada changed the base branch from master to bloom-filter December 1, 2018 11:06

chenjunjiedada changed the base branch from bloom-filter to master December 1, 2018 11:08

Merge remote-tracking branch 'official/master' into PARQUET-1328

4fcd761

Chen, Junjie and others added 3 commits December 25, 2018 11:05

Fix build issue caused by merge

5e4647f

test build

894040d

update check for Bloom filter reader

fb0ab5c

chenjunjiedada closed this Jan 14, 2019
