PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly #1023
Conversation
Commits included in this branch:
- (cherry picked from commit 2ce35c7)
- Unit test: updated ParquetWriter to support setting the row group size as a long; removed Xmx settings in the pom to allow more memory for the tests. Co-authored-by: Gabor Szadovszky <[email protected]>
- Use a try-with-resources statement for ParquetFileReader to call close explicitly (apache#913)
- PARQUET-2078: Failed to read parquet file after writing with the same parquet version (apache#925). Read-path fix that uses RowGroup[n].file_offset = RowGroup[n-1].file_offset + RowGroup[n-1].total_compressed_size; adds more checks on the writer side; takes alignment padding and summary files into account; only throws an exception when the footer (first column of the block metadata) is encrypted and file_offset is corrupted; only checks firstColumnChunk.isSetMeta_data() for the first block; also checks the first row group's file_offset (SPARK-36696); uses Preconditions.checkState instead of assert in the write path.
- Fail the build if classes are used from non-direct dependencies; only classes from direct dependencies shall be used. Also fixed some references that broke this rule.
- This reverts commit 261e320.
- (cherry picked from commit 1695d92)
- This reverts commit 0f6fc7f.
wgtmac
left a comment
Thanks @yabola for the fix! The idea of the patch is good and the algorithm should be correct (at least I cannot come up with a counter example yet). However, we still need to be careful just in case. My main concern is the test coverage. In addition, we may also need an option to toggle it off just in case.
```java
boolean drop = false;
// Whether one filter can exactly determine the existence/nonexistence of the value.
// If true then we can skip the remaining filters to save time and space.
AtomicBoolean canExactlyDetermine = new AtomicBoolean(false);
```
Why atomic?
It used to be for the convenience of fetching the returned results, but I will change my code in another implementation later.
```java
private <T extends Comparable<T>> void markCanExactlyDetermine(Set<T> dictSet) {
  if (dictSet == null) {
    canExactlyDetermine = false;
```
It seems that canExactlyDetermine should use OR to update its value. Otherwise, any predicate with a null dict will set it to false even if previous predicates have marked it to true.
Additionally, we may have a chance to shortcut the evaluation as well if any predicate has set it to true.
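A minimal sketch of the OR-update and the short-circuit described above, using the field and method names from the quoted diff (the wrapper class is hypothetical, not the PR's actual code):

```java
import java.util.Set;

class ExactDetermination {
  private boolean canExactlyDetermine = false;

  // OR semantics: a later predicate with a null dictionary cannot erase
  // an earlier positive decision; the flag only flips from false to true.
  <T extends Comparable<T>> void markCanExactlyDetermine(Set<T> dictSet) {
    canExactlyDetermine = canExactlyDetermine || (dictSet != null);
  }

  // Once true, the remaining filters can be skipped.
  boolean shouldSkipRemainingFilters() {
    return canExactlyDetermine;
  }
}
```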
```java
@Test
public void testCanSkipOtherFilters() {
```
The test looks a little insufficient. More kinds of predicates and compound predicates need to be covered. A test of RowGroupFilter is also missing.
I will add more UT
```java
boolean drop = false;
// Whether one filter can exactly determine the existence/nonexistence of the value.
// If true then we can skip the remaining filters to save time and space.
AtomicBoolean canExactlyDetermine = new AtomicBoolean(false);
```
I'd suggest renaming canExactlyDetermine to preciselyDetermined. Or even better, use an enum, something like below:
```java
enum PredicateEvaluation {
  CAN_DROP,    /* the block can be dropped for sure */
  CANNOT_DROP, /* the block cannot be dropped for sure */
  MAY_DROP,    /* cannot decide yet, may be dropped by other filter levels */
}
```
In this way, we can merge the two boolean values here. The downside is that the code may need more refactoring to add the enum value to different filter classes.
I will change my implementation.
Unfortunately we cannot modify the signature of any public methods. My suggestion was to make the new enum serve as an internal state of the visitor (and probably use it to terminate evaluation early), then add a new method to return the final state. Does that work?
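A hedged sketch of that internal-state design; the class and method names are illustrative, not the actual parquet-mr API:

```java
class ExampleStatsVisitor {
  enum PredicateEvaluation {
    CAN_DROP,    // the block can be dropped for sure
    CANNOT_DROP, // the block cannot be dropped for sure
    MAY_DROP     // undecided, later filter levels may still drop it
  }

  // Internal state only: the public Visitor<Boolean> signature is untouched.
  private PredicateEvaluation state = PredicateEvaluation.MAY_DROP;

  // The visit methods would call this; once the state is exact,
  // evaluation can terminate early.
  void record(PredicateEvaluation result) {
    if (state == PredicateEvaluation.MAY_DROP) {
      state = result;
    }
  }

  /** The new accessor returning the final evaluation state. */
  PredicateEvaluation finalState() {
    return state;
  }
}
```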
+1, let's not modify the signature.
@wgtmac @shangxinli I thought of a way to avoid modifying the interface, distinguishing the results by Boolean object identity. Please take a look.
Emmmm, if this way is not suitable, I can use an internal variable of the filter to record it and keep compatibility.
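For context, a hedged reconstruction of this Boolean-identity approach based on the diffs quoted later in the thread; the value boxed by BLOCK_MUST_MATCH is an assumption:

```java
public final class PredicateEvaluation {
  // Sentinel Boolean instances compared by identity (==): three states fit
  // through the unchanged Visitor<Boolean> signature. Deliberately not
  // Boolean.valueOf, which would collapse them into the shared caches.
  public static final Boolean BLOCK_MIGHT_MATCH = new Boolean(false);
  public static final Boolean BLOCK_CANNOT_MATCH = new Boolean(true);
  public static final Boolean BLOCK_MUST_MATCH = new Boolean(false);

  /** True once the evaluation reached one of the two exact sentinels. */
  public static Boolean isDeterminedPredicate(Boolean predicate) {
    return predicate == BLOCK_CANNOT_MATCH || predicate == BLOCK_MUST_MATCH;
  }
}
```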
wgtmac
left a comment
The new methods look reasonable to me. Could you please add more tests?
```java
import org.apache.parquet.filter2.predicate.Operators;

/**
 * Used in Filters to mark whether the block data matches the condition.
```
We'd better explain explicitly that the evaluation decision is whether to drop the row group. That's why BLOCK_MIGHT_MATCH is false and the AND expression uses || in the implementation. This is counter-intuitive at first glance.
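A hedged sketch of the kind of class-level comment being requested (illustrative wording, not the merged text):

```java
/**
 * Illustrative wording only: the Boolean these visitors return answers
 * "can this row group be dropped?", not "does the data match?".
 * That is why BLOCK_MIGHT_MATCH is false (we cannot drop yet) and why
 * the AND of two predicates combines the drop decisions with ||,
 * which looks inverted at first glance.
 */
```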
```java
public static final Boolean BLOCK_CANNOT_MATCH = new Boolean(true);

public static Boolean evaluateAnd(Operators.And and, FilterPredicate.Visitor<Boolean> predicate) {
  Boolean left = and.getLeft().accept(predicate);
```
In the current implementation, the left and right predicates are always evaluated. We can short-circuit the evaluation if left == BLOCK_CANNOT_MATCH and skip evaluating the right predicate. It would save some cost if the right expression reads a dictionary.
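A sketch of the suggested short-circuit, following the evaluateAnd shape quoted above; the trailing merge rule is inferred from diffs later in this thread:

```java
public static Boolean evaluateAnd(Operators.And and, FilterPredicate.Visitor<Boolean> predicate) {
  Boolean left = and.getLeft().accept(predicate);
  if (left == BLOCK_CANNOT_MATCH) {
    // One side already proves the block can be dropped: skip the right
    // predicate entirely, which may avoid a dictionary read.
    return BLOCK_CANNOT_MATCH;
  }
  Boolean right = and.getRight().accept(predicate);
  if (right == BLOCK_CANNOT_MATCH) {
    return BLOCK_CANNOT_MATCH;
  }
  if (left == BLOCK_MUST_MATCH && right == BLOCK_MUST_MATCH) {
    return BLOCK_MUST_MATCH;
  }
  return BLOCK_MIGHT_MATCH;
}
```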
Yes, thanks
```java
public static Boolean evaluateOr(Operators.Or or, FilterPredicate.Visitor<Boolean> predicate) {
  Boolean left = or.getLeft().accept(predicate);
  Boolean right = or.getRight().accept(predicate);
```
ditto
```java
public static Boolean isDeterminedPredicate(Boolean predicate) {
```
Please add some comments to the public methods.
```java
Boolean predicate = BLOCK_MIGHT_MATCH;

if (levels.contains(FilterLevel.STATISTICS)) {

```
Please remove this blank line.
```diff
- if(!drop && levels.contains(FilterLevel.DICTIONARY)) {
-   drop = DictionaryFilter.canDrop(filterPredicate, block.getColumns(), reader.getDictionaryReader(block));
+ if (levels.contains(FilterLevel.DICTIONARY)) {
+
```
ditto
```diff
- private <T extends Comparable<T>> Boolean drop(Set<T> dictSet, Set<T> values) {
+ private <T extends Comparable<T>> Boolean predicate(Set<T> dictSet, Set<T> values) {
```
```suggestion
private <T extends Comparable<T>> Boolean evaluate(Set<T> dictSet, Set<T> values) {
```
Use a verb instead of a noun.
```diff
- return !hasNulls(meta);
+ // so if there are no nulls in this chunk, we can drop it;
+ // if there are nulls in this chunk, we must take it
+ return !hasNulls(meta) ? BLOCK_CANNOT_MATCH : BLOCK_MUST_MATCH;
```
I doubt that hasNulls will always return precise information: when null_count is missing, hasNulls also returns true. Replacing BLOCK_MUST_MATCH with BLOCK_MIGHT_MATCH here makes more sense to me.
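A sketch of the suggested change; the wrapper method is hypothetical, while hasNulls is the helper from the quoted diff:

```java
private Boolean evaluateNullableChunk(ColumnChunkMetaData meta) {
  // hasNulls is conservative: it also returns true when null_count is
  // missing, so "has nulls" is not proof that the block must match.
  // Stay undecided instead of claiming BLOCK_MUST_MATCH.
  return !hasNulls(meta) ? BLOCK_CANNOT_MATCH : BLOCK_MIGHT_MATCH;
}
```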
```java
  return BLOCK_CANNOT_MATCH;
} else {
  // if value > min, we must take it
  return BLOCK_MUST_MATCH;
```
Sometimes the min/max values are actually lower/upper bounds. Does this optimization still work for that case?
I think when Statistics#hasNonNullValue is true, the min/max is generated from the real data content, so it represents the real min/max of the data (when Statistics#hasNonNullValue is false, that case has already been handled earlier).
I think if we can use min/max to decide BLOCK_CANNOT_MATCH, we can also decide BLOCK_MUST_MATCH in some cases.
@wgtmac, are you aware of any implementations where the min/max values of the row group statistics are used this way? Unfortunately, the specification does not say whether the min or max values have to be part of the dataset or not. The safe side would be not to rely on this. (For column index statistics we have defined that the related min/max values do not need to be part of the pages, but that is not relevant here.)
@gszadovszky Thank you for your review.
In the original implementation, BLOCK_CANNOT_MATCH is already decided using min/max.
So if we follow the specification, can we treat min/max as an enlarged range of the data? Then we can still accurately decide the case where the data is not in this range.
In fact, what I wanted to do from the beginning was to avoid the use of the BloomFilter through min/max and the dictionary (if the column has one) as much as possible, because min/max and the dictionary are more accurate, while the BloomFilter may cost time and memory.
Good catch! I am not familiar with the old story. Does format v1 support bloom filters?
Yes, and Spark uses parquet v1 by default.
Good catch indeed, @yabola! Could you open a separate jira and maybe a PR for this finding?
@wgtmac, performance. Let's see the following scenario. We have dictionary encoding but not for all the pages. We also have a Bloom filter. Is it worth reading the dictionary to check whether a value is in there, knowing that if it isn't we still want to check the Bloom filter? I don't know the answer, maybe yes. But if it is a no, then the whole concept of this PR is questionable.
For the case where all the pages are dictionary encoded we should not have Bloom filters, therefore it doesn't really matter whether we return BLOCK_MIGHT_MATCH or BLOCK_MUST_MATCH when we find the interesting values in the dictionary.
Since we might have already written some Bloom filters for fully dictionary encoded column chunks, we should handle this scenario. But we can do that easily by skipping the Bloom filter read in this case completely.
> We have dictionary encoding but not for all the pages. We also have a Bloom filter.

Yes, that's true.

> Is it worth reading the dictionary to check whether a value is in there, knowing that if it isn't we still want to check the Bloom filter?

In this case the dictionary will not be read via expandDictionary(meta) by DictionaryFilter if hasNonDictionaryPages(meta) returns true, so it will not make performance worse. See e.g. https://github.com/apache/parquet-mr/blob/261f7d2679407c833545b56f4c85a4ae8b5c9ed4/parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java#L388

> For the case where all the pages are dictionary encoded we should not have Bloom filters, therefore it doesn't really matter whether we return BLOCK_MIGHT_MATCH or BLOCK_MUST_MATCH when we find the interesting values in the dictionary.

It is difficult to make the trade-off here. If we only have one predicate, then the dictionary will be read anyway, either by the DictionaryFilter or when reading the data later if the row group cannot be dropped. However, if we have other predicates that can drop the row group, then reading the dictionary here by DictionaryFilter is worthless.
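To ground the trade-off, a hedged sketch of an Eq evaluation over the dictionary under this PR's three-state scheme; the method shape is illustrative, while hasNonDictionaryPages and expandDictionary are the DictionaryFilter helpers linked above:

```java
private <T extends Comparable<T>> Boolean evaluateEq(ColumnChunkMetaData meta, T value) throws IOException {
  if (hasNonDictionaryPages(meta)) {
    // Some pages are not dictionary encoded: the dictionary alone can
    // prove nothing, and it is not even read (cheap path).
    return BLOCK_MIGHT_MATCH;
  }
  Set<T> dictSet = expandDictionary(meta);
  if (dictSet == null) {
    return BLOCK_MIGHT_MATCH;
  }
  // All pages are dictionary encoded: the dictionary is exact.
  return dictSet.contains(value) ? BLOCK_MUST_MATCH : BLOCK_CANNOT_MATCH;
}
```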
@wgtmac, @yabola, let me summarize my thoughts because I'm afraid I didn't describe them well before. Please correct me if I'm wrong.
In this PR we are trying to optimize the logic of RowGroupFilter. The problem with the current implementation is that we step forward to the next filter even if the previous one could prove that a value we are searching for is actually (not just possibly) in the row group. The idea is to introduce BLOCK_MUST_MATCH: if it is returned by any of the filters, we do not step forward to the next filter and we add the row group to the list (do not drop it). We currently have 3 row group level filters:
- StatisticsFilter: because of the lower/upper bound issue we cannot really improve this (except for the specific case when min=max).
- DictionaryFilter: we can only improve (?) the case when not all the pages are dictionary encoded, because otherwise we would not have a Bloom filter, so we won't step to the next filter anyway. The dilemma is whether it is worth loading the dictionary (which is potentially large, since not all the values of the column chunk may fit in it) or whether it is better to use the Bloom filter only. (The latter is the current implementation.)
- BloomFilterImpl: by nature we do not have a BLOCK_MUST_MATCH option.
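To make the flow under discussion concrete, a hedged sketch of the RowGroupFilter loop with early exit; the predicate(...) method names follow this PR's renames, and the exact signatures (BloomFilterImpl in particular) are assumptions:

```java
Boolean predicate = BLOCK_MIGHT_MATCH;
if (levels.contains(FilterLevel.STATISTICS)) {
  predicate = StatisticsFilter.predicate(filterPredicate, block.getColumns());
}
// Early exit on any exact answer: CANNOT (drop the block) or MUST
// (keep it without consulting the dictionary or the bloom filter).
if (!isDeterminedPredicate(predicate) && levels.contains(FilterLevel.DICTIONARY)) {
  predicate = DictionaryFilter.predicate(
      filterPredicate, block.getColumns(), reader.getDictionaryReader(block));
}
if (!isDeterminedPredicate(predicate) && levels.contains(FilterLevel.BLOOMFILTER)) {
  predicate = BloomFilterImpl.predicate(
      filterPredicate, block.getColumns(), reader.getBloomFilterDataReader(block));
}
if (predicate != BLOCK_CANNOT_MATCH) {
  candidateRowGroups.add(block); // keep the row group
}
```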
@gszadovszky @wgtmac @zhongyujiang Thank you very much for working on it. I have some thoughts.

> We can only improve (?) the case when not all the pages are dictionary encoded

I can't be sure it is suitable to load the dictionary even when the pages are not all dictionary encoded. (I may choose not to change this behavior.)
However, considering the original BloomFilter bug in parquet v1, we might have to do something to avoid using the BloomFilter (even if all pages are dictionary encoded). In the implementation we may have to use some flag to mark whether DictionaryFilter#expandDictionary succeeded (the method can throw IOException and we can't call expandDictionary again in BloomFilterImpl). Or we could also use BLOCK_MUST_MATCH like this PR does.

> StatisticsFilter: because of the lower/upper bound issue we cannot really improve this (except for the specific case when min=max)

If we only use it when min=max, I think it might not really improve much.
@wgtmac Thanks for your review, I will add UT later.
```java
Boolean b1 = new Boolean(true);
Boolean b2 = new Boolean(true);
boolean b3 = true;
boolean b4 = true;
assert b1 != b2;
assert b1.equals(b2);
assert b2 == b3 == b4;
```
@wgtmac @shangxinli I added a new UT, TestRowGroupFilterExactly; if you have time, please take a look, thanks!
```java
.withRecordFilter(FilterCompat.get(filter)).build();

// simulate the previous behavior, only skip other filters when predicate is BLOCK_CANNOT_MATCH
testEvaluation.setTestExactPredicate(Collections.singletonList(BLOCK_CANNOT_MATCH));
```
> simulate the previous behavior, only skip other filters when predicate is BLOCK_CANNOT_MATCH
```java
public static Boolean evaluateAnd(Operators.And and, FilterPredicate.Visitor<Boolean> predicate) {
  Boolean left = and.getLeft().accept(predicate);
  if (left == BLOCK_CANNOT_MATCH) {
    // seems unintuitive to put an || not an && here but we can
```
The comment does not match the code now.
Thanks for your review, I updated the comments and added more UTs for And/Or.
```java
Boolean right = and.getRight().accept(predicate);
if (right == BLOCK_CANNOT_MATCH) {
  return BLOCK_CANNOT_MATCH;
} else if (left == BLOCK_MUST_MATCH && right == BLOCK_MUST_MATCH) {
```
```suggestion
} else if (left == BLOCK_MUST_MATCH || right == BLOCK_MUST_MATCH) {
```
Neither left nor right is BLOCK_CANNOT_MATCH at this point, so it can return BLOCK_MUST_MATCH if either side is BLOCK_MUST_MATCH.
If left is BLOCK_MUST_MATCH and right is BLOCK_MIGHT_MATCH, left && right should be BLOCK_MIGHT_MATCH, because a later filter may turn right into BLOCK_CANNOT_MATCH, and then we should drop the block.
I added a new UT for this:
in StatisticsFilter, left might match (but cannot match in DictionaryFilter) and right must match -> StatisticsFilter returns might match, DictionaryFilter returns cannot match.
I didn't get your point here. @yabola
If the current expression is A and B, then the following results apply regardless of other expressions:
- A is BLOCK_MUST_MATCH and B is BLOCK_MUST_MATCH => BLOCK_MUST_MATCH
- A is BLOCK_MUST_MATCH and B is BLOCK_MIGHT_MATCH => BLOCK_MUST_MATCH
- A is BLOCK_MIGHT_MATCH and B is BLOCK_MUST_MATCH => BLOCK_MUST_MATCH
- A is BLOCK_MIGHT_MATCH and B is BLOCK_MIGHT_MATCH => BLOCK_MIGHT_MATCH
- A is BLOCK_CANNOT_MATCH or/and B is BLOCK_CANNOT_MATCH => BLOCK_CANNOT_MATCH
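The two comments above disagree on the MUST && MIGHT case. A hedged sketch (hypothetical helper name) of the conservative merge that the quoted PR code (left == BLOCK_MUST_MATCH && right == BLOCK_MUST_MATCH) implements, following yabola's argument that a MIGHT operand can still be refuted by a later filter level:

```java
static Boolean mergeAnd(Boolean left, Boolean right) {
  if (left == BLOCK_CANNOT_MATCH || right == BLOCK_CANNOT_MATCH) {
    return BLOCK_CANNOT_MATCH; // either side alone proves the block can be dropped
  }
  if (left == BLOCK_MUST_MATCH && right == BLOCK_MUST_MATCH) {
    return BLOCK_MUST_MATCH;   // both sides are exact, skip later filter levels
  }
  return BLOCK_MIGHT_MATCH;    // undecided, let the next filter level try
}
```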
```java
  // if left or right operation must need the block, then we must take the block
  return BLOCK_MUST_MATCH;
} else if (left == BLOCK_CANNOT_MATCH && right == BLOCK_CANNOT_MATCH) {
  // seems unintuitive to put an && not an || here
```
The comment does not match the code.
done
cc @zhongyujiang
@gszadovszky @shangxinli If you have time, please also take a look, thanks~
```java
// drop if value <= min
return stats.compareMinToValue(value) >= 0;
```
Minor: extra blank line.
```java
// drop if value < min
return stats.compareMinToValue(value) > 0;
```
Minor: extra blank line.
```java
private final Map<ColumnPath, ColumnChunkMetaData> columns = new HashMap<ColumnPath, ColumnChunkMetaData>();

public static boolean canDrop(FilterPredicate pred, List<ColumnChunkMetaData> columns, BloomFilterReader bloomFilterReader) {
```
Minor: This line can remain unchanged if we move the #predicate down.
```java
if (dictSet != null && !dictSet.contains(value)) {
  return BLOCK_CANNOT_MATCH;
}
if (dictSet != null && dictSet.contains(value)) {
  return BLOCK_MUST_MATCH;
}
```
```suggestion
if (dictSet != null) {
  return dictSet.contains(value) ? BLOCK_MUST_MATCH : BLOCK_CANNOT_MATCH;
}
```
```java
private static final boolean BLOCK_MIGHT_MATCH = false;
private static final boolean BLOCK_CANNOT_MATCH = true;

public static boolean canDrop(FilterPredicate pred, List<ColumnChunkMetaData> columns) {
```
Minor: This line can remain unchanged if we move the #predicate down.
```java
Boolean predicate = BLOCK_MIGHT_MATCH;
if (levels.contains(FilterLevel.STATISTICS)) {
  predicate = StatisticsFilter.predicate(filterPredicate, block.getColumns());
  if (isExactPredicate(predicate)) {
```
nit: name of isExactPredicate is a little bit unclear.
@wgtmac @gszadovszky
Thanks @yabola for coming up with this idea. Let's continue the discussion about the BloomFilter building idea in the jira. Meanwhile, I've been thinking about the actual problem as well. Currently, for row group filtering we check the min/max values first, which is correct since this is the fastest thing to do. Then the dictionary and then the bloom filter. The ordering of the latter two is not obvious to me in every scenario. At the time of filtering we have not started reading the actual row group, so there is no I/O advantage in reading the dictionary first. Furthermore, searching in the bloom filter is much faster than in the dictionary, and the size of the bloom filter is probably smaller than the size of the dictionary. Though, it would require some measurements to prove whether it is a good idea to consult the bloom filter before the dictionary. What do you think?
What I did in production is to issue async I/Os for the dictionaries (if all data pages are dictionary-encoded in that column chunk and the dictionary is not big) and the bloom filters of selected row groups in advance. The reason is to eliminate blocking I/O when pushing down the predicates. However, the parquet spec only records the offset of the bloom filter, so I also added the length of each bloom filter to the key-value metadata section (probably a good reason to add it to the spec as well?)
It is a good idea to adjust the filter order and prefer the lighter filters first. But I lack practical scenarios here and am not sure which is the better choice; I need to think more about it.
If we can decide exactly from the min/max statistics, we no longer need to load the dictionary from the filesystem and compare values one by one.
Similarly, the BloomFilter has to be loaded from the filesystem, which may cost time and memory. If we can exactly determine the existence/nonexistence of the value from the min/max or dictionary filters, we can avoid using the BloomFilter and improve performance.
For example, when searching for x1 in a block: if the min and max in the statistics are both greater than x1, we don't need to read the dictionary from the filesystem and compare one by one.
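A self-contained toy version of this x1 example, with plain ints standing in for the column statistics; the method name and the min=max refinement (mentioned earlier for StatisticsFilter) are illustrative:

```java
static String evaluateEqOnStats(int min, int max, int x1) {
  if (x1 < min || x1 > max) {
    return "BLOCK_CANNOT_MATCH"; // skip dictionary and bloom filter reads
  }
  if (min == max) {
    return "BLOCK_MUST_MATCH";   // the only value in the chunk is x1
  }
  return "BLOCK_MIGHT_MATCH";    // fall back to dictionary / bloom filter
}
```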