Conversation

@jbimbert
Contributor

In ParquetFilterPredicate, replaced canDrop with a ROWS_MATCH enum to keep the filter result information inside the row group.
This information allows the planner to prune the filter when all rows match.
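The idea described above can be sketched as follows. This is a simplified illustration based on this discussion, not Drill's actual implementation: a three-valued result replaces the boolean canDrop, so NONE still allows dropping the row group (the old canDrop == true), while ALL additionally lets the planner prune the filter itself.

```java
public class RowsMatchDemo {

    // Illustrative enum, simplified from the PR discussion.
    public enum RowsMatch { ALL, NONE, SOME }

    // AND semantics: one NONE branch makes the conjunction NONE; the
    // conjunction is ALL only when every branch is ALL.
    public static RowsMatch and(RowsMatch left, RowsMatch right) {
        if (left == RowsMatch.NONE || right == RowsMatch.NONE) {
            return RowsMatch.NONE;
        }
        return (left == RowsMatch.ALL && right == RowsMatch.ALL)
            ? RowsMatch.ALL : RowsMatch.SOME;
    }

    public static void main(String[] args) {
        System.out.println(and(RowsMatch.ALL, RowsMatch.ALL));  // ALL: filter can be pruned
        System.out.println(and(RowsMatch.ALL, RowsMatch.NONE)); // NONE: row group can be dropped
        System.out.println(and(RowsMatch.ALL, RowsMatch.SOME)); // SOME: keep both
    }
}
```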

@priteshm

priteshm commented Jun 1, 2018

@arina-ielchiieva, @vdiravka could one of you guys please review this?

/**
* Comparison predicates for parquet filter pushdown.
*/
public class ParquetComparisonPredicates {
Member

There is too much code duplicated in the original implementation and it needs to be refactored. Please see PR #1259.

Contributor Author

Hello VRozov,
you're right. I thought less code modification was better (my previous team was afraid of code refactoring...). I have refactored the class.
best regards

@arina-ielchiieva arina-ielchiieva self-requested a review June 4, 2018 10:18
testParquetFilterPushDown("t.`user`.hobby_ids[0] <> 1", 8, 6);
testParquetFilterPushDown("t.`user`.hobby_ids[2] > 20", 5, 3);
testParquetFilterPushDown("t.`user`.hobby_ids[0] between 10 and 20", 5, 4);
testParquetFilterPushDown("t.`user`.hobby_ids[0] <> 1", 8, 7);
Member

Why was the expected output changed?

Contributor Author

Hello Arina,
after the last refactoring, the outputs are no longer changed. Thank you for noticing.
best regards

return null;
}
logger.debug("materializedFilter : {}", ExpressionStringBuilder.toString(materializedFilter));
final ColumnExplorer columnExplorer = new ColumnExplorer(optionManager, columns);
Member

Please revert the indentation change. Should be 2.

Contributor Author

Done

for (LogicalExpression child : this) {
if (child instanceof ParquetFilterPredicate && ((ParquetFilterPredicate) child).canDrop(evaluator)) {
return true;
m = ((ParquetFilterPredicate) child).matches(evaluator);
Member

  1. It would be better if the variables m and temp had meaningful names; please consider the same in other places.
  2. I believe it's better to leave the instanceof check.
  3. Please consider rewriting the if statements below to be simpler.

Contributor Author

Done


protected final LogicalExpression right;

protected Statistics leftStat, rightStat;
Member

In Drill we prefer to declare each variable on a new line. Please consider applying the same in your code.

Contributor Author

Done

return true;
}

protected ROWS_MATCH matchesCond() {
Member

Please use the full name for the condition.

Contributor Author

Done

* <li>MISLEAD : filter can not be applied
* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, MISLEAD}
Member

Consider renaming mislead to not applicable or something similar.

Contributor Author

Done

}
public void setColumns(List<? extends ColumnMetadata> columns) { this.columns = columns; }

public ParquetFilterPredicate.ROWS_MATCH getRowValid() { return rowValid; }
Member

Please consider renaming the getter and setter methods to something more obvious.

Contributor Author

Done

// multirowgroup2 is a parquet file with 3 rowgroups inside. One with a=0, another with a=1 and a=2, and the last with a=3;
// FilterPushDown should be able to prune the filter from the scan operator according to the rowgroup statistics.
final String sql = "select * from dfs.`parquet/multirowgroup2.parquet` where ";
PlanTestBase.testPlanMatchingPatterns(sql + "a > 1", new String[]{"numRowGroups=2"}, new String[]{}); //No filter pruning
Member

You can pass null instead of new String[]{}, or consider using the overloaded method without the excluded pattern.

Contributor Author

Done

final RelNode newFilter = filter.copy(filter.getTraitSet(), ImmutableList.<RelNode>of(newScan));
call.transformTo(newFilter);

List<RowGroupInfo> rowGroupInfos = ((AbstractParquetGroupScan)newGroupScan).rowGroupInfos;
Member

  1. Please add a space after (AbstractParquetGroupScan).
  2. Please try to get rid of the explicit casting.

Contributor Author

Done

* <li>UNAPPLICABLE : filter can not be applied
* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, UNAPPLICABLE}
Member

inapplicable :)

Contributor Author

Done

for (LogicalExpression child : this) {
if (child instanceof ParquetFilterPredicate && ((ParquetFilterPredicate) child).canDrop(evaluator)) {
return true;
childMatch = ((ParquetFilterPredicate) child).matches(evaluator);
Member

Please check for instanceof as was done before.

Contributor Author

Done
Even if it cannot be anything else...

// "long" : as long as one branch is NOT ok to drop, we can NOT drop it.
if (! ((ParquetFilterPredicate) child).canDrop(evaluator)) {
return false;
childMatch = ((ParquetFilterPredicate) child).matches(evaluator);
Member

Please check for instanceof as was done before.

Contributor Author

Done. Even though it was not there before...

* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, INAPPLICABLE
}
Member

Please move to the previous line.

Contributor Author

Done

@arina-ielchiieva
Member

@jbimbert overall looks good, except for several minor comments. Please address them and squash the commits into one.

Also I have run Functional and Advanced tests for your branch and there is one test failure:

Query: /root/drillAutomation/drill-test-framework/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.sql
select * from dfs.drillTestDir.`parquet_storage/DRILL_6259/DRILL_6259_test_data` t where t.top_level_array[2] >= 20

Baseline: /root/drillAutomation/drill-test-framework/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.e_tsv
         Expected number of rows: 1
Actual number of rows from Drill: 2
         Number of matching rows: 1
          Number of rows missing: 0
       Number of rows unexpected: 1

These rows are not expected (first 10):
d3	3	1487165823000	[31,32]	{"nested_array":[false,true,false,false],"nested_complex_field":{"nested_array":[true,false,false,true]},"nested_list_of_complex_fields":[{"nested_array":[1131,1132,1133,1134]},{"nested_field":1485400063000,"nested_array":[1134,1135,1136,1137]},{"nested_field":1485488351000,"nested_array":[1137,1138,1139,1140]}]}	[{"nested_array":[131,132,133,134],"nested_complex_field":{"nested_array":[1485576639000,1485663099000]},"nested_list_of_complex_fields":[{"nested_array":[1131,1133]},{"nested_field":true,"nested_array":[1134,1135,1136,1137]},{"nested_field":false,"nested_array":[1137,1138,1139]}]},{"nested_field":true,"nested_array":[321,322,323],"nested_complex_field":{"nested_field":321,"nested_array":[1488255039000,1488255099000,1488341499000]},"nested_list_of_complex_fields":[{"nested_field":false,"nested_array":[1321,1322,1323]},{"nested_field":true,"nested_array":[1324,1325,1326]},{"nested_field":false,"nested_array":[1327,1328,1329]}]},{"nested_field":true,"nested_array":[331,332,333],"nested_complex_field":{"nested_field":331,"nested_array":[1489774239000,1489774299000,1489860699000]},"nested_list_of_complex_fields":[{"nested_field":false,"nested_array":[1331,1332,1333]},{"nested_field":true,"nested_array":[1334,1335,1336]},{"nested_field":true,"nested_array":[1337,1338,1339]}]}]

Data - https://github.com/mapr/drill-test-framework/tree/master/framework/resources/Datasources/parquet_storage/DRILL_6259/DRILL_6259_test_data
Query - https://github.com/mapr/drill-test-framework/blob/master/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.sql
Expected result - https://github.com/mapr/drill-test-framework/blob/master/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.e_tsv

Please fix it.

@arina-ielchiieva
Member

@agozhiy could you please confirm that expected result is correct? Or maybe we should expect two rows?

@agozhiy
Member

agozhiy commented Jun 5, 2018

@arina-ielchiieva, the d3 row shouldn't have been returned by the query, as top_level_array[2] value for it is null.

@arina-ielchiieva
Member

Thanks, Anton.

@jbimbert
Contributor Author

Tests DRILL_6259 are now OK.
The issue came from a query of the form "select * from t where t.col[2] > 1", where col[] is an array; in this case, we can't deduce the value of the column from the row group metadata, especially if col[2] is null, for example.
To correct this issue, I chose to set the filter result to ROWS_MATCH.SOME instead of ROWS_MATCH.ALL for array columns only.
This should impact only queries on array columns, which will run as usual (with no filter pruning).
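The guard described above might look like the following sketch (names and structure are hypothetical; the actual fix lives in Drill's row-group filtering code). Row-group min/max statistics for a repeated column aggregate every array element, so they cannot prove that a specific index such as col[2] matches for every row.

```java
public class RepeatedColumnGuard {

    // Illustrative enum, simplified from the PR discussion.
    public enum RowsMatch { ALL, NONE, SOME }

    // If the filter references an array (repeated) column, a row-group-level
    // "all rows match" verdict is not trustworthy: col[2] may be null or
    // absent for some rows even when the aggregated min/max suggest a match.
    // Downgrading ALL to SOME keeps the row group but prevents filter pruning.
    public static RowsMatch adjust(RowsMatch match, boolean filterUsesRepeatedColumn) {
        if (match == RowsMatch.ALL && filterUsesRepeatedColumn) {
            return RowsMatch.SOME;
        }
        return match;
    }

    public static void main(String[] args) {
        System.out.println(adjust(RowsMatch.ALL, true));   // SOME: run the filter as usual
        System.out.println(adjust(RowsMatch.ALL, false));  // ALL: safe to prune the filter
    }
}
```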

@arina-ielchiieva
Member

@jbimbert

  1. Please add a comment at the place in the code where you fixed the issue with arrays, explaining this case. Since you did not add the fix in a separate commit, it's hard to tell where you made the change.
  2. Please resolve the conflicts.
  3. Please add check for instanceof (I am writing this point for the third time!).
  4. Please squash commits into one.

@jbimbert
Contributor Author

The commit 0ac0825 has changed the same files as mine. I have to reconsider the solution. It will take a few days.
@arina: after the rebase, do you prefer that I keep the same PR or recreate one, as the modifications will be huge?

@arina-ielchiieva
Member

I don't see any reasons for creating new PR. Just update this one, do force push and keep one commit.

@jbimbert jbimbert force-pushed the DRILL-5796 branch 2 times, most recently from 85c4cf3 to d50fd56 Compare June 14, 2018 14:21
@jbimbert
Contributor Author

Hello,
Changes have been implemented. All comments taken into account (PR updated, force push and only one commit).
The automatic integration tests failed with the following error: The job exceeded the maximum time limit for jobs, and has been terminated.
Maybe the automatic integration tests should be relaunched (I don't know how to do that).

* @param fields columns used in the filter.
* @return true if one at least is an array, false otherwise.
*/
public boolean isRepeated(Set<SchemaPath> fields) {
Member

Where did this come from? I did not do such changes in DRILL-6259.

Contributor Author

I added this function to solve the issue of DRILL_6259_test_data and comply with your comment: "Please add comment in that place in code where you fixed issue with array, explaining this case. Since you did not add fix in a separate commit, it's hard to tell where you made the change"

continue;
}
// DRILL_6259 : in case of array, we can't know if the real value of a repeated column exists until it is really filtered, hence we can't prune the filter
if (rowGroup.getRowsMatch() == ROWS_MATCH.ALL && statCollector.isRepeated(schemaPathsInExpr)) {
Member

@arina-ielchiieva arina-ielchiieva Jun 14, 2018

statCollector.isRepeated(schemaPathsInExpr) -> I am not sure this is the right place to do the check. Previously everything was done during predicate evaluation; I think moving this here is incorrect. Please reconsider.

Contributor Author

Done

*/
public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator
implements ParquetFilterPredicate<C> {
public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator implements ParquetFilterPredicate<C> {
Member

Avoid format-only changes unless something is wrong with the way the code is formatted.

Contributor Author

Completely agreed.
I just applied the intellij-idea-settings.jar which is provided by the Drill team.
As you may notice, I didn't apply it to the other files modified in this PR, to avoid having too many changes which are not meaningful.

Member

That intellij-idea-settings.jar should be applied to the part of the code that is modified, not to the entire file or other files. Please revert back all format only changes.

Contributor Author

Done

super(left.getPosition());
this.left = left;
this.right = right;
this.predicate = predicate;
Member

What is/was wrong with the functional approach? Why is it necessary to use inheritance when delegation should work just fine here?

Contributor Author

BiPredicate is perfect in the case of a boolean, which is not the case here. I return an enum, which has no negate(), for example...
Of course, I could create a new functional interface specifically for this enum, with only one function, test(), but it would only make the code more complicated.
I agree with you on this point: delegation is better than inheritance.

Member

There is no need to create a new functional interface. Use BiFunction (optionally define RawMatchFunction<C> as a concrete specification of the BiFunction<Statistics<C>, Statistics<C>, ROWS_MATCH>)

Contributor Author

Done
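The reviewer's suggestion above can be sketched as follows. Stats and RowsMatch are simplified stand-ins for Parquet's Statistics type and the PR's enum (the real signature in the discussion is BiFunction<Statistics<C>, Statistics<C>, ROWS_MATCH>); the GT lambda is a hypothetical "left > right" comparison, not Drill's actual predicate code.

```java
import java.util.function.BiFunction;

public class RawMatchDemo {

    public enum RowsMatch { ALL, NONE, SOME }

    // Hypothetical minimal stand-in for the Parquet column statistics type.
    public static class Stats<C extends Comparable<C>> {
        final C min;
        final C max;
        public Stats(C min, C max) { this.min = min; this.max = max; }
    }

    // The suggestion: a named specialization of BiFunction rather than a
    // brand-new functional interface.
    public interface RawMatchFunction<C extends Comparable<C>>
        extends BiFunction<Stats<C>, Stats<C>, RowsMatch> {}

    // A "left > right" comparison expressed as a lambda over statistics.
    public static final RawMatchFunction<Integer> GT = (left, right) -> {
        if (left.min.compareTo(right.max) > 0) {
            return RowsMatch.ALL;   // every left value exceeds every right value
        }
        if (left.max.compareTo(right.min) <= 0) {
            return RowsMatch.NONE;  // no left value can exceed any right value
        }
        return RowsMatch.SOME;
    };

    public static void main(String[] args) {
        System.out.println(GT.apply(new Stats<>(10, 20), new Stats<>(1, 5))); // ALL
        System.out.println(GT.apply(new Stats<>(1, 2), new Stats<>(5, 9)));   // NONE
    }
}
```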

case INT:
return getStatistics( ((IntHolder)minFuncHolder).value, ((IntHolder)maxFuncHolder).value);
Statistics si = getStatistics( ((IntHolder)minFuncHolder).value, ((IntHolder)maxFuncHolder).value);
si.setNumNulls(input.getNumNulls());
Member

move si.setNumNulls() outside of the switch block.

Member

Before the switch: Statistics<?> statistics;
In each case except default: statistics = getStatistics(...);
After the switch: statistics.setNumNulls(...); return statistics;

Contributor Author

Done
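The refactoring sketched in the comments above might look like this. Types, type names, and the inline stand-in for getStatistics(...) are simplified assumptions, not Drill's actual signatures.

```java
public class StatisticsBuilder {

    // Hypothetical minimal statistics holder.
    public static class Stats {
        long min, max, numNulls;
        Stats(long min, long max) { this.min = min; this.max = max; }
        void setNumNulls(long numNulls) { this.numNulls = numNulls; }
    }

    // Declare the statistics once, assign it in each case, and set the null
    // count in a single place after the switch.
    public static Stats build(String minorType, long min, long max, long inputNumNulls) {
        Stats statistics;
        switch (minorType) {
            case "INT":
            case "BIGINT":
                statistics = new Stats(min, max); // stand-in for getStatistics(...)
                break;
            default:
                return null; // unsupported type: no statistics
        }
        // setNumNulls lives outside the switch, so it is written exactly once.
        statistics.setNumNulls(inputNumNulls);
        return statistics;
    }

    public static void main(String[] args) {
        Stats s = build("INT", 0, 42, 3);
        System.out.println(s.numNulls); // 3
    }
}
```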

* <li>ALL : all rows match the filter (can not drop the row group and prune filter)
* <li>NONE : no row matches the filter (can drop the row group)
* <li>SOME : some rows only match the filter (can not drop the row group nor the filter)
* <li>INAPPLICABLE : filter can not be applied
Member

Suppose I have set/group of rows and a filter. Applying the filter to the set of rows may result in an empty set (NONE), partial set (SOME) and the same set as original one (ALL). What does INAPPLICABLE mean? Does it mean the result is unknown? But does it really matter whether it is partial or unknown?

Contributor Author

Good question, which we discussed here as well.
The answer is that it does not matter right now. It would matter if we wanted to compute filter statistics accurately, for example, or if we wanted only a sample of the results (faster but inaccurate).
Suppressing INAPPLICABLE is possible, but we'd lose some information, of course; keeping it is a plus.

Member

Can you point me to the prior discussion? I don't see what information would be lost if INAPPLICABLE is merged with SOME and SOME means that a result of applying a filter to a set of rows is unknown (empty, all or partial). @arina-ielchiieva what is your take?

Member

Agree with Vlad, I don't see any value in keeping inapplicable.

Contributor Author

Merged INAPPLICABLE into SOME.
Done

* <li>INAPPLICABLE : filter can not be applied
* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, INAPPLICABLE}
Member

It is more common for Drill to use all-capital formatting only for enum values; the enum itself uses CamelCase notation.

Contributor Author

Done

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if min value is not false or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat)
Member

Under what condition will hasNonNullValue() && isAllNulls() be true? What unit test covers this use case?

Contributor Author

hasNonNullValue = true if min and max exist
isAllNulls = true if all rows are null values
testBooleanPredicate with File 0_0_3.parquet (contains only 3 null values)

Member

@jbimbert

  • If all rows are null, what are the values for min and max? Shouldn't hasNonNullValue be false?
  • Please point me to the specific test that validates that condition.
  • I would prefer to see a unit test, not an integration test. For this particular case, the integration test validates the results of a query, but it does not validate what the result of ((ParquetFilterPredicate<Boolean>)createIsFalsePredicate(expr)).canDrop(evaluator) is.

Member

@vrozov this is a valid case; the same issue is described in DRILL-6603, where we missed the hasNonNullValue check for the is null predicate and filtered out a row group that we shouldn't have.
@jbimbert isAllNulls is used more often in your code than in the previous version. In the previous version it was used only three times, and each time with a hasNonNullValue check. I am afraid that in your code the checks without hasNonNullValue may produce incorrect results. What if we modified the isAllNulls method to include the hasNonNullValue check inside?

Contributor Author

@vrozov java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_3.parquet gives ST:[num_nulls: 3, min/max not defined], which means hasNonNullValue=false but isAllNulls=true, which is what we are looking for. This happens each time we run
testBooleanPredicate()
@arina-ielchiieva isAllNulls would produce incorrect values if getNumNulls is undefined, which may happen whatever the return value of hasNonNullValue (see the stats of parquet file 0_0_3.parquet). Would it be possible that all the stats (num_nulls, min and max) are undefined for a file entering ParquetIsPredicate?

Member

@jbimbert to answer this question you can analyze org.apache.parquet.column.statistics.Statistics and see what can or cannot be done.

Member

@jbimbert @arina-ielchiieva OK, I miscounted parentheses in the modified expression.
@jbimbert What is the reason you changed the order of isAllNulls() and getMin()? Unless there is a reason, please keep the original order.

What unit test covers the following cases: (min:false, max:false, all nulls:false) and (min:false, max:true, all nulls:false)? What is the expected output in the first case and the second case?

Contributor Author

@jbimbert jbimbert Jul 16, 2018

@vrozov Yes, to have the same code in all lines. Easier to maintain.
Maybe you want java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_2.parquet with ST:[min: false, max: false, num_nulls: 0]
and java-exec/src/test/resources/parquetFilterPush/tfTbl/tt1.parquet with ST:[min: false, max: true, num_nulls: 0]
which are tested in testBooleanPredicate()

Member

@jbimbert

  • The original code was also the same, no need to change the order.
  • Please check the use cases I provided.

return RowsMatch.SOME;
}
}
if (hasNoNulls(exprStat)) {
Member

@arina-ielchiieva arina-ielchiieva Jul 16, 2018

@jbimbert, @vrozov
https://issues.apache.org/jira/browse/DRILL-6603 use case shows that statistics can be defined in the following way:
stat.isEmpty() returns false
stat.getNumNulls() returns 0
stat.genericGetMin() returns null
stat.genericGetMax() returns null
stat.toString returns num_nulls: 0, min/max not defined
In this case we mistakenly decide that such a row group does not contain null values and filter out the row group, but in reality the row group contains nulls.
Statistics toString method is defined as follows:

  @Override
  public String toString() {
    if (this.hasNonNullValue()) {
      if (isNumNullsSet()) {
        return String.format("min: %s, max: %s, num_nulls: %d", minAsString(), maxAsString(), this.getNumNulls());
      } else {
        return String.format("min: %s, max: %s, num_nulls not defined", minAsString(), maxAsString());
      }
    } else if (!this.isEmpty())
      return String.format("num_nulls: %d, min/max not defined", this.getNumNulls());
    else
      return "no stats for this column";
  }

As you can see, the toString method also uses stat.hasNonNullValue() to show whether the stat is defined.
In order to correctly apply is null we need to use stat.hasNonNullValue().

Contributor Author

In testBooleanPredicate, file java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_3.parquet which has ST:[num_nulls: 3, min/max not defined] and "row group 1: RC:3 TS:27 OFFSET:4", we get the correct value isAllNulls=true, because we compare getNumNulls with rowCount, which are both 3 in this case.

Member

@jbimbert as mentioned in https://issues.apache.org/jira/browse/DRILL-6603, @kkhatua tried your branch on the failed test and the test was still failing because the stat.hasNonNullValue() check is missing.

Member

Contributor Author

@arina-ielchiieva thanks for the file. Running it, it turns out that the parquet files have ST:[no stats for this column], which is interpreted as "num_nulls=0, min/max undefined" in RangeExprEvaluator.visitTypedFieldExpr:
if (columnStatistics != null) {
return columnStatistics.getStatistics();
}
...
This leads to the issue, since num_nulls is not defined in this case by the parquet row group stats, but it is certainly not 0 (in this case, num_nulls > 0).

Member

@arina-ielchiieva I do not suggest that Drill should not support filter push down for parquet files created by any prior version. My suggestion is to see (debug) where num_nulls is not set correctly. In any case, it is not valid to rely on hasNonNullValue(), as hasNonNullValue() may be false or true while num_nulls is 0 in both cases.

Note that hasNoNulls() is the inversion of hasNulls() condition, but the exact meaning is that either number of nulls is 0 or number of nulls is not set.
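That ambiguity can be made concrete with a small sketch. The numNulls encoding follows Parquet's convention of -1 for "not set"; the helper names are illustrative, not Drill's actual API.

```java
public class NullCountDemo {

    // Mirrors a hasNulls()-style check: true only when the null count is
    // known and positive.
    public static boolean hasNulls(long numNulls) {
        return numNulls > 0;
    }

    // The inversion discussed above: true for both "known to be 0" (safe to
    // rely on) and "not set" (-1, which must NOT be treated as "no nulls").
    public static boolean hasNoNulls(long numNulls) {
        return !hasNulls(numNulls);
    }

    public static void main(String[] args) {
        System.out.println(hasNoNulls(0));  // true: genuinely no nulls
        System.out.println(hasNoNulls(-1)); // true: null count unknown!
        System.out.println(hasNoNulls(3));  // false
    }
}
```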

Member

@vrozov it turns out I was misled by the assumption that the parquet files had an incorrectly defined number of nulls. It turned out that it was actually Drill that read the parquet stats and then re-created the stats objects with the number of nulls set to 0 instead of the original -1. I'll prepare a separate PR with the fix for it.
@jbimbert sorry for the confusion and thank you for your patience; please remove the stat.hasNonNullValue() checks where you have added them.

Contributor Author

@jbimbert jbimbert Jul 17, 2018

@arina-ielchiieva If I remove hasNonNullValue from createIsNullPredicate, the test mentioned in https://issues.apache.org/jira/browse/DRILL-6603 fails.
I can remove it in ParquetComparisonPredicate.checkNull(), but this function is used a lot...
I can also remove it from ParquetIsPredicate.checkNull(), with the same comment.
For safety, until the bug is fixed, maybe we had better keep it as is?
(In the other places, it is used to ensure getMin/getMax is valid)

Member

I am working on a PR to fix the test failure mentioned in DRILL-6603.
Please do not remove stat.hasNonNullValue() where it originally existed. Just remove it where you added the check when I asked you yesterday.

Contributor Author

Done. hasNonNullValue() removed in both checkNull() and createIsNullPredicate.

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if min value is not false or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat)
Member

@jbimbert

  • The original code had the same order; there is no need to change it.
  • Please check the use cases I provided.

return true;
return RowsMatch.NONE;
}
if (leftStat.hasNonNullValue() && !leftStat.hasNonNullValue() || rightStat.hasNonNullValue() && !rightStat.hasNonNullValue()) {
Member

When does the condition evaluate to true?

Contributor Author

Yes, I noticed, but the commit was already pushed and I had to wait for the tests (45 min).
Corrected in the new push.

Member

Please do your own code review and address all review concerns and let me know when the PR is ready for the code review.

@jbimbert jbimbert force-pushed the DRILL-5796 branch 2 times, most recently from fd2dc65 to 8952c1a Compare July 17, 2018 13:32
@arina-ielchiieva
Member

@vrozov what status of this PR?

@vrozov
Member

vrozov commented Jul 19, 2018

I am waiting for @jbimbert to address review comments and notify that the PR is ready for the review.

@jbimbert
Contributor Author

@vrozov Hello, please review this PR

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if max value is not true or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat)
Member

Please make sure to keep the original order of expressions unless it needs to be changed for a reason. If there is a reason to change it here or in any other places, please provide an explanation. I don't see why it needs to be changed.

Contributor Author

OK. Reversed.

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if min value is not false or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() ? RowsMatch.NONE : checkNull(exprStat)
Member

Please see my comment

What unit test covers the following cases: (min:false, max:false, all nulls:false) and (min:false, max:true, all nulls:false)? What is the expected output in the first case and the second case?
I still don't understand how ALL can be returned here without checking max to be false in a case when number of nulls is 0.

Contributor Author

blnTbl/0_0_1.parquet => ST:[min: false, max: false, num_nulls: 0] : 8 tests in testBooleanPredicate()
tfTbl/ft0.parquet => ST:[min: false, max: true, num_nulls: 0] : 4 tests in testBooleanPredicate

example1:
select * from java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_1.parquet where col_bln is false returns (false, false, false)

example2:
select * from java-exec/src/test/resources/parquetFilterPush/tfTbl/ft0.parquet where a is true [resp. false] returns true [resp. false]

Finally, when running the query
select * from dfs.tmp.blnTbl where col_bln is false
with blnTbl contains only 0_0_0.parquet (T,T,T) and 0_0_1.parquet (F,F,F)
the physical plan reads:

  00-00 Screen : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.3 rows, 12.3 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 523
  00-01 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.0 rows, 12.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 522
  00-02 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {6.0 rows, 9.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 521
  00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/blnTbl/0_0_1.parquet]], selectionRoot=file:/tmp/blnTbl, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[**]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY col_bln): rowcount = 3.0, cumulative cost = {3.0 rows, 6.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 520
No more filter since it returns NONE for 0_0_0.parquet and ALL for 0_0_1.parquet.
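The pruning idea behind this plan can be sketched as follows (a simplified illustration, not Drill's actual AbstractParquetGroupScan/ParquetPushDownFilter code; the prune method and filterNeeded flag are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;

public class PruneSketch {
    public enum RowsMatch { NONE, SOME, ALL }

    /**
     * Keeps only row groups that may contain matching rows and reports
     * whether the Filter operator is still needed above the scan:
     * filterNeeded[0] is false only when every kept row group matched ALL.
     */
    public static List<RowsMatch> prune(List<RowsMatch> rowGroups, boolean[] filterNeeded) {
        List<RowsMatch> kept = new ArrayList<>();
        boolean allMatch = true;
        for (RowsMatch m : rowGroups) {
            if (m == RowsMatch.NONE) {
                continue; // no row in this group can match: drop the group
            }
            kept.add(m);
            if (m != RowsMatch.ALL) {
                allMatch = false; // at least one SOME: filter must still run
            }
        }
        filterNeeded[0] = !allMatch;
        return kept;
    }
}
```

For the blnTbl example, 0_0_0.parquet evaluates to NONE and 0_0_1.parquet to ALL, so only one row group remains and the filter disappears from the plan.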

Member

I don't understand how it works for the use cases you provided, but I would expect the following unit test to pass while it actually fails:

  public void testSOME() {
    LogicalExpression le = mock(LogicalExpression.class);
    BooleanStatistics booleanStatistics = mock(BooleanStatistics.class);
    doReturn(booleanStatistics).when(le).accept(ArgumentMatchers.any(), ArgumentMatchers.any());
    RangeExprEvaluator<Boolean> re = mock(RangeExprEvaluator.class);
    when(re.getRowCount()).thenReturn(Long.valueOf(2)); // 2 rows
    when(booleanStatistics.isEmpty()).thenReturn(FALSE); // stat is not empty
    when(booleanStatistics.isNumNullsSet()).thenReturn(TRUE); // num_nulls set
    when(booleanStatistics.getNumNulls()).thenReturn(Long.valueOf(0)); // no nulls
    when(booleanStatistics.hasNonNullValue()).thenReturn(TRUE); // min/max set
    when(booleanStatistics.getMin()).thenReturn(Boolean.FALSE); // min false
    when(booleanStatistics.getMax()).thenReturn(TRUE); // max true

    assertEquals(SOME, ((ParquetIsPredicate<Boolean>) createIsPredicate(IS_TRUE, le)).matches(re));
    assertEquals(SOME, ((ParquetIsPredicate<Boolean>) createIsPredicate(IS_FALSE, le)).matches(re));
  }
java.lang.AssertionError: 
Expected :SOME
Actual   :ALL

I would prefer to see similar unit tests and not the integration tests that you point me to, which may succeed for a different reason (I'd like to see an explanation of why they succeed as well).

Contributor Author

OK, I found the reason why the tests pass:

  1. We need several parquet files, else the process is short-circuited in AbstractParquetGroupScan.applyFilter.
  2. We need some of the parquet files to be dropped again in AbstractParquetGroupScan.applyFilter:
    if (qualifiedRGs.size() == rowGroupInfos.size() ) { return null } ...
  3. If at least one of the row groups is SOME, then the filter is applied to all, in ParquetPushDownFilter.doOnMatch L 179.

These 3 conditions together make the tests pass.
One way to check that they fail is to put ft0.parquet, ft0.parquet and tt1.parquet in the same folder and run an IS TRUE predicate: the result then reads F, T, F, T (wrong) instead of F, F (expected)!

I have then written the IS TRUE, IS FALSE, IS NOT TRUE and IS NOT FALSE predicates based on the cases:
a. ST:[min: true, max: true, num_nulls: ?]
b. ST:[min: false, max: false, num_nulls: ?]
c. ST:[min: false, max: true, num_nulls: ?]
d. num_nulls = RC (row count)
and checked all of these cases.

I also introduced 4 helper functions for code readability: minIsTrue, minIsFalse, maxIsTrue and maxIsFalse

Member

Please add unit testing. As you can see, integration tests may result in false positives.

Contributor Author

Done. Added 12 unit tests for the cases:
a. ST:[min: true, max: true, num_nulls: 0]
b. ST:[min: false, max: false, num_nulls: 0]
c. ST:[min: false, max: true, num_nulls: 0]

@jbimbert jbimbert force-pushed the DRILL-5796 branch 2 times, most recently from 40f3495 to 98f1a89 Compare July 23, 2018 08:51
/**
* Return true if exprStat.getMin is defined and true
*/
private static Boolean minIsTrue(Statistics exprStat) { return exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin(); }
Member

why Boolean?

Contributor Author

Functions removed because of the next comment.

isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
);
return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> {
if (isAllNulls(exprStat, evaluator.getRowCount()) || maxIsFalse(exprStat)) {
Member

Both maxIsFalse and minIsFalse check for hasNonNullValue. Consider re-writing with a single check for hasNonNullValue:

  • if all nulls, return NONE
  • if min/max undefined, return SOME
  • return NONE/SOME/ALL based on min/max.
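A sketch of that suggested structure for the IS FALSE predicate (RowsMatch and the NONE/SOME/ALL outcomes follow the PR's naming; BoolStats and isFalseMatches below are simplified, hypothetical stand-ins for Parquet's BooleanStatistics and Drill's predicate lambda, not the real classes):

```java
public class IsFalseSketch {
    public enum RowsMatch { NONE, SOME, ALL }

    /** Simplified stand-in for BooleanStatistics plus the row count. */
    public static class BoolStats {
        final boolean hasNonNullValue; // are min/max defined?
        final boolean min, max;
        final long numNulls, rowCount;

        public BoolStats(boolean hasNonNullValue, boolean min, boolean max,
                         long numNulls, long rowCount) {
            this.hasNonNullValue = hasNonNullValue;
            this.min = min;
            this.max = max;
            this.numNulls = numNulls;
            this.rowCount = rowCount;
        }
    }

    public static RowsMatch isFalseMatches(BoolStats s) {
        if (s.numNulls == s.rowCount) {
            return RowsMatch.NONE;     // all nulls: IS FALSE matches nothing
        }
        if (!s.hasNonNullValue) {
            return RowsMatch.SOME;     // min/max undefined: cannot decide
        }
        if (s.min) {
            return RowsMatch.NONE;     // min is true: every value is true
        }
        if (!s.max && s.numNulls == 0) {
            return RowsMatch.ALL;      // every value is false and non-null
        }
        return RowsMatch.SOME;         // mixed values, or some nulls present
    }
}
```

With a single hasNonNullValue check up front, the (min: false, max: true, num_nulls: 0) case from the mocked unit test above yields SOME rather than ALL.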

Contributor Author

Done

call.transformTo(newFilter);
}

final RelNode newFilter = filter.copy(filter.getTraitSet(), ImmutableList.<RelNode>of(newScan));
Member

How it works in case newGroupScan is not an instance of AbstractParquetGroupScan? Will not filter.copy be called twice?

Contributor Author

Exactly. Removed the duplicate lines.

…lter result information, used to prune the filter if all rows match.
@jbimbert
Contributor Author

@vrozov Hello, please review this PR. Best regards

@vrozov
Member

vrozov commented Jul 27, 2018

@jbimbert Please fix merge/rebase issue.

@jbimbert
Contributor Author

@vrozov Done.

Member

@vrozov vrozov left a comment

ready to commit

@ilooner ilooner closed this in efd6d29 Aug 1, 2018
mattpollack pushed a commit to mattpollack/drill that referenced this pull request Feb 25, 2019
…lter result information, used to prune the filter if all rows match.

closes apache#1298