Conversation

@jbimbert
Contributor

In ParquetFilterPredicate, replaced canDrop with a ROWS_MATCH enum to keep the filter result information inside the row group.
This information allows the planner to prune the filter when all rows match.
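The idea described above can be sketched as follows. This is a simplified illustration based on this discussion, not Drill's actual implementation: a three-valued result replaces the boolean canDrop, so NONE still allows dropping the row group (the old canDrop == true), while ALL additionally lets the planner prune the filter itself.

```java
public class RowsMatchDemo {

    // Illustrative enum, simplified from the PR discussion.
    public enum RowsMatch { ALL, NONE, SOME }

    // AND semantics: one NONE branch makes the conjunction NONE; the
    // conjunction is ALL only when every branch is ALL.
    public static RowsMatch and(RowsMatch left, RowsMatch right) {
        if (left == RowsMatch.NONE || right == RowsMatch.NONE) {
            return RowsMatch.NONE;
        }
        return (left == RowsMatch.ALL && right == RowsMatch.ALL)
            ? RowsMatch.ALL : RowsMatch.SOME;
    }

    public static void main(String[] args) {
        System.out.println(and(RowsMatch.ALL, RowsMatch.ALL));  // ALL: filter can be pruned
        System.out.println(and(RowsMatch.ALL, RowsMatch.NONE)); // NONE: row group can be dropped
        System.out.println(and(RowsMatch.ALL, RowsMatch.SOME)); // SOME: keep both
    }
}
```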

@priteshm

priteshm commented Jun 1, 2018

@arina-ielchiieva, @vdiravka could one of you guys please review this?

/**
* Comparison predicates for parquet filter pushdown.
*/
public class ParquetComparisonPredicates {
Member

There is too much code duplicated in the original implementation and it needs to be refactored. Please see PR #1259.

Contributor Author

Hello VRozov,
you're right. I thought less code modification was better (my previous team was afraid of code refactoring...). I have refactored the class.
best regards

@arina-ielchiieva arina-ielchiieva self-requested a review June 4, 2018 10:18
testParquetFilterPushDown("t.`user`.hobby_ids[0] <> 1", 8, 6);
testParquetFilterPushDown("t.`user`.hobby_ids[2] > 20", 5, 3);
testParquetFilterPushDown("t.`user`.hobby_ids[0] between 10 and 20", 5, 4);
testParquetFilterPushDown("t.`user`.hobby_ids[0] <> 1", 8, 7);
Member

Why was the expected output changed?

Contributor Author

Hello Arina,
after the last refactoring, the outputs are no longer changed. Thank you for noticing.
best regards

return null;
}
logger.debug("materializedFilter : {}", ExpressionStringBuilder.toString(materializedFilter));
final ColumnExplorer columnExplorer = new ColumnExplorer(optionManager, columns);
Member

Please revert the indentation change. Should be 2.

Contributor Author

Done

for (LogicalExpression child : this) {
if (child instanceof ParquetFilterPredicate && ((ParquetFilterPredicate) child).canDrop(evaluator)) {
return true;
m = ((ParquetFilterPredicate) child).matches(evaluator);
Member

  1. It would be better if the variables m and temp had meaningful names; please consider the same in other places.
  2. I believe it's better to leave the instanceof check.
  3. Please consider rewriting the if statements below to be simpler.

Contributor Author

Done


protected final LogicalExpression right;

protected Statistics leftStat, rightStat;
Member

In Drill we prefer to declare each variable on a new line. Please consider applying the same in your code.

Contributor Author

Done

return true;
}

protected ROWS_MATCH matchesCond() {
Member

Please use the full name for the condition.

Contributor Author

Done

* <li>MISLEAD : filter can not be applied
* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, MISLEAD}
Member

Consider renaming mislead to not applicable or something similar.

Contributor Author

Done

}
public void setColumns(List<? extends ColumnMetadata> columns) { this.columns = columns; }

public ParquetFilterPredicate.ROWS_MATCH getRowValid() { return rowValid; }
Member

Please consider renaming the getter and setter methods to something more obvious.

Contributor Author

Done

// multirowgroup2 is a parquet file with 3 rowgroups inside. One with a=0, another with a=1 and a=2, and the last with a=3;
// FilterPushDown should be able to prune the filter from the scan operator according to the rowgroup statistics.
final String sql = "select * from dfs.`parquet/multirowgroup2.parquet` where ";
PlanTestBase.testPlanMatchingPatterns(sql + "a > 1", new String[]{"numRowGroups=2"}, new String[]{}); //No filter pruning
Member

You can pass null instead of new String[]{}, or consider using the overloaded method without the excluded pattern.

Contributor Author

Done

final RelNode newFilter = filter.copy(filter.getTraitSet(), ImmutableList.<RelNode>of(newScan));
call.transformTo(newFilter);

List<RowGroupInfo> rowGroupInfos = ((AbstractParquetGroupScan)newGroupScan).rowGroupInfos;
Member

  1. Please add a space after (AbstractParquetGroupScan).
  2. Please try to get rid of the explicit casting.

Contributor Author

Done

* <li>UNAPPLICABLE : filter can not be applied
* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, UNAPPLICABLE}
Member

inapplicable :)

Contributor Author

Done

for (LogicalExpression child : this) {
if (child instanceof ParquetFilterPredicate && ((ParquetFilterPredicate) child).canDrop(evaluator)) {
return true;
childMatch = ((ParquetFilterPredicate) child).matches(evaluator);
Member

Please check for instanceof as was done before.

Contributor Author

Done
Even if it cannot be anything else...

// "long" : as long as one branch is NOT ok to drop, we can NOT drop it.
if (! ((ParquetFilterPredicate) child).canDrop(evaluator)) {
return false;
childMatch = ((ParquetFilterPredicate) child).matches(evaluator);
Member

Please check for instanceof as was done before.

Contributor Author

Done. Even though it was not there before...

* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, INAPPLICABLE
}
Member

Please move to the previous line.

Contributor Author

Done

@arina-ielchiieva
Member

@jbimbert overall looks good, except for several minor comments. Please address them and squash the commits into one.

Also I have run Functional and Advanced tests for your branch and there is one test failure:

Query: /root/drillAutomation/drill-test-framework/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.sql
select * from dfs.drillTestDir.`parquet_storage/DRILL_6259/DRILL_6259_test_data` t where t.top_level_array[2] >= 20

Baseline: /root/drillAutomation/drill-test-framework/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.e_tsv
         Expected number of rows: 1
Actual number of rows from Drill: 2
         Number of matching rows: 1
          Number of rows missing: 0
       Number of rows unexpected: 1

These rows are not expected (first 10):
d3	3	1487165823000	[31,32]	{"nested_array":[false,true,false,false],"nested_complex_field":{"nested_array":[true,false,false,true]},"nested_list_of_complex_fields":[{"nested_array":[1131,1132,1133,1134]},{"nested_field":1485400063000,"nested_array":[1134,1135,1136,1137]},{"nested_field":1485488351000,"nested_array":[1137,1138,1139,1140]}]}	[{"nested_array":[131,132,133,134],"nested_complex_field":{"nested_array":[1485576639000,1485663099000]},"nested_list_of_complex_fields":[{"nested_array":[1131,1133]},{"nested_field":true,"nested_array":[1134,1135,1136,1137]},{"nested_field":false,"nested_array":[1137,1138,1139]}]},{"nested_field":true,"nested_array":[321,322,323],"nested_complex_field":{"nested_field":321,"nested_array":[1488255039000,1488255099000,1488341499000]},"nested_list_of_complex_fields":[{"nested_field":false,"nested_array":[1321,1322,1323]},{"nested_field":true,"nested_array":[1324,1325,1326]},{"nested_field":false,"nested_array":[1327,1328,1329]}]},{"nested_field":true,"nested_array":[331,332,333],"nested_complex_field":{"nested_field":331,"nested_array":[1489774239000,1489774299000,1489860699000]},"nested_list_of_complex_fields":[{"nested_field":false,"nested_array":[1331,1332,1333]},{"nested_field":true,"nested_array":[1334,1335,1336]},{"nested_field":true,"nested_array":[1337,1338,1339]}]}]

Data - https://github.com/mapr/drill-test-framework/tree/master/framework/resources/Datasources/parquet_storage/DRILL_6259/DRILL_6259_test_data
Query - https://github.com/mapr/drill-test-framework/blob/master/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.sql
Expected result - https://github.com/mapr/drill-test-framework/blob/master/framework/resources/Functional/filter/pushdown/data/DRILL_6259_filter_push_down_top_level_array.e_tsv

Please fix it.

@arina-ielchiieva
Member

@agozhiy could you please confirm that expected result is correct? Or maybe we should expect two rows?

@agozhiy
Member

agozhiy commented Jun 5, 2018

@arina-ielchiieva, the d3 row shouldn't have been returned by the query, as top_level_array[2] value for it is null.

@arina-ielchiieva
Member

Thanks, Anton.

@jbimbert
Contributor Author

Tests DRILL_6259 are now OK.
The issue came from a query of the form "select * from t where t.col[2] > 1", where col[] is an array; in this case, we can't deduce the value of the column from the row group metadata, especially if col[2] is null, for example.
To correct this issue, I chose to set the filter result to ROWS_MATCH.SOME instead of ROWS_MATCH.ALL for array columns only.
This should impact only queries on array columns, which will run as usual (with no filter pruning).
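The guard described above might look like the following sketch (names and structure are hypothetical; the actual fix lives in Drill's row-group filtering code). Row-group min/max statistics for a repeated column aggregate every array element, so they cannot prove that a specific index such as col[2] matches for every row.

```java
public class RepeatedColumnGuard {

    // Illustrative enum, simplified from the PR discussion.
    public enum RowsMatch { ALL, NONE, SOME }

    // If the filter references an array (repeated) column, a row-group-level
    // "all rows match" verdict is not trustworthy: col[2] may be null or
    // absent for some rows even when the aggregated min/max suggest a match.
    // Downgrading ALL to SOME keeps the row group but prevents filter pruning.
    public static RowsMatch adjust(RowsMatch match, boolean filterUsesRepeatedColumn) {
        if (match == RowsMatch.ALL && filterUsesRepeatedColumn) {
            return RowsMatch.SOME;
        }
        return match;
    }

    public static void main(String[] args) {
        System.out.println(adjust(RowsMatch.ALL, true));   // SOME: run the filter as usual
        System.out.println(adjust(RowsMatch.ALL, false));  // ALL: safe to prune the filter
    }
}
```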

@arina-ielchiieva
Member

@jbimbert

  1. Please add a comment at the place in the code where you fixed the issue with arrays, explaining this case. Since you did not add the fix in a separate commit, it's hard to tell where you made the change.
  2. Please resolve the conflicts.
  3. Please add check for instanceof (I am writing this point for the third time!).
  4. Please squash commits into one.

@jbimbert
Contributor Author

The commit 0ac0825 has changed the same files as mine. I have to reconsider the solution. It will take a few days.
@arina: after the rebase, do you prefer that I keep the same PR or recreate one, as the modifications will be huge?

@arina-ielchiieva
Member

I don't see any reasons for creating new PR. Just update this one, do force push and keep one commit.

@jbimbert jbimbert force-pushed the DRILL-5796 branch 2 times, most recently from 85c4cf3 to d50fd56 Compare June 14, 2018 14:21
@jbimbert
Contributor Author

Hello,
Changes have been implemented. All comments taken into account (PR updated, force push and only one commit).
The automatic integration tests failed with the following error: The job exceeded the maximum time limit for jobs, and has been terminated.
Maybe the automatic integration tests should be relaunched (I don't know how to do that).

* @param fields columns used in the filter.
* @return true if one at least is an array, false otherwise.
*/
public boolean isRepeated(Set<SchemaPath> fields) {
Member

Where did this come from? I did not do such changes in DRILL-6259.

Contributor Author

I added this function to solve the issue of DRILL_6259_test_data and comply with your comment: "Please add comment in that place in code where you fixed issue with array, explaining this case. Since you did not add fix in a separate commit, it's hard to tell where you made the change"

continue;
}
// DRILL_6259 : in case of array, we can't know if the real value of a repeated column exists until it is really filtered, hence we can't prune the filter
if (rowGroup.getRowsMatch() == ROWS_MATCH.ALL && statCollector.isRepeated(schemaPathsInExpr)) {
Member

@arina-ielchiieva arina-ielchiieva Jun 14, 2018

statCollector.isRepeated(schemaPathsInExpr) -> I am not sure this is the right place to do the check. Previously everything was done during predicate evaluation; I think moving this here is incorrect. Please reconsider.

Contributor Author

Done

*/
public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator
implements ParquetFilterPredicate<C> {
public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator implements ParquetFilterPredicate<C> {
Member

Avoid format-only changes unless something is wrong with the way the code is formatted.

Contributor Author

Completely agreed.
I just applied the intellij-idea-settings.jar which is provided by the Drill team.
As you may notice, I didn't apply it to the other files modified in this PR, to avoid having too many changes which are not meaningful.

Member

That intellij-idea-settings.jar should be applied to the part of the code that is modified, not to the entire file or other files. Please revert back all format only changes.

Contributor Author

Done

super(left.getPosition());
this.left = left;
this.right = right;
this.predicate = predicate;
Member

What is/was wrong with the functional approach? Why is it necessary to use inheritance when delegation should work just fine here?

Contributor Author

BiPredicate is perfect in the case of a boolean, which is not the case here. I return an enum, which has no negate(), for example...
Of course, I could create a new functional interface specifically for this enum, with only one function, test(), but it would only make the code more complicated.
I agree with you on this point: delegation is better than inheritance.

Member

There is no need to create a new functional interface. Use BiFunction (optionally define RawMatchFunction<C> as a concrete specification of the BiFunction<Statistics<C>, Statistics<C>, ROWS_MATCH>)

Contributor Author

Done
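The reviewer's suggestion above can be sketched as follows. Stats and RowsMatch are simplified stand-ins for Parquet's Statistics type and the PR's enum (the real signature in the discussion is BiFunction<Statistics<C>, Statistics<C>, ROWS_MATCH>); the GT lambda is a hypothetical "left > right" comparison, not Drill's actual predicate code.

```java
import java.util.function.BiFunction;

public class RawMatchDemo {

    public enum RowsMatch { ALL, NONE, SOME }

    // Hypothetical minimal stand-in for the Parquet column statistics type.
    public static class Stats<C extends Comparable<C>> {
        final C min;
        final C max;
        public Stats(C min, C max) { this.min = min; this.max = max; }
    }

    // The suggestion: a named specialization of BiFunction rather than a
    // brand-new functional interface.
    public interface RawMatchFunction<C extends Comparable<C>>
        extends BiFunction<Stats<C>, Stats<C>, RowsMatch> {}

    // A "left > right" comparison expressed as a lambda over statistics.
    public static final RawMatchFunction<Integer> GT = (left, right) -> {
        if (left.min.compareTo(right.max) > 0) {
            return RowsMatch.ALL;   // every left value exceeds every right value
        }
        if (left.max.compareTo(right.min) <= 0) {
            return RowsMatch.NONE;  // no left value can exceed any right value
        }
        return RowsMatch.SOME;
    };

    public static void main(String[] args) {
        System.out.println(GT.apply(new Stats<>(10, 20), new Stats<>(1, 5))); // ALL
        System.out.println(GT.apply(new Stats<>(1, 2), new Stats<>(5, 9)));   // NONE
    }
}
```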

case INT:
return getStatistics( ((IntHolder)minFuncHolder).value, ((IntHolder)maxFuncHolder).value);
Statistics si = getStatistics( ((IntHolder)minFuncHolder).value, ((IntHolder)maxFuncHolder).value);
si.setNumNulls(input.getNumNulls());
Member

move si.setNumNulls() outside of the switch block.

Member

Before the switch: Statistics<?> statistics;
In each case except default: statistics = getStatistics(...);
After the switch: statistics.setNumNulls(...); return statistics;

Contributor Author

Done
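The refactoring sketched in the comments above might look like this. Types, type names, and the inline stand-in for getStatistics(...) are simplified assumptions, not Drill's actual signatures.

```java
public class StatisticsBuilder {

    // Hypothetical minimal statistics holder.
    public static class Stats {
        long min, max, numNulls;
        Stats(long min, long max) { this.min = min; this.max = max; }
        void setNumNulls(long numNulls) { this.numNulls = numNulls; }
    }

    // Declare the statistics once, assign it in each case, and set the null
    // count in a single place after the switch.
    public static Stats build(String minorType, long min, long max, long inputNumNulls) {
        Stats statistics;
        switch (minorType) {
            case "INT":
            case "BIGINT":
                statistics = new Stats(min, max); // stand-in for getStatistics(...)
                break;
            default:
                return null; // unsupported type: no statistics
        }
        // setNumNulls lives outside the switch, so it is written exactly once.
        statistics.setNumNulls(inputNumNulls);
        return statistics;
    }

    public static void main(String[] args) {
        Stats s = build("INT", 0, 42, 3);
        System.out.println(s.numNulls); // 3
    }
}
```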

* <li>ALL : all rows match the filter (can not drop the row group and prune filter)
* <li>NONE : no row matches the filter (can drop the row group)
* <li>SOME : some rows only match the filter (can not drop the row group nor the filter)
* <li>INAPPLICABLE : filter can not be applied
Member

Suppose I have set/group of rows and a filter. Applying the filter to the set of rows may result in an empty set (NONE), partial set (SOME) and the same set as original one (ALL). What does INAPPLICABLE mean? Does it mean the result is unknown? But does it really matter whether it is partial or unknown?

Contributor Author

Good question, which we discussed here as well.
The answer is that it does not matter right now. It would matter if we wanted to compute filter statistics accurately, for example, or if we wanted only a sample of the results (faster but inaccurate).
Suppressing INAPPLICABLE is possible, but we'd lose some information, of course; keeping it is a plus.

Member

Can you point me to the prior discussion? I don't see what information would be lost if INAPPLICABLE is merged with SOME and SOME means that a result of applying a filter to a set of rows is unknown (empty, all or partial). @arina-ielchiieva what is your take?

Member

Agree with Vlad, I don't see any value in keeping inapplicable.

Contributor Author

Merged INAPPLICABLE into SOME.
Done

* <li>INAPPLICABLE : filter can not be applied
* </ul>
*/
enum ROWS_MATCH {ALL, NONE, SOME, INAPPLICABLE}
Member

It is more common for Drill to use all-capital formatting only for enum values; the enum itself uses CamelCase notation.

Contributor Author

Done

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if min value is not false or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat)
Member

Under what condition will hasNonNullValue() && isAllNulls() be true? What unit test covers this use case?

Contributor Author

hasNonNullValue = true if min and max exist
isAllNulls = true if all rows are null values
testBooleanPredicate with File 0_0_3.parquet (contains only 3 null values)

Member

@jbimbert

  • If all rows are null, what are the values for min and max? Shouldn't hasNonNullValue be false?
  • Please point me to the specific test that validates that condition.
  • I would prefer to see a unit test, not an integration test. For this particular case, the integration test validates the results of a query, but it does not validate what the result of ((ParquetFilterPredicate<Boolean>)createIsFalsePredicate(expr)).canDrop(evaluator) is.

Member

@vrozov this is a valid case; the same issue is described in DRILL-6603, where we missed the hasNonNullValue check for the is null predicate and filtered out a row group that we shouldn't have.
@jbimbert isAllNulls is used more often in your code than in the previous version. In the previous version it was used only three times, and each time with a hasNonNullValue check. I am afraid that in your code the checks without hasNonNullValue may produce incorrect results. What if we modified the isAllNulls method to include the hasNonNullValue check inside?

Contributor Author

@vrozov java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_3.parquet gives ST:[num_nulls: 3, min/max not defined], which means hasNonNullValue=false but isAllNulls=true, which is what we are looking for. This happens each time we run
testBooleanPredicate()
@arina-ielchiieva isAllNulls would produce incorrect values if getNumNulls is undefined, which may happen whatever the return value of hasNonNullValue (see the stats of parquet file 0_0_3.parquet). Would it be possible that all the stats (num_nulls, min and max) are undefined for a file entering ParquetIsPredicate?

Member

@jbimbert to answer this question you can analyze org.apache.parquet.column.statistics.Statistics and see what can or cannot be done.

Member

@jbimbert @arina-ielchiieva OK, I miscounted parentheses in the modified expression.
@jbimbert What is the reason you changed the order of isAllNulls() and getMin()? Unless there is a reason, please keep the original order.

What unit test covers the following cases: (min:false, max:false, all nulls:false) and (min:false, max:true, all nulls:false)? What is the expected output in the first case and the second case?

Contributor Author

@jbimbert jbimbert Jul 16, 2018

@vrozov Yes, to have the same code in all lines. Easier to maintain.
Maybe you want java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_2.parquet with ST:[min: false, max: false, num_nulls: 0]
and java-exec/src/test/resources/parquetFilterPush/tfTbl/tt1.parquet with ST:[min: false, max: true, num_nulls: 0]
which are tested in testBooleanPredicate()

Member

@jbimbert

  • The original code was also the same, no need to change the order.
  • Please check the use cases I provided.

return RowsMatch.SOME;
}
}
if (hasNoNulls(exprStat)) {
Member

@arina-ielchiieva arina-ielchiieva Jul 16, 2018

@jbimbert, @vrozov
https://issues.apache.org/jira/browse/DRILL-6603 use case shows that statistics can be defined in the following way:
stat.isEmpty() returns false
stat.getNumNulls() returns 0
stat.genericGetMin() returns null
stat.genericGetMax() returns null
stat.toString returns num_nulls: 0, min/max not defined
In this case we mistakenly decide that such a row group does not contain null values and filter out the row group, but in reality the row group contains nulls.
Statistics toString method is defined as follows:

  @Override
  public String toString() {
    if (this.hasNonNullValue()) {
      if (isNumNullsSet()) {
        return String.format("min: %s, max: %s, num_nulls: %d", minAsString(), maxAsString(), this.getNumNulls());
      } else {
        return String.format("min: %s, max: %s, num_nulls not defined", minAsString(), maxAsString());
      }
    } else if (!this.isEmpty())
      return String.format("num_nulls: %d, min/max not defined", this.getNumNulls());
    else
      return "no stats for this column";
  }

As you can see, the toString method also uses stat.hasNonNullValue() to show whether the stat is defined.
In order to correctly apply is null we need to use stat.hasNonNullValue().

Contributor Author

In testBooleanPredicate, file java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_3.parquet which has ST:[num_nulls: 3, min/max not defined] and "row group 1: RC:3 TS:27 OFFSET:4", we get the correct value isAllNulls=true, because we compare getNumNulls with rowCount, which are both 3 in this case.

Member

@jbimbert as mentioned in https://issues.apache.org/jira/browse/DRILL-6603, @kkhatua tried your branch on the failed test and the test was still failing because the stat.hasNonNullValue() check is missing.

Member

Contributor Author

@arina-ielchiieva thanks for the file. Running it, it turns out that the parquet files have ST:[no stats for this column], which is interpreted as "num_nulls=0, min/max undefined" in RangeExprEvaluator.visitTypedFieldExpr:
if (columnStatistics != null) {
return columnStatistics.getStatistics();
}
...
This leads to the issue, since num_nulls is not defined in this case by the parquet row group stats, but it is certainly not 0 (in this case, num_nulls > 0).

Member

@arina-ielchiieva I do not suggest that Drill should not support filter push down for parquet files created by any prior version. My suggestion is to see (debug) where num_nulls is not set correctly. In any case, it is not valid to rely on hasNonNullValue(), as hasNonNullValue() may be false or true while num_nulls is 0 in both cases.

Note that hasNoNulls() is the inversion of hasNulls() condition, but the exact meaning is that either number of nulls is 0 or number of nulls is not set.
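That ambiguity can be made concrete with a small sketch. The numNulls encoding follows Parquet's convention of -1 for "not set"; the helper names are illustrative, not Drill's actual API.

```java
public class NullCountDemo {

    // Mirrors a hasNulls()-style check: true only when the null count is
    // known and positive.
    public static boolean hasNulls(long numNulls) {
        return numNulls > 0;
    }

    // The inversion discussed above: true for both "known to be 0" (safe to
    // rely on) and "not set" (-1, which must NOT be treated as "no nulls").
    public static boolean hasNoNulls(long numNulls) {
        return !hasNulls(numNulls);
    }

    public static void main(String[] args) {
        System.out.println(hasNoNulls(0));  // true: genuinely no nulls
        System.out.println(hasNoNulls(-1)); // true: null count unknown!
        System.out.println(hasNoNulls(3));  // false
    }
}
```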

Member

@vrozov it turns out I was misled by the assumption that the parquet files had an incorrectly defined number of nulls. It turned out that it was actually Drill that read the parquet stats and then re-created the stats objects with the number of nulls set to 0 instead of the original -1. I'll prepare a separate PR with the fix for it.
@jbimbert sorry for the confusion and thank you for your patience; please remove the stat.hasNonNullValue() checks where you have added them.

Contributor Author

@jbimbert jbimbert Jul 17, 2018

@arina-ielchiieva If I remove hasNonNullValue from createIsNullPredicate, the test mentioned in https://issues.apache.org/jira/browse/DRILL-6603 fails.
I can remove it in ParquetComparisonPredicate.checkNull(), but this function is used a lot...
I can also remove it from ParquetIsPredicate.checkNull(), with the same comment.
For safety, until the bug is fixed, maybe we had better keep it as is?
(In the other places, it is used to ensure getMin/getMax is valid)

Member

I am working on a PR to fix the test failure mentioned in DRILL-6603.
Please do not remove stat.hasNonNullValue() where it originally existed. Just remove it where you added the check when I asked you yesterday.

Contributor Author

Done. hasNonNullValue() removed in both checkNull() and createIsNullPredicate.

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if min value is not false or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat)
Member

@jbimbert

  • The original code had the same order; there is no need to change it.
  • Please check the use cases I provided.

return true;
return RowsMatch.NONE;
}
if (leftStat.hasNonNullValue() && !leftStat.hasNonNullValue() || rightStat.hasNonNullValue() && !rightStat.hasNonNullValue()) {
Member

When does the condition evaluate to true?

Contributor Author

Yes, I noticed, but the commit was already pushed and I had to wait for the tests (45 min).
Corrected in the new push.

Member

Please do your own code review and address all review concerns and let me know when the PR is ready for the code review.

@jbimbert jbimbert force-pushed the DRILL-5796 branch 2 times, most recently from fd2dc65 to 8952c1a Compare July 17, 2018 13:32
@arina-ielchiieva
Member

@vrozov what status of this PR?

@vrozov
Member

vrozov commented Jul 19, 2018

I am waiting for @jbimbert to address review comments and notify that the PR is ready for the review.

@jbimbert
Contributor Author

@vrozov Hello, please review this PR

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if max value is not true or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat)
Member

Please make sure to keep the original order of expressions unless it needs to be changed for a reason. If there is a reason to change it here or in any other places, please provide an explanation. I don't see why it needs to be changed.

Contributor Author

OK. Reversed.

return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
//if min value is not false or if there are all nulls -> canDrop
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() ? RowsMatch.NONE : checkNull(exprStat)
Member

Please see my comment

What unit test covers the following cases: (min:false, max:false, all nulls:false) and (min:false, max:true, all nulls:false)? What is the expected output in the first case and the second case?
I still don't understand how ALL can be returned here without checking max to be false in a case when number of nulls is 0.

Contributor Author

blnTbl/0_0_1.parquet => ST:[min: false, max: false, num_nulls: 0] : 8 tests in testBooleanPredicate()
tfTbl/ft0.parquet => ST:[min: false, max: true, num_nulls: 0] : 4 tests in testBooleanPredicate

example1:
select * from java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_1.parquet where col_bln is false returns (false, false, false)

example2:
select * from java-exec/src/test/resources/parquetFilterPush/tfTbl/ft0.parquet where a is true [resp. false] returns true [resp. false]

Finally, when running the query
select * from dfs.tmp.blnTbl where col_bln is false
with blnTbl contains only 0_0_0.parquet (T,T,T) and 0_0_1.parquet (F,F,F)
the physical plan reads:

  00-00 Screen : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.3 rows, 12.3 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 523
  00-01 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.0 rows, 12.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 522
  00-02 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {6.0 rows, 9.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 521
  00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/blnTbl/0_0_1.parquet]], selectionRoot=file:/tmp/blnTbl, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[**]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY col_bln): rowcount = 3.0, cumulative cost = {3.0 rows, 6.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 520
No more filter since it returns NONE for 0_0_0.parquet and ALL for 0_0_1.parquet.
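The pruning idea behind this plan can be sketched as follows (a simplified illustration, not Drill's actual AbstractParquetGroupScan/ParquetPushDownFilter code; the prune method and filterNeeded flag are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;

public class PruneSketch {
    public enum RowsMatch { NONE, SOME, ALL }

    /**
     * Keeps only row groups that may contain matching rows and reports
     * whether the Filter operator is still needed above the scan:
     * filterNeeded[0] is false only when every kept row group matched ALL.
     */
    public static List<RowsMatch> prune(List<RowsMatch> rowGroups, boolean[] filterNeeded) {
        List<RowsMatch> kept = new ArrayList<>();
        boolean allMatch = true;
        for (RowsMatch m : rowGroups) {
            if (m == RowsMatch.NONE) {
                continue; // no row in this group can match: drop the group
            }
            kept.add(m);
            if (m != RowsMatch.ALL) {
                allMatch = false; // at least one SOME: filter must still run
            }
        }
        filterNeeded[0] = !allMatch;
        return kept;
    }
}
```

For the blnTbl example, 0_0_0.parquet evaluates to NONE and 0_0_1.parquet to ALL, so only one row group remains and the filter disappears from the plan.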

Member

I don't understand how it works for the use cases you provided, but I would expect the following unit test to pass while it actually fails:

  public void testSOME() {
    LogicalExpression le = mock(LogicalExpression.class);
    BooleanStatistics booleanStatistics = mock(BooleanStatistics.class);
    doReturn(booleanStatistics).when(le).accept(ArgumentMatchers.any(), ArgumentMatchers.any());
    RangeExprEvaluator<Boolean> re = mock(RangeExprEvaluator.class);
    when(re.getRowCount()).thenReturn(Long.valueOf(2)); // 2 rows
    when(booleanStatistics.isEmpty()).thenReturn(FALSE); // stat is not empty
    when(booleanStatistics.isNumNullsSet()).thenReturn(TRUE); // num_nulls set
    when(booleanStatistics.getNumNulls()).thenReturn(Long.valueOf(0)); // no nulls
    when(booleanStatistics.hasNonNullValue()).thenReturn(TRUE); // min/max set
    when(booleanStatistics.getMin()).thenReturn(Boolean.FALSE); // min false
    when(booleanStatistics.getMax()).thenReturn(TRUE); // max true

    assertEquals(SOME, ((ParquetIsPredicate<Boolean>) createIsPredicate(IS_TRUE, le)).matches(re));
    assertEquals(SOME, ((ParquetIsPredicate<Boolean>) createIsPredicate(IS_FALSE, le)).matches(re));
  }
java.lang.AssertionError: 
Expected :SOME
Actual   :ALL

I would prefer to see similar unit tests and not the integration tests that you point me to, which may succeed for a different reason (I'd like to see an explanation of why they succeed as well).

Contributor Author

OK, I found the reason why the tests pass:

  1. We need several parquet files, else the process is short-circuited in AbstractParquetGroupScan.applyFilter.
  2. We need some of the parquet files to be dropped again in AbstractParquetGroupScan.applyFilter:
    if (qualifiedRGs.size() == rowGroupInfos.size() ) { return null } ...
  3. If at least one of the row groups is SOME, then the filter is applied to all, in ParquetPushDownFilter.doOnMatch L 179.

These 3 conditions together make the tests pass.
One way to check that they fail is to put ft0.parquet, ft0.parquet and tt1.parquet in the same folder and run an IS TRUE predicate: the result then reads F, T, F, T (wrong) instead of F, F (expected)!

I have then written the IS TRUE, IS FALSE, IS NOT TRUE and IS NOT FALSE predicates based on the cases:
a. ST:[min: true, max: true, num_nulls: ?]
b. ST:[min: false, max: false, num_nulls: ?]
c. ST:[min: false, max: true, num_nulls: ?]
d. num_nulls = RC (row count)
and checked all of these cases.

I also introduced 4 helper functions for code readability: minIsTrue, minIsFalse, maxIsTrue and maxIsFalse

Member

Please add unit testing. As you can see, integration tests may result in false positives.

Contributor Author

Done. Added 12 unit tests for the cases:
a. ST:[min: true, max: true, num_nulls: 0]
b. ST:[min: false, max: false, num_nulls: 0]
c. ST:[min: false, max: true, num_nulls: 0]

@jbimbert jbimbert force-pushed the DRILL-5796 branch 2 times, most recently from 40f3495 to 98f1a89 Compare July 23, 2018 08:51
/**
* Return true if exprStat.getMin is defined and true
*/
private static Boolean minIsTrue(Statistics exprStat) { return exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin(); }
Member

why Boolean?

Contributor Author

Functions removed because of the next comment.

isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
);
return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> {
if (isAllNulls(exprStat, evaluator.getRowCount()) || maxIsFalse(exprStat)) {
Member

Both maxIsFalse and minIsFalse check for hasNonNullValue. Consider re-writing with a single check for hasNonNullValue:

  • if all nulls, return NONE
  • if min/max undefined, return SOME
  • return NONE/SOME/ALL based on min/max.
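A sketch of that suggested structure for the IS FALSE predicate (RowsMatch and the NONE/SOME/ALL outcomes follow the PR's naming; BoolStats and isFalseMatches below are simplified, hypothetical stand-ins for Parquet's BooleanStatistics and Drill's predicate lambda, not the real classes):

```java
public class IsFalseSketch {
    public enum RowsMatch { NONE, SOME, ALL }

    /** Simplified stand-in for BooleanStatistics plus the row count. */
    public static class BoolStats {
        final boolean hasNonNullValue; // are min/max defined?
        final boolean min, max;
        final long numNulls, rowCount;

        public BoolStats(boolean hasNonNullValue, boolean min, boolean max,
                         long numNulls, long rowCount) {
            this.hasNonNullValue = hasNonNullValue;
            this.min = min;
            this.max = max;
            this.numNulls = numNulls;
            this.rowCount = rowCount;
        }
    }

    public static RowsMatch isFalseMatches(BoolStats s) {
        if (s.numNulls == s.rowCount) {
            return RowsMatch.NONE;     // all nulls: IS FALSE matches nothing
        }
        if (!s.hasNonNullValue) {
            return RowsMatch.SOME;     // min/max undefined: cannot decide
        }
        if (s.min) {
            return RowsMatch.NONE;     // min is true: every value is true
        }
        if (!s.max && s.numNulls == 0) {
            return RowsMatch.ALL;      // every value is false and non-null
        }
        return RowsMatch.SOME;         // mixed values, or some nulls present
    }
}
```

With a single hasNonNullValue check up front, the (min: false, max: true, num_nulls: 0) case from the mocked unit test above yields SOME rather than ALL.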

Contributor Author

Done

call.transformTo(newFilter);
}

final RelNode newFilter = filter.copy(filter.getTraitSet(), ImmutableList.<RelNode>of(newScan));
Member

How it works in case newGroupScan is not an instance of AbstractParquetGroupScan? Will not filter.copy be called twice?

Contributor Author

Exactly. Removed the duplicate lines.

…lter result information, used to prune the filter if all rows match.
@jbimbert
Contributor Author

@vrozov Hello, please review this PR. Best regards

@vrozov
Member

vrozov commented Jul 27, 2018

@jbimbert Please fix merge/rebase issue.

@jbimbert
Contributor Author

@vrozov Done.

Member

@vrozov vrozov left a comment

ready to commit

@ilooner ilooner closed this in efd6d29 Aug 1, 2018
mattpollack pushed a commit to mattpollack/drill that referenced this pull request Feb 25, 2019
…lter result information, used to prune the filter if all rows match.

closes apache#1298