-
Notifications
You must be signed in to change notification settings - Fork 986
DRILL-5796: Filter pruning for multi rowgroup parquet file #1298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@arina-ielchiieva, @vdiravka could one of you guys please review this? |
| /** | ||
| * Comparison predicates for parquet filter pushdown. | ||
| */ | ||
| public class ParquetComparisonPredicates { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is too much code duplicated in the original implementation and it needs to be refactored. Please see PR #1259.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello VRozov,
you're right. I though less code modification was better (my preceding team was afraid of code refactoring...). I have refactored the class.
best regards
| testParquetFilterPushDown("t.`user`.hobby_ids[0] <> 1", 8, 6); | ||
| testParquetFilterPushDown("t.`user`.hobby_ids[2] > 20", 5, 3); | ||
| testParquetFilterPushDown("t.`user`.hobby_ids[0] between 10 and 20", 5, 4); | ||
| testParquetFilterPushDown("t.`user`.hobby_ids[0] <> 1", 8, 7); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the expected output was changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Arina,
after last refactoring, the output are not changed anymore. Thank you for noticing.
best regards
| return null; | ||
| } | ||
| logger.debug("materializedFilter : {}", ExpressionStringBuilder.toString(materializedFilter)); | ||
| final ColumnExplorer columnExplorer = new ColumnExplorer(optionManager, columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert change indention change. Should be 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| for (LogicalExpression child : this) { | ||
| if (child instanceof ParquetFilterPredicate && ((ParquetFilterPredicate) child).canDrop(evaluator)) { | ||
| return true; | ||
| m = ((ParquetFilterPredicate) child).matches(evaluator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It would be better if variables
m,tempwould have meaning names, please consider the same in other places. - I believe it's better to leaving
instanceofcheck. - Please consider re-writing below if statements to be more simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| protected final LogicalExpression right; | ||
|
|
||
| protected Statistics leftStat, rightStat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like in Drill we prefer to introduce each variable from the new line. Please consider applying the same in your code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| return true; | ||
| } | ||
|
|
||
| protected ROWS_MATCH matchesCond() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use full name for condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * <li>MISLEAD : filter can not be applied | ||
| * </ul> | ||
| */ | ||
| enum ROWS_MATCH {ALL, NONE, SOME, MISLEAD} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming mislead to not applicable or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
| public void setColumns(List<? extends ColumnMetadata> columns) { this.columns = columns; } | ||
|
|
||
| public ParquetFilterPredicate.ROWS_MATCH getRowValid() { return rowValid; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider renaming the setter and setter method to more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| // multirowgroup2 is a parquet file with 3 rowgroups inside. One with a=0, another with a=1 and a=2, and the last with a=3; | ||
| // FilterPushDown should be able to prune the filter from the scan operator according to the rowgroup statistics. | ||
| final String sql = "select * from dfs.`parquet/multirowgroup2.parquet` where "; | ||
| PlanTestBase.testPlanMatchingPatterns(sql + "a > 1", new String[]{"numRowGroups=2"}, new String[]{}); //No filter pruning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can pass null instead of new String[]{} or consider using overloaded method without excluded pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| final RelNode newFilter = filter.copy(filter.getTraitSet(), ImmutableList.<RelNode>of(newScan)); | ||
| call.transformTo(newFilter); | ||
|
|
||
| List<RowGroupInfo> rowGroupInfos = ((AbstractParquetGroupScan)newGroupScan).rowGroupInfos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please add space after
(AbstractParquetGroupScan). - Please try to get rid of explicit casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * <li>UNAPPLICABLE : filter can not be applied | ||
| * </ul> | ||
| */ | ||
| enum ROWS_MATCH {ALL, NONE, SOME, UNAPPLICABLE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inapplicable :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| for (LogicalExpression child : this) { | ||
| if (child instanceof ParquetFilterPredicate && ((ParquetFilterPredicate) child).canDrop(evaluator)) { | ||
| return true; | ||
| childMatch = ((ParquetFilterPredicate) child).matches(evaluator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check of instanceof as was done before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Even if it can not be something else...
| // "long" : as long as one branch is NOT ok to drop, we can NOT drop it. | ||
| if (! ((ParquetFilterPredicate) child).canDrop(evaluator)) { | ||
| return false; | ||
| childMatch = ((ParquetFilterPredicate) child).matches(evaluator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check for instanceof as was done before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Even if it was not there before...
| * </ul> | ||
| */ | ||
| enum ROWS_MATCH {ALL, NONE, SOME, INAPPLICABLE | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move to the previous line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
@jbimbert overall looks good, except of several minor comments. Please address them and squash the commits into one. Also I have run Functional and Advanced tests for your branch and there is one test failure: Data - https://github.com/mapr/drill-test-framework/tree/master/framework/resources/Datasources/parquet_storage/DRILL_6259/DRILL_6259_test_data Please fix it. |
|
@agozhiy could you please confirm that expected result is correct? Or maybe we should expect two rows? |
|
@arina-ielchiieva, the d3 row shouldn't have been returned by the query, as top_level_array[2] value for it is null. |
|
Thanks, Anton. |
|
Tests DRILL_6259 are now OK. |
|
|
I don't see any reasons for creating new PR. Just update this one, do force push and keep one commit. |
85c4cf3 to
d50fd56
Compare
|
Hello, |
| * @param fields columns used in the filter. | ||
| * @return true if one at least is an array, false otherwise. | ||
| */ | ||
| public boolean isRepeated(Set<SchemaPath> fields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this come from? I did not do such changes in DRILL-6259.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this function to solve the issue of DRILL_6259_test_data and comply with your comment : "Please add comment in that place in code where you fixed issue with array, explaining this case. Since you did not add fix in a separate commit, it's hard to tell where you made the change"
| continue; | ||
| } | ||
| // DRILL_6259 : in case of array, we can't know if the real value of a repeated column exists until it is really filtered, hence we can't prune the filter | ||
| if (rowGroup.getRowsMatch() == ROWS_MATCH.ALL && statCollector.isRepeated(schemaPathsInExpr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
statCollector.isRepeated(schemaPathsInExpr)) -> I am not sure this is right place to do the check. Previously all was done during predicate evaluation, I think moving this here is quite incorrect. Please re-consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| */ | ||
| public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator | ||
| implements ParquetFilterPredicate<C> { | ||
| public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator implements ParquetFilterPredicate<C> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid format only changes unless something is wrong with the way how the code is formatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agreed.
I just applled the intellij-idea-settings.jar which is provided by DRILL team.
As you may notice, I didn't apply it to the other files that are modified in this PR, to avoid having too many changes which are not meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That intellij-idea-settings.jar should be applied to the part of the code that is modified, not to the entire file or other files. Please revert back all format only changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| super(left.getPosition()); | ||
| this.left = left; | ||
| this.right = right; | ||
| this.predicate = predicate; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is/was wrong with the functional approach? Why is it necessary to use inheritance when delegation should work just fine here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BiPredicate is perfect in case of boolean, which is not the case here. I return an enum which has no negate() for example...
Of course, I could create a new functional interface specifically for this enum, with only one function : test(), but it would only make the code more complicate.
I agree with you on this point, delegation is better than inheritance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to create a new functional interface. Use BiFunction (optionally define RawMatchFunction<C> as a concrete specification of the BiFunction<Statistics<C>, Statistics<C>, ROWS_MATCH>)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| */ | ||
| public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator | ||
| implements ParquetFilterPredicate<C> { | ||
| public abstract class ParquetBooleanPredicate<C extends Comparable<C>> extends BooleanOperator implements ParquetFilterPredicate<C> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That intellij-idea-settings.jar should be applied to the part of the code that is modified, not to the entire file or other files. Please revert back all format only changes.
| super(left.getPosition()); | ||
| this.left = left; | ||
| this.right = right; | ||
| this.predicate = predicate; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to create a new functional interface. Use BiFunction (optionally define RawMatchFunction<C> as a concrete specification of the BiFunction<Statistics<C>, Statistics<C>, ROWS_MATCH>)
| case INT: | ||
| return getStatistics( ((IntHolder)minFuncHolder).value, ((IntHolder)maxFuncHolder).value); | ||
| Statistics si = getStatistics( ((IntHolder)minFuncHolder).value, ((IntHolder)maxFuncHolder).value); | ||
| si.setNumNulls(input.getNumNulls()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move si.setNumNulls() outside of the switch block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before switch: Statistics<?> statistics;;
in each case except for default: statistics = getStatistics(...);
after switch statistics.setNumNulls(); return statistics;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * <li>ALL : all rows match the filter (can not drop the row group and prune filter) | ||
| * <li>NONE : no row matches the filter (can drop the row group) | ||
| * <li>SOME : some rows only match the filter (can not drop the row group not the filter) | ||
| * <li>INAPPLICABLE : filter can not be applied |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose I have set/group of rows and a filter. Applying the filter to the set of rows may result in an empty set (NONE), partial set (SOME) and the same set as original one (ALL). What does INAPPLICABLE mean? Does it mean the result is unknown? But does it really matter whether it is partial or unknown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, which we discussed here as well.
The answer is, it does not matter right now. It would matter if we want to stat the filter accurately for example, or if we want to have a sample of results only (faster but inaccurate).
Suppressing INAPPLICABLE is possible, but we'd loose some information of course. keeping it is a plus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point me to the prior discussion? I don't see what information would be lost if INAPPLICABLE is merged with SOME and SOME means that a result of applying a filter to a set of rows is unknown (empty, all or partial). @arina-ielchiieva what is your take?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Vlad, I don't see any value of keeping inapplicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged INAPPLICABLE into SOME.
Done
| * <li>INAPPLICABLE : filter can not be applied | ||
| * </ul> | ||
| */ | ||
| enum ROWS_MATCH {ALL, NONE, SOME, INAPPLICABLE} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is more common for Drill to use all capital formatting only for enum values, the enum itself uses Camel notation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> | ||
| //if min value is not false or if there are all nulls -> canDrop | ||
| isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() | ||
| exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what condition hasNonNullValue() && isAllNulls() will be true? What unit test covers this use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasNonNullValue = true if min and max exist
isAllNulls = true if all rows are null values
testBooleanPredicate with File 0_0_3.parquet (contains only 3 null values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If all rows are null, what are the values for min and max, should not
hasNonNullValuebe false? - Please point me to the specific test that validates that condition.
- I would prefer to see a unit test, not an integration test. For this particular case, the integration test validates results of a query, but it does not validate what is the result of
((ParquetFilterPredicate<Boolean>)createIsFalsePredicate(expr)).canDrop(evaluator)is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov this is valid case, the same issue if described in DRILL-6603, we missed hasNonNullValue check for is null predicate and filtered out row group that we shouldn't.
@jbimbert isAllNulls is used more often in your code then in previous version. In previous version it was used only three times and each time had hasNonNullValue check. I am afraid in your code checks without hasNonNullValue may produce incorrect results. I am thinking what if we modify isAllNulls method to include hasNonNullValue check inside?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_3.parquet gives ST:[num_nulls: 3, min/max not defined] whihc means hasNonNullValue=false but isAllNulls=true, which is what we are looking for. This happens each time we run
testBooleanPredicate()
@arina-ielchiieva isallNulls would produce incorrect values if getNumNulls is undefined, which may happen whatever the return value of hasNonNullValue (see stats of parquet file 0_0_3.parquet). Would it be possible that all the stats (num_nulls, min and max) are not defined for ParquetIsPredicate entering file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbimbert to answer this question you can analyze org.apache.parquet.column.statistics.Statistics and see what can or cannot be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbimbert @arina-ielchiieva OK, I miscounted parentheses in the modified expression.
@jbimbert What is the reason you changed the order of isAllNulls() and getMin()? Unless there is a reason, please keep the original order.
What unit test covers the following cases: (min:false, max:false, all nulls:false) and (min:false, max:true, all nulls:false)? What is the expected output in the first case and the second case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov Yes, to have the same code in all lines. Easier to maintain.
Maybe you want java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_2.parquet with ST:[min: false, max: false, num_nulls: 0]
and java-exec/src/test/resources/parquetFilterPush/tfTbl/tt1.parquet with ST:[min: false, max: true, num_nulls: 0]
which are tested in testBooleanPredicate()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The original code was also the same, no need to change the order.
- Please check the use cases I provided.
| return RowsMatch.SOME; | ||
| } | ||
| } | ||
| if (hasNoNulls(exprStat)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbimbert, @vrozov
https://issues.apache.org/jira/browse/DRILL-6603 use case shows that statistics can be defined in the following way:
stat.isEmpty() returns false
stat.getNumNulls() returns 0
stat.genericGetMin() returns null
stat.genericGetMax() returns null
stat.toString returns num_nulls: 0, min/max not defined
In this case we mistakenly decide that such row group does not contain null values and filter out such rows group but in reality, row group contains nulls.
Statistics toString method is defined as follows:
@Override
public String toString() {
if (this.hasNonNullValue()) {
if (isNumNullsSet()) {
return String.format("min: %s, max: %s, num_nulls: %d", minAsString(), maxAsString(), this.getNumNulls());
} else {
return String.format("min: %s, max: %s, num_nulls not defined", minAsString(), maxAsString());
}
} else if (!this.isEmpty())
return String.format("num_nulls: %d, min/max not defined", this.getNumNulls());
else
return "no stats for this column";
}
As you can see toString method also uses stat.hasNonNullValue() to show if stat is defined.
In order to correctly apply is null we need to use stat.hasNonNullValue().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In testBooleanPredicate, file java-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_3.parquet which has ST:[num_nulls: 3, min/max not defined] and "row group 1: RC:3 TS:27 OFFSET:4", we get the correct value isAllNulls=true, because we compare getNumNulls with rowCount, which are both 3 in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbimbert as mentioned https://issues.apache.org/jira/browse/DRILL-6603 @kkhatua tried your branch on the failed test and test was still failing because stat.hasNonNullValue() is missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arina-ielchiieva thanks for the file. Running it, it happens that the parquet files have ST:[no stats for this column] and this is interpreted as "num_nulls=0, min/max undefined" in RangeExprEvaluator.visitTypedFieldExpr:
if (columnStatistics != null) {
return columnStatistics.getStatistics();
}
...
Which leads to the issue, since num_nulls is not defined in this case from the parquet rowgroup stats, but certainly not 0 (in this case, num_nulls > 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arina-ielchiieva I do not suggest that Drill does not support filter push down for parquet files created by any prior version. My suggestion is to see (debug) where num_nulls is not set correctly. In any case, it is not valid to rely on hasNonNullValue() as hasNonNullValue() may be false or true and num_nulls be 0 for both cases.
Note that hasNoNulls() is the inversion of hasNulls() condition, but the exact meaning is that either number of nulls is 0 or number of nulls is not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov it turns out, I was mislead by the assumption that parquet files had incorrectly defined num of nulls. It turned out that actually it was Drill that read parquet stats and then re-created stats objects with num of nulls set to 0 instead of original -1.I'll prepare separate PR with the fix for it.
@jbimbert sorry for the confusion and thank you for your patience, please remove stat.hasNonNullValue() checks where you have added them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arina-ielchiieva If I remove hasNonNullValue from createIsNullPredicate, the test mentioned in https://issues.apache.org/jira/browse/DRILL-6603 fails.
I can remove it in ParquetComparisonPredicate.checkNull(), but this function is used a lot...
I can also remove it from ParquetIsPredicate.checkNull(), with the same comment.
For safety, until the bug is fixed, we maybe better keep it as is ?
(In the other places, it is used to ensure getMin/getMax is valid)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am working on PR to fix test failure mentioned in DRILL-6603.
Please do not remove stat.hasNonNullValue() where it originally existed. Just remove where you have added the check when I asked you yesterday.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. hasNonNullValue(0 removed in both checkNull() and createIsNullPredicate.
| return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> | ||
| //if min value is not false or if there are all nulls -> canDrop | ||
| isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() | ||
| exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The original code was also the same, no need to change the order.
- Please check the use cases I provided.
| return true; | ||
| return RowsMatch.NONE; | ||
| } | ||
| if (leftStat.hasNonNullValue() && !leftStat.hasNonNullValue() || rightStat.hasNonNullValue() && !rightStat.hasNonNullValue()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does the condition evaluate to true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I noticed but the commit was already pushed and I had to wait for the tests (45 mn).
Corrected in the new push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do your own code review and address all review concerns and let me know when the PR is ready for the code review.
fd2dc65 to
8952c1a
Compare
|
@vrozov what status of this PR? |
|
I am waiting for @jbimbert to address review comments and notify that the PR is ready for the review. |
|
@vrozov Hello, please review this PR |
| return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> | ||
| //if max value is not true or if there are all nulls -> canDrop | ||
| isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax() | ||
| exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax() || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure to keep the original order of expressions unless it needs to be changed for a reason. If there is a reason to change it here or in any other places, please provide an explanation. I don't see why it needs to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Reversed.
| return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> | ||
| //if min value is not false or if there are all nulls -> canDrop | ||
| isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() | ||
| isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() ? RowsMatch.NONE : checkNull(exprStat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment
What unit test covers the following cases: (min:false, max:false, all nulls:false) and (min:false, max:true, all nulls:false)? What is the expected output in the first case and the second case?
I still don't understand howALLcan be returned here without checking max to befalsein a case when number of nulls is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blnTbl/0_0_1.parquet => ST:[min: false, max: false, num_nulls: 0] : 8 tests in testBooleanPredicate()
tfTbl/ft0.parquet => ST:[min: false, max: true, num_nulls: 0] : 4 tests in testBooleanPredicate
example1:
select * from ava-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_1.parquet where col_bln is false returns (false, false, false)
example2:
select * from java-exec/src/test/resources/parquetFilterPush/tfTbl/ft0.parquet where a is true[resp. false] return true[resp. false]
Finally, when running the query
select * from dfs.tmp.blnTbl where col_bln is false
with blnTbl contains only 0_0_0.parquet (T,T,T) and 0_0_1.parquet (F,F,F)
the physical plan reads:
00-00 Screen : rowType = RecordType(DYNAMIC_STAR ): rowcount = 3.0, cumulative cost = {9.3 rows, 12.3 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 523
00-01 Project(=[$0]) : rowType = RecordType(DYNAMIC_STAR ): rowcount = 3.0, cumulative cost = {9.0 rows, 12.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 522
00-02 Project(=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {6.0 rows, 9.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 521
00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/blnTbl/0_0_1.parquet]], selectionRoot=file:/tmp/blnTbl, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[**]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY col_bln): rowcount = 3.0, cumulative cost = {3.0 rows, 6.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 520
No more filter since it returns NONE for 0_0_0.parquet and ALL for 0_0_1.parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how it works for the use cases you provided, but I would expect the following unit test to pass while it actually fails:
public void testSOME() {
LogicalExpression le = mock(LogicalExpression.class);
BooleanStatistics booleanStatistics = mock(BooleanStatistics.class);
doReturn(booleanStatistics).when(le).accept(ArgumentMatchers.any(), ArgumentMatchers.any());
RangeExprEvaluator<Boolean> re = mock(RangeExprEvaluator.class);
when(re.getRowCount()).thenReturn(Long.valueOf(2)); // 2 rows
when(booleanStatistics.isEmpty()).thenReturn(FALSE); // stat is not empty
when(booleanStatistics.isNumNullsSet()).thenReturn(TRUE); // num_nulls set
when(booleanStatistics.getNumNulls()).thenReturn(Long.valueOf(0)); // no nulls
when(booleanStatistics.hasNonNullValue()).thenReturn(TRUE); // min/max set
when(booleanStatistics.getMin()).thenReturn(Boolean.FALSE); // min false
when(booleanStatistics.getMax()).thenReturn(TRUE); // max true
assertEquals(SOME, ((ParquetIsPredicate<Boolean>) createIsPredicate(IS_TRUE, le)).matches(re));
assertEquals(SOME, ((ParquetIsPredicate<Boolean>) createIsPredicate(IS_FALSE, le)).matches(re));
}
java.lang.AssertionError:
Expected :SOME
Actual :ALL
I would prefer to see similar unit tests and not the integration tests that you point me to and that may succeed for a different reason (I'd like to see an explanation why they succeed as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I found the reason why the tests pass :
- we need several parquet files ,else the process is squeezed in AbstractParquetGroupScan.applyFilter.
- We need that some parquet files are dropped again in AbstractParquetGroupScan.applyFilter
if (qualifiedRGs.size() == rowGroupInfos.size() ) { return null } ... - If one at least of the row groups is SOME, then the filter is applied to all, in ParquetPushDownFilter.doOnMatch L 179
These 3 conditions together make that the tests pass.
One way to check that they fail, is to put ft0.parquet, ft0.parquet and tt1.parquet in the same folder and run a IS TRUE predicate. the result then reads F, T, F, T (wrong) instead of F, F (expected) !
I have then written the IS TRUE, IS FALSE, IS NOT TRUE and IS NOT FALSE predicates based on the cases:
a. ST:[min: true, max: true, num_nulls: ?]
b. ST:[min: false, max: false, num_nulls: ?]
c. ST:[min: false, max: true, num_nulls: ?]
d. and num_nulls = RC ( row count)
And check all cases.
I also introduced 4 helper functions for code readability: minIsTrue, minIsFalse, maxIsTrue and maxIsFalse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add unit testing. As you can see, integration tests may result in false positive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done added 12 unit tests for cases
a. ST:[min: true, max: true, num_nulls: 0]
b. ST:[min: false, max: false, num_nulls: 0]
c. ST:[min: false, max: true, num_nulls: 0]
40f3495 to
98f1a89
Compare
| /** | ||
| * Return true if exprStat.getMin is defined and true | ||
| */ | ||
| private static Boolean minIsTrue(Statistics exprStat) { return exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why Boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions suppressed because of next comment
| isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax() | ||
| ); | ||
| return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> { | ||
| if (isAllNulls(exprStat, evaluator.getRowCount()) || maxIsFalse(exprStat)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both maxIsFalse and minIsFalse check for hasNonNullValue. Consider re-writing with a single check for hasNonNullValue:
- if all nulls, return NONE
- if min/max undefined, return SOME
- return NONE/SOME/ALL based on min/max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| call.transformTo(newFilter); | ||
| } | ||
|
|
||
| final RelNode newFilter = filter.copy(filter.getTraitSet(), ImmutableList.<RelNode>of(newScan)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How it works in case newGroupScan is not an instance of AbstractParquetGroupScan? Will not filter.copy be called twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exact. Suppressed duplicate lines
…lter result information, used to prune the filter if all rows match.
|
@vrozov Hello, please review this PR. Best regards |
|
@jbimbert Please fix merge/rebase issue. |
|
@vrozov Done. |
vrozov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready to commit
…lter result information, used to prune the filter if all rows match. closes apache#1298
In ParquetFilterPredicate, replaced canDrop with ROWS_MATCH enum to keep inside rowgroup the filter result information.
This information allows to prune the filter when all rows match.