Fix wrong results in MetadataQueryOptimizer with subfield filters #18704
arunthirupathi
left a comment
I assume we previously handled only the non-subfield predicates and ignored the subfield predicates.
The current code correctly ANDs the subfield predicates into the remaining predicate.
Is my understanding right?
Can you please add a comment here on why this code is required?
Yes, exactly. https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java#L2851 is where we add all the non-subfield equality functions to the predicate.
I thought there was a separate flag for subfield pushdown. Do we need to check that here? Also, if this happens after planning/optimization, shouldn't this info already be available here?
There are two different optimizations:
- pushing down filters - this includes filters on subfields
- subfield pushdown - this means telling the reader to prune out all the fields from the struct that don't get used by the query
The relevant optimization here is filter pushdown. I'm actually not sure why we check that filter pushdown is enabled here, though. If there were no filter pushdown, there shouldn't be any filters in this field, and if something else put filters on this field, we would want to know about them too. It seems to me that filter pushdown should just create the HiveTableLayoutHandle that it wants, and nobody else should need to know how it got to that state. If that's not how it works, there's something off about the abstraction.
However, as I don't understand why the check is there in the first place, I don't feel confident removing it without more testing. Since this is just a targeted bug fix, I kept the current logic.
My question was more: if that's turned off, will we still get this fix? Maybe just add a test that turns off subfield pushdown as well and make sure the fix still works. Also, those two are independent, so let's make sure this fix works even if Aria scan is turned off.
@kaikalur updated the test to include all permutations of these two properties.
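For readers following along, here is a minimal sketch of how all permutations of the two boolean properties could be generated. The provider name `optimize_metadata_filter_properties` comes from the test signature in this PR, but the body of the actual provider is not shown in this thread, so the class and method below are assumptions for illustration only. The method returns an `Object[][]`, which is the shape a TestNG `@DataProvider` method returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the code from this PR.
final class FilterPropertyPermutations
{
    private FilterPropertyPermutations() {}

    // Builds every {pushdownSubfieldsEnabled, pushdownFilterEnabled} pair.
    // In a TestNG test class, this body would sit in a method annotated with
    // @DataProvider(name = "optimize_metadata_filter_properties").
    static Object[][] optimizeMetadataFilterProperties()
    {
        List<Object[]> rows = new ArrayList<>();
        for (boolean pushdownSubfieldsEnabled : new boolean[] {true, false}) {
            for (boolean pushdownFilterEnabled : new boolean[] {true, false}) {
                rows.add(new Object[] {pushdownSubfieldsEnabled, pushdownFilterEnabled});
            }
        }
        return rows.toArray(new Object[0][]);
    }

    public static void main(String[] args)
    {
        // Print each combination that the test would be invoked with.
        for (Object[] row : optimizeMetadataFilterProperties()) {
            System.out.println("pushdownSubfieldsEnabled=" + row[0] + ", pushdownFilterEnabled=" + row[1]);
        }
    }
}
```

With two independent flags this yields four rows, so the test runs once per combination, including the case where both are turned off.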
@bot kick off tests
Previously, queries with aggregations on partition columns and equality filters on row subfields that had optimize_metadata_queries enabled could return wrong results.
The MetadataQueryOptimizer rewrite should not take effect if there is a filter on a non-partition column that was pushed inside the scan. The way we check for these filters is by looking at the predicate and remainingPredicate in ConnectorTableLayout.
When converting a HiveTableLayoutHandle to ConnectorTableLayout, equality filters on row subfields were being left out. As a result, we were incorrectly applying the MetadataQueryOptimizer rewrite when there was an equality filter on a row subfield.
An example affected query shape is:
SELECT max(partition_column)
FROM my_table
WHERE struct.subfield IS NOT NULL
@Test(dataProvider = "optimize_metadata_filter_properties")
public void testMetadataAggregationFoldingWithFilters(boolean pushdownSubfieldsEnabled, boolean pushdownFilterEnabled)
{
Nice!
kaikalur
left a comment
LGTM
Queries with equality filters on row subfields that had optimize_metadata_queries enabled could return wrong results.
The MetadataQueryOptimizer rewrite should not take effect if there is a filter on a non-partition column that was pushed inside the scan. The way we check for these filters is by looking at the predicate and remainingPredicate in ConnectorTableLayout.
When converting a HiveTableLayoutHandle to ConnectorTableLayout, equality filters on row subfields were being left out. As a result, we were incorrectly applying the MetadataQueryOptimizer rewrite when there was an equality filter on a row subfield.
An example affected query shape is:
SELECT max(partition_column)
FROM my_table
WHERE struct.subfield IS NOT NULL
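The failure mode described above can be modeled with a small, self-contained sketch. This is not the Presto code: `Layout`, `toLayout`, and `canRewrite` are hypothetical stand-ins that track only which columns the `predicate` and `remainingPredicate` constrain. The `includeSubfieldFilters` flag is likewise an assumption, used here to contrast the behavior before and after the fix.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified model of the guard; all names here are hypothetical.
final class MetadataRewriteGuard
{
    private MetadataRewriteGuard() {}

    // Stand-in for the filter information exposed by a table layout.
    static final class Layout
    {
        final Set<String> predicateColumns;          // columns constrained by `predicate`
        final Set<String> remainingPredicateColumns; // columns referenced by `remainingPredicate`

        Layout(Set<String> predicateColumns, Set<String> remainingPredicateColumns)
        {
            this.predicateColumns = predicateColumns;
            this.remainingPredicateColumns = remainingPredicateColumns;
        }
    }

    // Models the handle-to-layout conversion. With includeSubfieldFilters = false,
    // subfield filters are silently dropped, which is the bug described above.
    static Layout toLayout(Set<String> columnFilters, Set<String> subfieldFilters, boolean includeSubfieldFilters)
    {
        Set<String> remaining = new HashSet<>();
        if (includeSubfieldFilters) {
            for (String subfield : subfieldFilters) {
                // Attribute "struct.subfield" to its top-level column "struct".
                int dot = subfield.indexOf('.');
                remaining.add(dot < 0 ? subfield : subfield.substring(0, dot));
            }
        }
        return new Layout(new HashSet<>(columnFilters), remaining);
    }

    // The metadata rewrite is only safe when every filter is on a partition column.
    static boolean canRewrite(Layout layout, Set<String> partitionColumns)
    {
        return partitionColumns.containsAll(layout.predicateColumns)
                && partitionColumns.containsAll(layout.remainingPredicateColumns);
    }

    public static void main(String[] args)
    {
        Set<String> partitionColumns = Collections.singleton("partition_column");
        Set<String> noColumnFilters = Collections.emptySet();
        Set<String> subfieldFilters = Collections.singleton("struct.subfield");

        // Subfield filter dropped during conversion: the rewrite is wrongly allowed.
        System.out.println(canRewrite(toLayout(noColumnFilters, subfieldFilters, false), partitionColumns)); // true
        // Subfield filter carried into the remaining predicate: the rewrite is blocked.
        System.out.println(canRewrite(toLayout(noColumnFilters, subfieldFilters, true), partitionColumns)); // false
    }
}
```

In this model, the guard itself never changes. What changes is whether the layout it inspects still carries the subfield filter, which matches the description that the fix lies in the conversion rather than in the optimizer's check.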
Test plan - new unit test