Skip to content

Conversation

@rschlussel
Copy link
Contributor

@rschlussel rschlussel commented Nov 18, 2022

Queries with equality filters on row subfields that had optimize_metadata_queries could return wrong results.

The MetadataQueryOptimizer rewrite should not take effect if there is a filter on a non-partition column that was pushed inside the scan. The way we check for these filters is by looking at the predicate and remainingPredicate in ConnectorTableLayout

When converting a HiveTableLayoutHandle to ConnectorTableLayout, equality filters on row subfields were being left out. As a result, we were incorrectly applying the MetadataQueryOptimizer rewrite when there was an equality filter on a row subfiled.

An example affected query shape is:
SELECT max(partition_column)
FROM my_table
WHERE struct.subfield IS NOT NULL

Test plan - new unit test

== RELEASE NOTES ==


Hive Changes
* Fix a bug when ``optimize_metadata_queries`` was set to true where queries with aggregations on partition columns and filters on row subfields could return wrong results

@rschlussel rschlussel force-pushed the fix-metadata-query-optimizer branch from 784787b to 8e0876d Compare November 18, 2022 18:32
@rschlussel rschlussel marked this pull request as ready for review November 18, 2022 18:43
@rschlussel rschlussel requested a review from a team as a code owner November 18, 2022 18:43
Copy link

@arunthirupathi arunthirupathi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume, we were previously only working on non sub field predicate and ignored the sub field predicate.

the current code correctly ands the sub field predicate in the remaining predicate.

Is my understanding right ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment here on why this code is required ?

@rschlussel
Copy link
Contributor Author

I assume, we were previously only working on non sub field predicate and ignored the sub field predicate.

the current code correctly ands the sub field predicate in the remaining predicate.

Is my understanding right ?

yes, exactly. https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java#L2851 is where we add all the non-subfield equality functions to predicate.

@rschlussel rschlussel force-pushed the fix-metadata-query-optimizer branch 2 times, most recently from 997ba35 to 5dccb1f Compare November 18, 2022 19:55
@rschlussel rschlussel force-pushed the fix-metadata-query-optimizer branch from 5dccb1f to fd95ead Compare November 18, 2022 20:23
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there was a separate flag for subfield pushown. Do we need to check that here? Also, if it happens after the planning/optimization these info should be already available here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two different optimizations

  1. pushing down filters - this includes filters on subfields
  2. subfield pushdown - this means telling the reader to prune out all the fields from the struct that don't get used by the query

the relevant optimization here is filter pushdown. I'm actually not sure why we check that filter pushdown is enabled here, though. if there were no filter pushdown, there shouldn't be any filters in this field, and if something else put filters on this field we should want to know about them too. It seems to me that filter pushdown should just create the HiveTableLayoutHandle that it wants and nobody else should need to know how it got to that state. If that's not how it works, there's something off about the abstraction.

However, as I don't understand why the check is there in the first place, I don't feel confident removing it without more testing. Since this is just a targeted bug fix, I kept the current logic.

Copy link
Contributor

@kaikalur kaikalur Nov 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question was more like if that's turned off will we still get this fix? Maybe just add a test by turning off subfield pushdown as well and make sure the fix still works with that. Also those two are independent so let's make sure this fix works even if aria scan is turned off.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaikalur updated the test to include all permutations of these two properties.

@v-jizhang
Copy link
Contributor

@bot kick off tests

Previously, queries with aggregations on partition columns and equality
filters on row subfields that had optimize_metadata_queries could return wrong
results.

The MetadataQueryOptimizer rewrite should not take effect if there is a
filter on a non-partition column that was pushed inside the scan.  The
way we check for these filters is by looking at the predicate and
remainingPredicate in ConnectorTableLayout

When converting a HiveTableLayoutHandle to ConnectorTableLayout, equality
filters on row subfields were being left out. As a result, we were incorrectly
applying the MetadataQueryOptimizer rewrite when there was an equality
filter on a row subfield.

An example affected query shape is:
SELECT max(partition_column)
FROM my_table
WHERE struct.subfield IS NOT NULL
@rschlussel rschlussel force-pushed the fix-metadata-query-optimizer branch from fd95ead to fc6a827 Compare November 28, 2022 18:52

@Test(dataProvider = "optimize_metadata_filter_properties")
public void testMetadataAggregationFoldingWithFilters(boolean pushdownSubfieldsEnabled, boolean pushdownFilterEnabled)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@kaikalur kaikalur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rschlussel rschlussel merged commit 82441f5 into prestodb:master Nov 28, 2022
@wanglinsong wanglinsong mentioned this pull request Jan 12, 2023
30 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants