Support evaluating min/max only metadata query #14845
Merged
shixuan-fan merged 4 commits into prestodb:master (Jul 27, 2020)
6d81914 to bfca28a
Assume a daily-ingested table partitioned on ds. A filter like `ds = (SELECT '2020-07-01')` is converted into an INNER JOIN, but the value is not passed to the other side of the join, which leads to a full table scan. This commit enables the value to be treated as a predicate, so only that one partition needs to be read.
highker
reviewed
Jul 27, 2020
"Push expression translation above MetadataQueryOptimizer" LGTM
highker
reviewed
Jul 27, 2020
"Remove unused field": Change the title to "Remove unused field for MetadataQueryOptimizer"
highker
reviewed
Jul 27, 2020
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveLogicalPlanner.java
Outdated
...main/src/main/java/com/facebook/presto/sql/planner/optimizations/MetadataQueryOptimizer.java
Outdated
...main/src/main/java/com/facebook/presto/sql/planner/optimizations/MetadataQueryOptimizer.java
Outdated
...main/src/main/java/com/facebook/presto/sql/planner/optimizations/MetadataQueryOptimizer.java
Outdated
...main/src/main/java/com/facebook/presto/sql/planner/optimizations/MetadataQueryOptimizer.java
Outdated
Comment on lines 270 to 280
Contributor
Author
The result would be null if all values are null.
Contributor
Author
That being said, I should probably move this to be the first step of this function.
Assume a daily-ingested table partitioned on ds. One common use case is to fetch data from the latest ds partition, for example with a filter like `ds = (SELECT max(ds) FROM table)`. However, this filter is converted into an INNER JOIN and leads to a full table scan on the other side of the join. Instead, this commit enables a query like `SELECT max(ds) FROM table` to be evaluated at optimization time when OPTIMIZE_METADATA_QUERIES is set to true and converted into a ValuesNode, which can then be pushed to the other side of the join to avoid an expensive full table scan.
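To make the idea concrete, here is a minimal sketch of evaluating min/max purely from a list of partition values, as one might obtain from a metastore. It is not the MetadataQueryOptimizer code: the class name `MinMaxFromMetadataSketch` and both method names are hypothetical, and partition values are modeled as plain strings. Following the review discussion, the result is empty (SQL NULL) when every value is null.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical illustration only; names here are assumptions, not Presto APIs.
public class MinMaxFromMetadataSketch
{
    // Largest non-null partition value, or empty (SQL NULL) if there is none.
    public static Optional<String> maxPartitionValue(List<String> partitionValues)
    {
        return partitionValues.stream()
                .filter(Objects::nonNull)
                .max(Comparator.naturalOrder());
    }

    // Smallest non-null partition value, or empty (SQL NULL) if there is none.
    public static Optional<String> minPartitionValue(List<String> partitionValues)
    {
        return partitionValues.stream()
                .filter(Objects::nonNull)
                .min(Comparator.naturalOrder());
    }

    public static void main(String[] args)
    {
        // Partition values as they might be listed by a metastore (assumed data).
        List<String> ds = Arrays.asList("2020-07-01", "2020-07-02", null);
        System.out.println(maxPartitionValue(ds).orElse(null)); // 2020-07-02
        System.out.println(minPartitionValue(ds).orElse(null)); // 2020-07-01
    }
}
```

Because the answer is a single constant known at planning time, it can stand in for the subquery as a one-row value, which is what lets the predicate reach the partitioned side of the join.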
highker
approved these changes
Jul 27, 2020
Note that enabling the existing config `optimizer.optimize-metadata-queries` and session property `optimize_metadata_queries` might change query results if there is metadata that refers to empty data, e.g. an empty Hive partition. For example, suppose we have two Hive ds partitions, `2020-07-01` and `2020-08-01`, and `2020-08-01` is an empty partition. Without the metadata optimizer, the `ds` rows come from data, and since `2020-08-01` does not have any data, it will not appear in the result (e.g. `DISTINCT ds` would only return `2020-07-01`). With the metadata optimizer enabled, the `ds` rows come from the metastore, and `DISTINCT ds` would return both rows.
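The difference can be sketched with a toy model in which each partition is mapped to its row count. This is an illustration under assumptions, not Presto or Hive code: the class `EmptyPartitionSketch`, its methods, and the row counts are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model; names and data are assumptions for illustration.
public class EmptyPartitionSketch
{
    // DISTINCT ds computed from data: a partition contributes only if it has rows.
    public static Set<String> distinctFromData(Map<String, Integer> rowCountByPartition)
    {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : rowCountByPartition.entrySet()) {
            if (entry.getValue() > 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // DISTINCT ds computed from metadata: every registered partition contributes.
    public static Set<String> distinctFromMetadata(Map<String, Integer> rowCountByPartition)
    {
        return new TreeSet<>(rowCountByPartition.keySet());
    }

    public static void main(String[] args)
    {
        Map<String, Integer> partitions = new LinkedHashMap<>();
        partitions.put("2020-07-01", 100); // assumed non-empty partition
        partitions.put("2020-08-01", 0);   // empty partition
        System.out.println(distinctFromData(partitions));     // [2020-07-01]
        System.out.println(distinctFromMetadata(partitions)); // [2020-07-01, 2020-08-01]
    }
}
```

By the same reasoning, a min/max evaluated from metadata would presumably see the empty partition too, so `max(ds)` in this toy model would be `2020-08-01` rather than `2020-07-01`.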