Skip to content

Optimize partition value evaluation in filter pushdown#25248

Merged
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:optimize_filter_pushdown
Jun 10, 2025
Merged

Optimize partition value evaluation in filter pushdown#25248
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:optimize_filter_pushdown

Conversation

@feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Jun 4, 2025

Description

Constraint is used to validate whether a hive partition matches the filters specified in query during filter pushdown.

if (constraint.predicate().isPresent() && !constraint.predicate().get().test(partition.getKeys())) {

Here constraint.predicate().get().test(partition.getKeys()) is called for one partition, and this function is called for all candidate partitions. However, it may not be necessary to call this function for all partitions.
For example, let's say we have a table with two partition keys, ds and ts, where ds is the date and ts is the hour of the day. When the query is querying data in the past 180 days, there can be 180*24=4320 candidate partitions. However, if the predicate in constraint only contains ds but not ts, we actually only need to call this function 180 times. It can lead to big difference especially when the predicate is expensive, for example large expressions or include complex string/json operations.
In the motivating example, we see query planning time decrease from 3 minutes to 10 seconds if we reduce the number of such function calls.
In this PR, I added a function to constraint to return the column handles which are used in predicate, and populate it during filter pushdown, so as to enable the above optimization.

Motivation and Context

Reduce query planning time.

Impact

Reduce query planning time.

Test Plan

Test with production queries end to end

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a function to Constraint to return the input arguments for the predicate

@feilong-liu feilong-liu requested a review from a team as a code owner June 4, 2025 16:32
@feilong-liu feilong-liu requested a review from ZacBlanco June 4, 2025 16:32
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Jun 4, 2025
Copy link
Member

@jaystarshot jaystarshot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the predicateInputs set and used?

@feilong-liu feilong-liu force-pushed the optimize_filter_pushdown branch from 66b24fb to 226b17b Compare June 5, 2025 18:57
@feilong-liu feilong-liu changed the title Add function to Constraint to return input arguments used Optimize partition value evaluation in filter pushdown Jun 5, 2025
@feilong-liu
Copy link
Contributor Author

Where are the predicateInputs set and used?

I was thinking to split changes to spi change and the rest, updated the code to include the logic which uses this information too.

}

ImmutableList.Builder<HivePartition> resultBuilder = ImmutableList.builder();
Map<List<NullableValue>, Boolean> cacheResult = new HashMap<>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache the result for predicates

Copy link
Member

@jaystarshot jaystarshot Jun 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this cache blow up for very large distinct partition values ? maybe a guava cache is better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, just updated with guava cache

@feilong-liu feilong-liu requested a review from jaystarshot June 5, 2025 19:01
@jaystarshot
Copy link
Member

LGTM , Is it easy to add a unit test in TestHivePartitionManager?

@feilong-liu feilong-liu force-pushed the optimize_filter_pushdown branch 3 times, most recently from 8fbe17d to 0cd06e3 Compare June 9, 2025 20:45
@feilong-liu
Copy link
Contributor Author

LGTM , Is it easy to add a unit test in TestHivePartitionManager?

Sure, added a new test

jaystarshot
jaystarshot previously approved these changes Jun 9, 2025
@feilong-liu
Copy link
Contributor Author

@jaystarshot Can you review it again? I just updated the TestHiveClientConfig to fix the failed test.

@steveburnett
Copy link
Contributor

Thanks for the release note entry! Is there somewhere in the documentation that it could link to?

@feilong-liu feilong-liu merged commit fdc6e65 into prestodb:master Jun 10, 2025
158 of 159 checks passed
@feilong-liu feilong-liu deleted the optimize_filter_pushdown branch June 10, 2025 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants