Improve CBO estimates for certain scenarios#11066
raunaqmorarka wants to merge 6 commits into trinodb:master
TPCH/TPCDS benchmark results on ORC sf1000
IMO applying UNKNOWN_FILTER_COEFFICIENT should only happen when default-filter-factor is on (we should be consistent in how we use UNKNOWN_FILTER_COEFFICIENT).
This is meant to preserve existing behaviour. That change should be in a separate PR.
nit: in the future, we could try to derive which terms are on the primary column (or column set) and only account for terms which are not on the primary column
NIT (off topic): I was thinking about how this could be implemented so that we can rely on set operations (add, intersect, etc.) without resorting to methods that are shady from a math POV, like intersectCorrelatedStats. IMO, ideally, PlanNodeStatsEstimate would have a correlation matrix between columns, so that we could chain process calls together and each subsequent predicate evaluation would take correlation into account. This is too big a change, though.
this should probably be some kind of weighted average, but I'm not sure what weighted means here
lukasz-stec left a comment
lgtm
perf improvements in affected queries are impressive.
sopel39 left a comment
lgtm % comments % please re-run benchmarks with newest changes
Re-ran benchmarks, got almost the same results as before. Estimation sf1000 orc partitioned .pdf
sopel39 left a comment
LGTM % comments % please add TODO (#11066 (comment))
nit: a static import for LESS_THAN (and maybe others) would make the code clearer
Currently we assume that there is no correlation between the terms of a filter conjunction. This can result in underestimation, as there is often some correlation between columns in real data sets. In particular, predicates inferred on the build-side relation through a join with a partitioned table are often correlated with user-provided predicates on the build side. Estimation for filter conjunctions now applies an exponential decay to the selectivity of each successive term to reduce the chance of underestimation. optimizer.filter-conjunction-independence-factor is added to allow tuning the strength of the independence assumption.
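The decay described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, and ordering the terms from most to least selective before applying the decay is an assumption. Each successive term's selectivity is raised to a shrinking exponent, so later terms reduce the estimate less than they would under full independence.

```java
import java.util.Arrays;

// Hypothetical sketch; names are illustrative and not taken from the Trino code base.
final class ConjunctionDecaySketch
{
    private ConjunctionDecaySketch() {}

    // termSelectivities: estimated selectivity of each conjunct, each in (0, 1].
    // independenceFactor: 1.0 treats terms as fully independent (plain product),
    // 0.0 treats them as fully correlated (only the most selective term counts).
    static double combinedSelectivity(double[] termSelectivities, double independenceFactor)
    {
        double[] sorted = termSelectivities.clone();
        Arrays.sort(sorted); // assumption: most selective (smallest) term first

        double combined = 1.0;
        double exponent = 1.0;
        for (double selectivity : sorted) {
            combined *= Math.pow(selectivity, exponent);
            exponent *= independenceFactor; // exponential decay for the next term
        }
        return combined;
    }
}
```

For terms with selectivities 0.5, 0.1 and 0.2, a factor of 1.0 gives the plain product, a factor of 0.0 gives the selectivity of the most selective term alone, and values in between land between those two extremes.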
Currently we assume that there is perfect correlation between the clauses of a join and use the most selective clause to drive output row count estimation. This can result in overestimation, as the columns in join keys are not necessarily perfectly correlated in real data sets. Estimation for multi-clause joins now applies an exponential decay to the selectivity of each successive clause to reduce the chance of overestimation. optimizer.join-multi-clause-independence-factor is added to allow tuning the strength of the independence assumption.
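A rough sketch of the multi-clause join case, under stated assumptions: the names are hypothetical, each clause's selectivity is expressed relative to the cross product of the two inputs, and clauses are ordered from most to least selective. With a factor of 0.0 this reduces to the previous behaviour (only the most selective clause drives the estimate); larger factors let the remaining clauses shrink the estimate further.

```java
import java.util.Arrays;

// Hypothetical sketch; not the estimator implemented in this PR.
final class MultiClauseJoinSketch
{
    private MultiClauseJoinSketch() {}

    // clauseSelectivities: per-clause selectivity against the cross product,
    // e.g. 1 / max(ndvLeft, ndvRight) for an equality clause (an assumption here).
    static double estimateJoinRows(
            double leftRows,
            double rightRows,
            double[] clauseSelectivities,
            double independenceFactor)
    {
        double[] sorted = clauseSelectivities.clone();
        Arrays.sort(sorted); // assumption: most selective clause drives the estimate

        double rows = leftRows * rightRows;
        double exponent = 1.0;
        for (double selectivity : sorted) {
            rows *= Math.pow(selectivity, exponent);
            exponent *= independenceFactor; // each further clause counts for less
        }
        return rows;
    }
}
```

For two inputs of 1000 rows each and clause selectivities 0.01 and 0.1, a factor of 0.0 yields roughly 10000 rows, a factor of 1.0 roughly 1000 rows, and a factor of 0.5 a value in between.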
Using an estimate of 0.9 * (input row count) when an unestimated term is encountered during filter estimation allows the CBO to produce significantly better plans for certain queries. q18 and q21 on TPCH in particular improve significantly.
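The fallback can be illustrated with a small sketch. The class, method, and flag names are hypothetical; the constant name follows the UNKNOWN_FILTER_COEFFICIENT mentioned in the review, and tying it to the 0.9 value is an assumption based on the commit message above. When a term cannot be estimated, the sketch returns 0.9 times the input row count instead of giving up on the estimate.

```java
import java.util.OptionalDouble;

// Hypothetical sketch of the default filter factor fallback; not the PR's code.
final class DefaultFilterFactorSketch
{
    // Assumed value, taken from the "0.9 * (input row count)" wording.
    static final double UNKNOWN_FILTER_COEFFICIENT = 0.9;

    private DefaultFilterFactorSketch() {}

    // termSelectivity is empty when the term could not be estimated.
    static OptionalDouble estimateOutputRows(
            double inputRows,
            OptionalDouble termSelectivity,
            boolean defaultFilterFactorEnabled)
    {
        if (termSelectivity.isPresent()) {
            return OptionalDouble.of(inputRows * termSelectivity.getAsDouble());
        }
        if (defaultFilterFactorEnabled) {
            // Fall back to a mildly selective guess rather than an unknown estimate.
            return OptionalDouble.of(inputRows * UNKNOWN_FILTER_COEFFICIENT);
        }
        return OptionalDouble.empty(); // estimate stays unknown
    }
}
```

Keeping the result as an OptionalDouble makes the difference visible: with the flag off, an unestimated term leaves the whole estimate unknown, which is what prevents the CBO from costing the plans that contain it.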
Description
The overall goal of the PR is to enable optimizer.default-filter-factor-enabled by default.
If default-filter-factor is enabled with the existing implementation, it improves q18 and q21 on tpch significantly.
However, it also results in regressions on certain benchmark queries (tpcds partitioned q64, tpcds unpartitioned q78).
The first 2 commits update the estimation logic of filters and joins to address the problems
with underestimation of filter conjunctions and overestimation of multi-clause joins observed
when default-filter-factor is enabled with the existing implementation.
Improvement
Query optimizer
Improves CBO estimates in the presence of hard-to-estimate terms.
Related issues, pull requests, and links
Documentation
TODO
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
TODO
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: