Improve CBO estimates for certain scenarios by raunaqmorarka · Pull Request #11066 · trinodb/trino

raunaqmorarka · 2022-02-16T13:03:37Z

Description

The overall goal of this PR is to enable optimizer.default-filter-factor-enabled by default.
Enabling default-filter-factor with the existing implementation significantly improves q18 and q21 on TPC-H.
However, it also causes regressions on certain benchmark queries (TPC-DS partitioned q64, TPC-DS unpartitioned q78).
The first 2 commits update the estimation logic of filters and joins to address the
underestimation of filter conjunctions and the overestimation of multi-clause joins observed
when default-filter-factor is enabled with the existing implementation.

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Query optimizer

How would you describe this change to a non-technical end user or system administrator?

Improves CBO estimates in the presence of hard-to-estimate terms.

Related issues, pull requests, and links

Documentation

TODO
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

TODO
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

raunaqmorarka · 2022-02-17T10:38:32Z

TPCH/TPCDS benchmark results on ORC sf1000
ORC sf1000 partitioned filter factor.pdf
ORC sf1000 unpartitioned filter factor.pdf

core/trino-main/src/main/java/io/trino/FeaturesConfig.java

sopel39 · 2022-02-22T14:41:23Z

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

IMO applying UNKNOWN_FILTER_COEFFICIENT should only happen when default-filter-factor is on (we should be consistent in how we use UNKNOWN_FILTER_COEFFICIENT).

This is meant to preserve existing behaviour. That change should be in a separate PR.

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

sopel39 · 2022-02-22T15:44:24Z

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

nit: in the future, we could try to derive which term is on a primary column (or column set) and only account for terms which are not on a primary column

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

core/trino-main/src/main/java/io/trino/cost/StatisticRange.java

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

core/trino-main/src/test/java/io/trino/cost/TestFilterStatsCalculator.java

sopel39 · 2022-02-22T16:37:06Z

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

NIT (off topic): I was thinking about how this could be implemented so that we can rely on set operations (add, intersect, etc.) without resorting to methods that are shady from a math POV, like intersectCorrelatedStats. IMO, ideally, PlanNodeStatsEstimate would have a correlation matrix between columns so that we could chain process calls together, and each subsequent predicate evaluation would take correlation into account. This is too big a change, though.

core/trino-main/src/main/java/io/trino/SystemSessionProperties.java

core/trino-main/src/main/java/io/trino/cost/CboFeaturesConfig.java

core/trino-main/src/main/java/io/trino/cost/ComparisonStatsCalculator.java

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

sopel39 · 2022-02-23T10:49:48Z

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

this should probably be some kind of weighted average, but I'm not sure what weighted means here

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

core/trino-main/src/main/java/io/trino/cost/CboFeaturesConfig.java

core/trino-main/src/main/java/io/trino/cost/JoinStatsRule.java

core/trino-main/src/main/java/io/trino/SystemSessionProperties.java

core/trino-main/src/main/java/io/trino/cost/JoinStatsRule.java

sopel39

great job % comments

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestShowStats.java

core/trino-main/src/test/java/io/trino/cost/TestFilterStatsRule.java

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

testing/trino-tests/src/test/java/io/trino/tests/tpch/TestTpchDistributedStats.java

lukasz-stec

lgtm
perf improvements in affected queries are impressive.

core/trino-main/src/main/java/io/trino/cost/JoinStatsRule.java

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

core/trino-main/src/main/java/io/trino/SystemSessionProperties.java

sopel39

lgtm % comments % please re-run benchmarks with newest changes

core/trino-main/src/main/java/io/trino/cost/StatisticRange.java

core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java

core/trino-main/src/main/java/io/trino/cost/PlanNodeStatsEstimateMath.java

core/trino-main/src/test/java/io/trino/cost/TestFilterStatsCalculator.java

core/trino-main/src/main/java/io/trino/cost/JoinStatsRule.java

core/trino-main/src/test/java/io/trino/cost/TestFilterStatsCalculator.java

core/trino-main/src/test/java/io/trino/cost/TestJoinStatsRule.java

raunaqmorarka · 2022-03-02T07:09:58Z

Re-ran benchmarks, got almost the same results as before.

Estimation sf1000 orc partitioned .pdf
Estimation sf1000 orc unpartitioned.pdf

sopel39

LGTM % comments % please add TODO (#11066 (comment))

core/trino-main/src/test/java/io/trino/cost/TestFilterStatsCalculator.java

sopel39 · 2022-03-02T10:12:30Z

core/trino-main/src/test/java/io/trino/cost/TestJoinStatsRule.java

nit: a static import for LESS_THAN (and maybe others) would make the code clearer

testing/trino-tests/src/test/java/io/trino/tests/tpch/TestTpchLocalStats.java

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestShowStats.java

sopel39 · 2022-03-02T10:26:34Z

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

cc @findepi @losipiuk.
This makes me wonder whether we should have two default filter factor triggers:
* one global
* one for individual predicates that we cannot estimate.

testing/trino-tests/src/test/java/io/trino/tests/tpch/TestTpchDistributedStats.java

core/trino-main/src/main/java/io/trino/cost/CboFeaturesConfig.java

Currently we assume that there is no correlation between the terms of a filter conjunction. This can result in underestimation, as there is often some correlation between columns in real data sets. In particular, predicates inferred on the build-side relation through a join with a partitioned table are often correlated with user-provided predicates on the build side. Estimation for filter conjunctions now applies an exponential decay to the selectivity of each successive term to reduce the chance of underestimation. optimizer.filter-conjunction-independence-factor is added to allow tuning the strength of the independence assumption.
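
The conjunction estimate described above can be sketched as follows. This is a minimal illustration in Python, not the Java in FilterStatsCalculator, and it assumes one particular reading of "exponential decay": terms are ordered from most to least selective, and the selectivity of the i-th term is raised to the power independence_factor ** i. The function name and the example selectivities are hypothetical.

```python
def conjunction_selectivity(term_selectivities, independence_factor):
    """Combine per-term selectivities of an AND filter (illustrative sketch).

    independence_factor = 1.0 treats the terms as fully independent (plain product);
    independence_factor = 0.0 lets the most selective term alone decide the result.
    """
    if not 0.0 <= independence_factor <= 1.0:
        raise ValueError("independence_factor must be in [0, 1]")
    combined = 1.0
    exponent = 1.0  # decays geometrically: 1, f, f^2, ...
    for selectivity in sorted(term_selectivities):  # most selective term first
        combined *= selectivity ** exponent
        exponent *= independence_factor
    return combined


# Hypothetical terms that keep 10% and 50% of the input rows.
terms = [0.5, 0.1]
print(conjunction_selectivity(terms, 1.0))   # independent terms: plain product
print(conjunction_selectivity(terms, 0.0))   # fully correlated: most selective term only
print(conjunction_selectivity(terms, 0.75))  # somewhere in between
```

Because every selectivity is at most 1, any factor below 1 yields an estimate at or above the plain product, which is the direction needed to counter underestimation.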

Currently we assume that there is perfect correlation between the clauses of a join and use the most selective clause to drive output row count estimation. This can result in overestimation, as the join key columns are not necessarily perfectly correlated in real data sets. Estimation for multi-clause joins now applies an exponential decay to the selectivity of each successive clause to reduce the chance of overestimation. optimizer.join-multi-clause-independence-factor is added to allow tuning the strength of the independence assumption.
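
A sketch of the multi-clause join case, assuming the decay works as a geometrically shrinking exponent on each successive clause. It is Python rather than the Java in JoinStatsRule, and the per-clause selectivities are passed in directly as hypothetical inputs; how each clause's selectivity is derived is outside the scope of the sketch.

```python
def join_output_rows(left_rows, right_rows, clause_selectivities, independence_factor):
    """Estimate join output rows from per-clause selectivities (illustrative sketch).

    Each clause selectivity is the fraction of the cross product kept by that
    clause alone. independence_factor = 0.0 reproduces the "most selective
    clause only" behaviour; 1.0 multiplies all clauses as if independent.
    """
    combined = 1.0
    exponent = 1.0  # decays geometrically: 1, f, f^2, ...
    for selectivity in sorted(clause_selectivities):  # most selective clause first
        combined *= selectivity ** exponent
        exponent *= independence_factor
    return left_rows * right_rows * combined


# Hypothetical two-clause join between two 1000-row relations.
clauses = [0.001, 0.01]
for factor in (0.0, 0.5, 1.0):
    print(factor, join_output_rows(1000, 1000, clauses, factor))
```

With these inputs the estimate falls from roughly 1000 rows at factor 0 to roughly 10 rows at factor 1, so a small non-zero factor trims the estimate without assuming the clauses are fully independent.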

Using an estimate of 0.9 * (input row count) when an unestimated term is encountered during filter estimation allows the CBO to produce significantly better plans in certain queries. q18 and q21 on TPCH in particular improve significantly.
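
The fallback can be sketched as below, again in Python rather than the engine's Java. It assumes that the 0.9 factor is the UNKNOWN_FILTER_COEFFICIENT mentioned in the review, and it gates the fallback on the default-filter-factor setting as the review suggests; the function name and the use of None for "cannot be estimated" are hypothetical.

```python
import math

# Assumed fraction of input rows that pass a term the CBO cannot estimate.
UNKNOWN_FILTER_COEFFICIENT = 0.9


def filter_output_rows(input_rows, selectivity, default_filter_factor_enabled):
    """Return estimated output rows; selectivity is None when the term cannot be estimated."""
    if selectivity is not None:
        return input_rows * selectivity
    if default_filter_factor_enabled:
        return input_rows * UNKNOWN_FILTER_COEFFICIENT
    return math.nan  # unknown estimate: nothing for the optimizer to cost with


print(filter_output_rows(1000, None, True))   # fallback estimate
print(filter_output_rows(1000, None, False))  # estimate stays unknown
```

A rough non-empty estimate lets cost-based decisions such as join ordering proceed, whereas an unknown row count leaves the optimizer with nothing to compare.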

cla-bot bot added the cla-signed label Feb 16, 2022

raunaqmorarka requested a review from sopel39 February 16, 2022 13:03

raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from f690084 to 20985f9 February 17, 2022 07:56

github-actions bot added the tests:hive label Feb 17, 2022

sopel39 requested a review from losipiuk February 22, 2022 13:25

sopel39 reviewed Feb 22, 2022

sopel39 requested review from lukasz-stec and skrzypo987 February 22, 2022 16:29

sopel39 reviewed Feb 22, 2022

raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from 35c687b to 897ce68 February 23, 2022 09:53

sopel39 reviewed Feb 23, 2022

skrzypo987 reviewed Feb 23, 2022

core/trino-main/src/main/java/io/trino/cost/CboFeaturesConfig.java (outdated)

core/trino-main/src/main/java/io/trino/cost/JoinStatsRule.java (outdated)

sopel39 reviewed Feb 23, 2022

core/trino-main/src/main/java/io/trino/cost/JoinStatsRule.java (outdated)

sopel39 reviewed Feb 23, 2022

raunaqmorarka force-pushed the cbo-estimate branch 5 times, most recently from e8cf33e to 2264273 February 24, 2022 07:39

lukasz-stec approved these changes Feb 24, 2022

sopel39 reviewed Feb 24, 2022

core/trino-main/src/main/java/io/trino/SystemSessionProperties.java (outdated)

raunaqmorarka force-pushed the cbo-estimate branch 4 times, most recently from 913cf0a to 7edbf97 February 24, 2022 12:44

raunaqmorarka force-pushed the cbo-estimate branch 5 times, most recently from 1b9433d to efb9095 February 28, 2022 19:21

raunaqmorarka requested a review from sopel39 February 28, 2022 19:22

raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from 571db4e to 0d6a82b March 1, 2022 08:01

sopel39 reviewed Mar 1, 2022

raunaqmorarka force-pushed the cbo-estimate branch 3 times, most recently from be1bee1 to b5d5519 March 2, 2022 07:08

sopel39 approved these changes Mar 2, 2022

findepi reviewed Mar 2, 2022

testing/trino-tests/src/test/java/io/trino/tests/tpch/TestTpchDistributedStats.java (outdated)

findepi reviewed Mar 2, 2022

core/trino-main/src/main/java/io/trino/cost/CboFeaturesConfig.java (outdated)

raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from 4ee1b54 to 54c50ba March 2, 2022 11:45

sopel39 reviewed Mar 2, 2022

core/trino-main/src/main/java/io/trino/cost/CboFeaturesConfig.java (outdated)

sopel39 mentioned this pull request Mar 2, 2022

Make partial aggregation adaptive #11011

Merged

raunaqmorarka added 6 commits March 4, 2022 10:22

Move averageExcludingNaNs to MoreMath

14f156b

Move minExcludeNaN, maxExcludeNaN to MoreMath

c18ee2b

Remove redundant firstNonNaN method

0caefca

raunaqmorarka force-pushed the cbo-estimate branch from 54c50ba to ff896d1 March 4, 2022 04:52

raunaqmorarka mentioned this pull request Mar 4, 2022

Improve CBO estimates for correlated columns #11324

Merged

raunaqmorarka closed this Mar 21, 2022

raunaqmorarka deleted the cbo-estimate branch February 13, 2023 03:29
