Skip to content

Improve CBO estimates for certain scenarios#11066

Closed
raunaqmorarka wants to merge 6 commits intotrinodb:masterfrom
raunaqmorarka:cbo-estimate
Closed

Improve CBO estimates for certain scenarios#11066
raunaqmorarka wants to merge 6 commits intotrinodb:masterfrom
raunaqmorarka:cbo-estimate

Conversation

@raunaqmorarka
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka commented Feb 16, 2022

Description

Overall goal of the PR is to enable optimizer.default-filter-factor-enabled by default.
If default-filter-factor is enabled with existing implementation, it improves q18 and q21 on tpch significantly.
However, it also results in regressions on certain benchmark queries (tpcds partitioned q64, tpcds unpartitioned q78).
The first 2 commits update the estimation logic of filters and joins to address the problems
with underestimation of filter conjunctions and overestimation of multi-clause joins observed
when default-filter-factor is enabled with existing implementation.

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Query optimizer

How would you describe this change to a non-technical end user or system administrator?

Improves CBO estimates in the presence of hard to estimate terms.

Related issues, pull requests, and links

Documentation

TODO
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

TODO
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Feb 16, 2022
@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from f690084 to 20985f9 Compare February 17, 2022 07:56
@raunaqmorarka
Copy link
Copy Markdown
Member Author

raunaqmorarka commented Feb 17, 2022

TPCH/TPCDS benchmark results on ORC sf1000
ORC sf1000 partitioned filter factor.pdf
ORC sf1000 unpartitioned filter factor.pdf

@sopel39 sopel39 requested a review from losipiuk February 22, 2022 13:25
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO applying UNKNOWN_FILTER_COEFFICIENT should only happen when default-filter-factor is on (we should be consistent how we work with UNKNOWN_FILTER_COEFFICIENT)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meant to preserve existing behaviour. That change should be in a separate PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in the future, we could try to derive which term is on primary column (or column set) and only account for terms which are no on primary column

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: (off topic). I was thinking how this could be implemented so that we can just rely on set operations (add, intersect, etc) without going into shady methods (from math POV) like intersectCorrelatedStats. IMO, ideally, PlanNodeStatsEstimate has correlation matrix between columns so that we could just chain process calls together and each subsequent predicate evaluation would take correlation into account. This is too big change though.

@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from 35c687b to 897ce68 Compare February 23, 2022 09:53
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be some kind of weighted average, but I'm not sure what weighted means here

Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great job % comments

@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 5 times, most recently from e8cf33e to 2264273 Compare February 24, 2022 07:39
Copy link
Copy Markdown
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
perf improvements in affected queries are impressive.

@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 4 times, most recently from 913cf0a to 7edbf97 Compare February 24, 2022 12:44
@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 5 times, most recently from 1b9433d to efb9095 Compare February 28, 2022 19:21
@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from 571db4e to 0d6a82b Compare March 1, 2022 08:01
Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments % please re-run benchmarks with newest changes

@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 3 times, most recently from be1bee1 to b5d5519 Compare March 2, 2022 07:08
@raunaqmorarka
Copy link
Copy Markdown
Member Author

Re-ran benchmarks, got almost same results as before.

Estimation sf1000 orc partitioned .pdf
Estimation sf1000 orc unpartitioned.pdf

Copy link
Copy Markdown
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments % please add TODO (#11066 (comment))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static import for LESS_THAN (and maybe others) would make code more clear

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @findepi @losipiuk.
This makes me wonder if we shouldn't have two default filters factor triggers

  • one global
  • one for individual predicates that we cannot estimate.

@raunaqmorarka raunaqmorarka force-pushed the cbo-estimate branch 2 times, most recently from 4ee1b54 to 54c50ba Compare March 2, 2022 11:45
Currently we assume that there is no correlation between
the terms of a filter conjunction. This can result in underestimation
as there is often some correlation between columns in real data sets.
In particular, predicates inferred on the build side relation through
a join with a partitioned table are often correlated with user provided
predicates on the build side.
Estimation for filter conjunctions now applies an exponential decay
to the selectivity of each successive term to reduce chances of
under estimation.
optimizer.filter-conjunction-independence-factor is added to allow
tuning the strength of the independence assumption.
Currently we assume that there is perfect correlation between
the clauses of a join and use the most selective clause for
driving output row count estimation. This can result in overestimation
as it not necessary that columns in join keys are perfectly correlated
in real data sets.
Estimation for multi clause joins now applies an exponential decay
to the selectivity of each successive term to reduce chances of
over estimation.
optimizer.join-multi-clause-independence-factor is added to allow
tuning the strength of the independence assumption.
Using an estimate of 0.9 * (input row count) when an unestimated
term is encountered during filter estimation allows the CBO to
produce significantly better plans in certain queries.
q18 and q21 on TPCH in particular improve significantly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants