
Improve estimation of row count from partition samples #11333

Merged
sopel39 merged 1 commit into trinodb:master from raunaqmorarka:rowcount-skew
Mar 8, 2022

Conversation

@raunaqmorarka
Member

Description

Reduce the possibility of estimation errors in averageRowsPerPartition
and rowCount caused by a couple of outliers by excluding the
min and max rowCount values from the calculation of the
average rows per partition.
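
For illustration, the idea can be sketched as a trimmed mean over the sampled partition row counts. This is a minimal Python sketch, not the connector's actual code: the function names, the fallback to a plain mean when two or fewer samples are available, and the scaling by total partition count are all assumptions made for the example.

```python
from statistics import fmean
from typing import Optional, Sequence


def average_rows_per_partition(row_counts: Sequence[int]) -> Optional[float]:
    """Average sampled row counts, ignoring one min and one max value.

    Hypothetical helper: with two or fewer samples there is nothing left
    after trimming, so this sketch falls back to the plain mean (assumption).
    """
    if not row_counts:
        return None
    if len(row_counts) <= 2:
        return fmean(row_counts)
    ordered = sorted(row_counts)
    # Drop the smallest and the largest sample before averaging.
    return fmean(ordered[1:-1])


def estimate_row_count(
    sampled_row_counts: Sequence[int], total_partitions: int
) -> Optional[float]:
    """Scale the trimmed average up to the full partition count (assumption)."""
    average = average_rows_per_partition(sampled_row_counts)
    if average is None:
        return None
    return average * total_partitions
```

With samples `[100, 110, 90, 1_000_000]`, a plain mean is dominated by the single oversized partition, while the trimmed version averages only `100` and `110`.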

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

hive connector statistics

How would you describe this change to a non-technical end user or system administrator?

improves estimates for partitioned hive tables

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@raunaqmorarka
Member Author

TPC benchmark results for partitioned sf1000 orc
Rowcount skew fix sf1000 orc partitioned.pdf

Member

@sopel39 sopel39 left a comment

lgtm % comments

Member

@skrzypo987 skrzypo987 left a comment

Not an expert here, but seems legit.

Member

@lukasz-stec lukasz-stec left a comment

lgtm

Reduce the possibility of estimation errors in averageRowsPerPartition
and rowCount caused by a couple of outliers by excluding the
min and max rowCount values from the calculation of the
average rows per partition.

@sopel39
Member

sopel39 commented Mar 8, 2022

lgtm % mind automation

@raunaqmorarka
Member Author

Test failure due to #11368

4 participants