Pull dataSize statistics from Hive for VARCHAR columns by arhimondr · Pull Request #11176 · prestodb/presto

arhimondr · 2018-08-02T14:23:58Z

Supersedes #11107

Before the change, the tests would report something like `java.lang.AssertionError: distinctValuesCount-s differ expected [false] but found [true]` (`Expected :false`, `Actual :true`), which is not clear: `distinctValuesCount` is not a boolean.

arhimondr · 2018-08-02T15:12:43Z

Merged

sopel39

retrospective comments

sopel39 · 2018-08-07T12:47:45Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java


-        return new Estimate(totalNullsCount.getValue() / totalRowsCount.getValue());
+        verify(knownRowCount > 0);
+        return new Estimate(knownDataSize / knownRowCount * rowCount.getValue());


This should be `new Estimate(knownDataSize / nonNullKnownRowCount * totalNonNullRowCount)`, because `TableScanStatsRule#toSymbolStatistics` uses

.setAverageRowSize(columnStatistics.getOnlyRangeColumnStatistics().getDataSize().getValue() / nonNullRowsCount)

and `RangeColumnStatistics#getDataSize` only relates to non-null rows.

Also see my comment: #11107 (comment)
I think this method can be simplified to something similar to: https://github.com/starburstdata/presto-private/blob/0a47179b9d0df73724863fba226372a31c9355f5/presto-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java#L381

arhimondr · 2018-08-07

#11107 (comment)

sopel39 · 2018-08-07

Still:

> This should be `new Estimate(knownDataSize / nonNullKnownRowCount * totalNonNullRowCount)`, because `TableScanStatsRule#toSymbolStatistics` uses `.setAverageRowSize(columnStatistics.getOnlyRangeColumnStatistics().getDataSize().getValue() / nonNullRowsCount)` and `RangeColumnStatistics#getDataSize` only relates to non-null rows.

is a valid comment I think

arhimondr · 2018-08-13

What we are doing here is a kind of linear "extrapolation".

Assuming the fraction of nulls is the same (or similar) across partitions, the expression

<known_data_size> * (<total_non_null_row_count> / <known_non_null_row_count>)

is effectively the same as

<known_data_size> * (<total_row_count> / <known_row_count>)

since

<known_data_size> * ((<total_row_count> - (<total_row_count> * <null_fraction>)) / (<known_row_count> - (<known_row_count> * <null_fraction>)))

=

<known_data_size> * ((<total_row_count> * (1 - <null_fraction>)) / (<known_row_count> * (1 - <null_fraction>)))

=

<known_data_size> * (<total_row_count> / <known_row_count>)
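
To make the argument above concrete, here is a minimal sketch comparing the two extrapolations on made-up numbers. The class `DataSizeExtrapolation`, its method names, and all values are hypothetical and are not taken from `MetastoreHiveStatisticsProvider`. It only illustrates that the two formulas agree when the null fraction is uniform and can diverge when it is not.

```java
// Hypothetical illustration; not the actual Presto implementation.
final class DataSizeExtrapolation
{
    private DataSizeExtrapolation() {}

    // Scales the known data size by the ratio of all rows:
    // knownDataSize / knownRowCount * totalRowCount
    static double byRowCount(double knownDataSize, double knownRowCount, double totalRowCount)
    {
        if (knownRowCount <= 0) {
            throw new IllegalArgumentException("knownRowCount must be positive");
        }
        return knownDataSize / knownRowCount * totalRowCount;
    }

    // Scales the known data size by the ratio of non-null rows:
    // knownDataSize / knownNonNullRowCount * totalNonNullRowCount
    static double byNonNullRowCount(double knownDataSize, double knownNonNullRowCount, double totalNonNullRowCount)
    {
        if (knownNonNullRowCount <= 0) {
            throw new IllegalArgumentException("knownNonNullRowCount must be positive");
        }
        return knownDataSize / knownNonNullRowCount * totalNonNullRowCount;
    }

    public static void main(String[] args)
    {
        // Assumed example values: partitions with statistics hold 1,000 rows
        // and 8,000 bytes of non-null data; the whole table has 4,000 rows.
        double knownDataSize = 8_000;
        double knownRowCount = 1_000;
        double totalRowCount = 4_000;

        // Uniform null fraction (20% everywhere): (1 - nullFraction) cancels out.
        double nullFraction = 0.2;
        double byRows = byRowCount(knownDataSize, knownRowCount, totalRowCount);
        double byNonNull = byNonNullRowCount(
                knownDataSize,
                knownRowCount * (1 - nullFraction),
                totalRowCount * (1 - nullFraction));
        System.out.println("uniform nulls: " + byRows + " vs " + byNonNull);

        // Non-uniform nulls (20% in known partitions, 50% overall): results differ.
        double skewed = byNonNullRowCount(knownDataSize, knownRowCount * 0.8, totalRowCount * 0.5);
        System.out.println("skewed nulls:  " + byRows + " vs " + skewed);
    }
}
```

The second case is where the two comments in this thread differ in practice: if the partitions with statistics are not representative of the table's null fraction, scaling by non-null row counts gives a different estimate than scaling by total row counts.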

facebook-github-bot added the CLA Signed label Aug 2, 2018

arhimondr mentioned this pull request Aug 2, 2018

Pull dataSize statistics from Hive for VARCHAR columns #11107

Closed

findepi and others added 13 commits August 2, 2018 11:08

Fix assertion messages in Tpch, Tpcds stat tests

212cd5c

Before the change, the tests would report something like `java.lang.AssertionError: distinctValuesCount-s differ expected [false] but found [true]` (`Expected :false`, `Actual :true`), which is not clear: `distinctValuesCount` is not a boolean.

Extract variable

daff49a

Fix NATION_PARTITIONED_BY_REGIONKEY data files

e907dbe

Remove unnecessary condition

899d6a6

Check value is known as early as possible

5817e55

Remove redundant else

d694edc

Rename parameter to match meaning

f673abd

Extract rowsPerPartition calculation

9011742

Pull dataSize statistics from Hive for VARCHAR columns

122015c

Inline StatisticsEstimator#addPartitionStats's overload

d817709

Update TPC-H statistics with data size

fc5419d

Update TPC-DS statistics with data size

b69e78b

Add statistics test for table partitioned by varchar column

bd4b70f

arhimondr force-pushed the epic/cbo/pr/data-size/arhimondr/fixups branch from 2b3f534 to bd4b70f August 2, 2018 15:11

arhimondr closed this Aug 2, 2018

arhimondr merged commit bd4b70f into prestodb:master Aug 2, 2018

arhimondr deleted the epic/cbo/pr/data-size/arhimondr/fixups branch August 2, 2018 15:20

sopel39 reviewed Aug 7, 2018
