Skip to content

Pull dataSize statistics from Hive for VARCHAR columns#11176

Merged
arhimondr merged 13 commits intoprestodb:masterfrom
arhimondr:epic/cbo/pr/data-size/arhimondr/fixups
Aug 2, 2018
Merged

Pull dataSize statistics from Hive for VARCHAR columns#11176
arhimondr merged 13 commits intoprestodb:masterfrom
arhimondr:epic/cbo/pr/data-size/arhimondr/fixups

Conversation

@arhimondr
Copy link
Member

@arhimondr arhimondr commented Aug 2, 2018

Supersedes #11107

@arhimondr arhimondr force-pushed the epic/cbo/pr/data-size/arhimondr/fixups branch from 2b3f534 to bd4b70f Compare August 2, 2018 15:11
@arhimondr
Copy link
Member Author

Merged

@arhimondr arhimondr closed this Aug 2, 2018
@arhimondr arhimondr merged commit bd4b70f into prestodb:master Aug 2, 2018
@arhimondr arhimondr deleted the epic/cbo/pr/data-size/arhimondr/fixups branch August 2, 2018 15:20
Copy link
Contributor

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retrospective comments


return new Estimate(totalNullsCount.getValue() / totalRowsCount.getValue());
verify(knownRowCount > 0);
return new Estimate(knownDataSize / knownRowCount * rowCount.getValue());
Copy link
Contributor

@sopel39 sopel39 Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be new Estimate(knownDataSize / nonNullKnownRowCount * totalKonNullRowCount);

as TableScanStatsRule#toSymbolStatistics uses:

.setAverageRowSize(columnStatistics.getOnlyRangeColumnStatistics().getDataSize().getValue() / nonNullRowsCount)

as RangeColumnStatistics#getDataSize only relates to non-null rows.

Also see my comment: #11107 (comment)
I think this method can be simplified to be similar as: https://github.com/starburstdata/presto-private/blob/0a47179b9d0df73724863fba226372a31c9355f5/presto-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java#L381

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still:

This should be new Estimate(knownDataSize / nonNullKnownRowCount * totalKonNullRowCount);

as TableScanStatsRule#toSymbolStatistics uses:

.setAverageRowSize(columnStatistics.getOnlyRangeColumnStatistics().getDataSize().getValue() / nonNullRowsCount)

as RangeColumnStatistics#getDataSize only relates to non-null rows.

is a valid comment I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we are doing here is a some kind of a linear "extrapolation"

Given the number of nulls distribution is the same (or similar) across the partition

The

<know_data_size> * ( <total_non_null_row_count> / <known_non_null_row_count>)

is effectively the same as

<know_data_size> * ( <total_row_count> / <known_row_count> )

Since

<know_data_size> * ((<total_row_count> - (<total_row_count> * <null_fraction>)) / (<known_row_count> - (<known_row_count> * <null_fraction>))

=

<know_data_size> * ((<total_row_count> * (1 - <null_fraction>)) / (<known_row_count> * (1 - <null_fraction>))

=

<know_data_size> * (<total_row_count>  / <known_row_count>)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants