Pull dataSize statistics from Hive for VARCHAR columns#11176
arhimondr merged 13 commits into prestodb:master
Before the change, the tests would report something like
java.lang.AssertionError: distinctValuesCount-s differ expected [false] but found [true]
Expected :false
Actual :true
which is not clear -- `distinctValuesCount` is not a boolean.
return new Estimate(totalNullsCount.getValue() / totalRowsCount.getValue());
verify(knownRowCount > 0);
return new Estimate(knownDataSize / knownRowCount * rowCount.getValue());
This should be new Estimate(knownDataSize / nonNullKnownRowCount * totalNonNullRowCount);
as TableScanStatsRule#toSymbolStatistics uses:
.setAverageRowSize(columnStatistics.getOnlyRangeColumnStatistics().getDataSize().getValue() / nonNullRowsCount)
as RangeColumnStatistics#getDataSize only relates to non-null rows.
Also see my comment: #11107 (comment)
I think this method can be simplified to something similar to: https://github.com/starburstdata/presto-private/blob/0a47179b9d0df73724863fba226372a31c9355f5/presto-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java#L381
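To make the point about non-null rows concrete, here is a minimal sketch of averaging a column's data size over non-null rows only. The class and method names are hypothetical and are not part of Presto; the only thing taken from the thread is that the data size covers non-null rows, so the divisor should be the non-null row count.

```java
public class AverageRowSizeSketch {
    // Hypothetical helper, not a Presto API: the data size is assumed to
    // cover non-null rows only, so it is divided by the non-null row count.
    public static double averageRowSize(double dataSize, double rowCount, double nullsFraction) {
        double nonNullRowsCount = rowCount * (1 - nullsFraction);
        if (nonNullRowsCount <= 0) {
            throw new IllegalArgumentException("no non-null rows to average over");
        }
        return dataSize / nonNullRowsCount;
    }

    public static void main(String[] args) {
        // 1000 rows, 25% nulls, 3000 bytes of data in the 750 non-null rows.
        System.out.println(averageRowSize(3000, 1000, 0.25)); // prints 4.0
    }
}
```

Dividing by the total row count instead would give 3.0 here, which understates the size of each non-null value.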
Still:
This should be new Estimate(knownDataSize / nonNullKnownRowCount * totalNonNullRowCount);
as TableScanStatsRule#toSymbolStatistics uses:
.setAverageRowSize(columnStatistics.getOnlyRangeColumnStatistics().getDataSize().getValue() / nonNullRowsCount)
as RangeColumnStatistics#getDataSize only relates to non-null rows.
is a valid comment I think
What we are doing here is a kind of linear extrapolation.
Assuming the fraction of nulls is the same (or similar) across the partitions,
<known_data_size> * (<total_non_null_row_count> / <known_non_null_row_count>)
is effectively the same as
<known_data_size> * (<total_row_count> / <known_row_count>)
since
<known_data_size> * ((<total_row_count> - (<total_row_count> * <null_fraction>)) / (<known_row_count> - (<known_row_count> * <null_fraction>)))
=
<known_data_size> * ((<total_row_count> * (1 - <null_fraction>)) / (<known_row_count> * (1 - <null_fraction>)))
=
<known_data_size> * (<total_row_count> / <known_row_count>)
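As a quick numeric check of the argument above, the following sketch computes the extrapolated data size both ways. The class and method names are hypothetical, not Presto code, and the sketch assumes a single null fraction shared by the known and the total row sets, which is exactly the assumption the derivation relies on.

```java
public class DataSizeExtrapolation {
    // Scale the known data size by the ratio of total rows to known rows.
    public static double extrapolateByRowCount(double knownDataSize, double knownRowCount, double totalRowCount) {
        return knownDataSize / knownRowCount * totalRowCount;
    }

    // Scale by the ratio of non-null rows instead, assuming the same
    // null fraction applies to both the known and the total row counts.
    public static double extrapolateByNonNullRowCount(
            double knownDataSize, double knownRowCount, double totalRowCount, double nullFraction) {
        double knownNonNullRowCount = knownRowCount * (1 - nullFraction);
        double totalNonNullRowCount = totalRowCount * (1 - nullFraction);
        return knownDataSize / knownNonNullRowCount * totalNonNullRowCount;
    }

    public static void main(String[] args) {
        double byRows = extrapolateByRowCount(4000, 1000, 5000);
        double byNonNullRows = extrapolateByNonNullRowCount(4000, 1000, 5000, 0.25);
        // The two results agree up to floating-point rounding.
        System.out.println(Math.abs(byRows - byNonNullRows) < 1e-6);
    }
}
```

The factor (1 - nullFraction) cancels only because it is the same in the numerator and the denominator. If the partitions with known statistics had a noticeably different null fraction from the rest, the two formulas would no longer agree.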
Supersedes #11107