Pull dataSize statistics from Hive for VARCHAR columns #11107

Closed

findepi wants to merge 15 commits into prestodb:master from starburstdata:epic/cbo/pr/data-size

Conversation

@findepi (Contributor) commented Jul 23, 2018

No description provided.

@findepi (Contributor, Author) commented Jul 23, 2018

This is an alternative to #11043 -- it slipped my mind that we had this.
@rschlussel2 apologies that I didn't tell you earlier.

Member

Why 7? Is this an approximation error?

Contributor Author

This is NDVs -- did you look at the "Fix NATION_PARTITIONED_BY_REGIONKEY data files" commit alone?

Member

Yes. There are only 5 rows per partition in the data files.

Contributor Author

Yeah -- 5 or 7, I think Hive's stats don't really care, do they?

Member

Obviously they don't use a sparse HLL implementation to increase precision for the low-cardinality cases.

Member

Move the rowCount.isValueUnknown() check to the beginning of the method.

Member

I think we should take the partitions' row counts into account for a better estimate:

private Estimate calculateDataSize(Map<String, PartitionStatistics> statisticsSample, String columnName, Estimate totalRowCount)
{
    if (totalRowCount.isValueUnknown()) {
        return Estimate.unknownValue();
    }

    long knownRowCount = 0;
    double knownDataSize = 0;

    // Accumulate size only over partitions that know both their row count and the column's average length
    for (PartitionStatistics statistics : statisticsSample.values()) {
        OptionalLong partitionRowCount = statistics.getBasicStatistics().getRowCount();
        HiveColumnStatistics columnStatistics = statistics.getColumnStatistics().get(columnName);
        if (columnStatistics == null) {
            continue;
        }
        OptionalDouble averageColumnLength = columnStatistics.getAverageColumnLength();
        if (partitionRowCount.isPresent() && averageColumnLength.isPresent()) {
            knownRowCount += partitionRowCount.getAsLong();
            knownDataSize += averageColumnLength.getAsDouble() * partitionRowCount.getAsLong();
        }
    }

    if (knownRowCount == 0) {
        return Estimate.unknownValue();
    }

    // Extrapolate from the sampled partitions to the whole table
    return new Estimate((knownDataSize * totalRowCount.getValue()) / knownRowCount);
}

Contributor Author

The code assumed an average of averages is, on average, a "good enough" estimate, but of course we can make it more precise.
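
For intuition, here is a minimal, self-contained sketch -- with hypothetical partition sizes and average lengths, not values from this PR -- of how far an average of averages can drift from a row-weighted estimate:

public class AverageOfAveragesDemo
{
    public static void main(String[] args)
    {
        // Hypothetical partitions with very different row counts
        long rowsA = 1_000_000;
        long rowsB = 10;
        double avgLengthA = 10.0;   // average bytes per value in partition A
        double avgLengthB = 100.0;  // average bytes per value in partition B

        // Average of averages ignores partition sizes entirely
        double avgOfAverages = (avgLengthA + avgLengthB) / 2;   // 55.0 bytes/row

        // Row-weighted average tracks the true total data size
        double weighted = (avgLengthA * rowsA + avgLengthB * rowsB) / (rowsA + rowsB);   // ~10.0 bytes/row

        System.out.printf("average of averages: %.2f, weighted: %.2f%n", avgOfAverages, weighted);
    }
}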

Member

nit: Unrelated change. I'm fine with keeping it here though.

Member

nit: Move metadata.getColumnHandles(session, tableHandle) out of the loop

Member

How about:

private TableStatisticsData addPartitionStats(TableStatisticsData first, TableStatisticsData second, TpchColumn<?> partitionColumn)
{
    verify(first.getColumns().keySet().equals(second.getColumns().keySet()));
    Set<String> columns = first.getColumns().keySet();
    Map<String, ColumnStatisticsData> columnStatistics = columns.stream()
            .collect(toImmutableMap(
                    column -> column,
                    column -> combineColumnStatistics(first.getColumns().get(column), second.getColumns().get(column), column.equals(partitionColumn.getColumnName()))));
    return new TableStatisticsData(
            first.getRowCount() + second.getRowCount(),
            columnStatistics);
}

private ColumnStatisticsData combineColumnStatistics(ColumnStatisticsData first, ColumnStatisticsData second, boolean isPartitionColumn)
{
    Optional<Long> ndv = isPartitionColumn ? Optional.empty() : combine(first.getDistinctValuesCount(), second.getDistinctValuesCount(), (a, b) -> a + b);
    Optional<Object> min = combine(first.getMin(), second.getMin(), this::min);
    Optional<Object> max = combine(first.getMax(), second.getMax(), this::max);
    // Sum data sizes only if both known
    Optional<Long> dataSize = first.getDataSize()
            .flatMap(leftDataSize -> second.getDataSize().map(rightDataSize -> leftDataSize + rightDataSize));
    return new ColumnStatisticsData(ndv, min, max, dataSize);
}

Member

Passing the partitionColumn and the columnName all the way around seems unnatural. Also, the trick with filter is obfuscation. We can filter out the partition column in the method above. Remove this.

Member

Make all the methods but estimateStats static

Member

I would rather tidy up this class, since you are changing it anyway. The way it is implemented right now is hard to read. It is very confusing why TpchColumn<?> partitionColumn and String columnName are being passed all around. So I would either drop this commit altogether and pretend I never saw this class, or try to make it tidy =)

Member

refactor

Member

Unrelated change?

Member

Consider moving the statistics recorder to presto-tests, so you can reuse it in both TPC-H and TPC-DS.

Contributor Author

I don't think they are identical. Anyway, let's consider this outside of this PR.

Contributor

Is averageDataSize the average over all rows or only over non-null rows? If non-null rows, you need to account for that.

Contributor Author

@sopel39's internal PR to update this says it is for non-null rows only; will update.

@findepi (Contributor, Author) commented Jul 26, 2018

@arhimondr comments applied, except for the tpch stats estimations. Do you feel strongly about them?

@findepi findepi force-pushed the epic/cbo/pr/data-size branch from 0549095 to 44d90ff on July 26, 2018 09:09
findepi and others added 8 commits July 26, 2018 15:27
Before the change, the tests would report something like

    java.lang.AssertionError: distinctValuesCount-s differ expected [false] but found [true]
    Expected :false
    Actual   :true

which is not clear -- `distinctValuesCount` is not a boolean.
@findepi findepi force-pushed the epic/cbo/pr/data-size branch from 44d90ff to 2b278ca on July 26, 2018 13:28
@arhimondr (Member) left a comment

LGTM % comments

> This would mean that, for ASCII text, stats computed with Hive will always be 2x what we would calculate. This can be a problem for other tools. Also, it can be a problem for ourselves -- not knowing who calculated the stats, we won't know how to interpret them into our internal representation (or "just" be 2x off).

No. For ASCII text the stat is going to be exactly the same, since ASCII characters take 1 byte per symbol.
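
To make the char-vs-byte distinction concrete, a standalone sketch (illustrative only, not code from this PR):

import static java.nio.charset.StandardCharsets.UTF_8;

public class CharVersusByteDemo
{
    public static void main(String[] args)
    {
        String ascii = "nation";      // ASCII-only value
        String accented = "Curaçao";  // contains one non-ASCII character

        // For ASCII-only text, char count and UTF-8 byte count agree...
        System.out.println(ascii.length() + " chars, " + ascii.getBytes(UTF_8).length + " bytes");         // 6 chars, 6 bytes
        // ...for anything else they diverge, so an "average length" stat must state its unit
        System.out.println(accented.length() + " chars, " + accented.getBytes(UTF_8).length + " bytes");    // 7 chars, 8 bytes
    }
}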

.filter(partition -> partition.getKeys().get(partitionColumn).isNull())
.map(HivePartition::getPartitionId)
-.mapToLong(partitionId -> statisticsSample.get(partitionId).getBasicStatistics().getRowCount().orElse((long) rowsPerPartition.getAsDouble()))
+.mapToDouble(partitionId -> orElse(statisticsSample.get(partitionId).getBasicStatistics().getRowCount(), rowsPerPartition.getAsDouble()))
Member

How about max(statisticsSample.get(partitionId).getBasicStatistics().getRowCount().orElse(0), rowsPerPartition.getAsDouble())? Then you could remove this weird MetastoreHiveStatisticsProvider#orElse.

Contributor Author

This would work without the "unnice" orElse method, but it doesn't convey the actual meaning. Also, why should we assume a partition cannot have 0 rows?
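
A tiny sketch of the objection (literal, hypothetical values; not code from this PR): with the max(...) variant, a partition known to be empty is indistinguishable from one with a missing row count.

import java.util.OptionalLong;

public class ZeroRowPartitionDemo
{
    public static void main(String[] args)
    {
        OptionalLong knownEmpty = OptionalLong.of(0);   // partition known to have 0 rows
        double rowsPerPartition = 7.5;                  // hypothetical average over other partitions

        // max(...) replaces the known 0 with the average
        double viaMax = Math.max(knownEmpty.orElse(0), rowsPerPartition);                            // 7.5 -- inflated
        // a dedicated fallback kicks in only when the stat is actually absent
        double viaFallback = knownEmpty.isPresent() ? knownEmpty.getAsLong() : rowsPerPartition;     // 0.0

        System.out.println(viaMax + " vs " + viaFallback);
    }
}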

Member

I mean -1. I don't feel strongly about this, though.

return unmodifiableList(result);
}

private static double orElse(OptionalLong value, double other)
Member

imo: This method is ugly.

Contributor Author

Well, yes. But without it, it's even uglier. E.g. to preserve readability we were casting double to long, and we would get wrong results if the average rows per partition was within [0, 1).
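
A minimal sketch of that pitfall; the body of orElse here is inferred from its signature in the context above, so treat it as an assumption:

import java.util.OptionalLong;

public class OrElseDemo
{
    // Double-valued fallback for an OptionalLong, as sketched from the review context
    private static double orElse(OptionalLong value, double other)
    {
        return value.isPresent() ? value.getAsLong() : other;
    }

    public static void main(String[] args)
    {
        OptionalLong missing = OptionalLong.empty();
        double rowsPerPartition = 0.4;   // hypothetical average, within [0, 1)

        // Casting the fallback to long truncates it to 0 rows
        long truncated = missing.orElse((long) rowsPerPartition);
        // The double-valued helper preserves the fractional estimate
        double preserved = orElse(missing, rowsPerPartition);

        System.out.println(truncated + " vs " + preserved);   // 0 vs 0.4
    }
}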

Member

This method is gonna be way nicer once I change HiveColumnStatistics#averageColumnLength to HiveColumnStatistics#totalValuesSizeInBytes.

Member

nullsFraction.isValueUnknown() ? 0 : nullsFraction.getValue() ?

Contributor Author

thanks!

Member

== 0

p.s.: I used to like this kind of defensive programming, but then I realized that although it is more reliable, it introduces additional confusion for future readers. Plus, I don't think we do anything like this in Presto.

Contributor Author

I don't like == with doubles. It scares me a lot. Updated.

Member

knownNonNullRowCount

Contributor Author

Nope, the calculation includes nulls (nulls just have length 0).

Member

If the value is null, do not increment row count

Contributor Author

If the value is null, I have length = 0.
If I don't increment the row count, I cannot then multiply by the total row count (I would need to consider the nulls fraction as well).
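
A small numeric sketch (hypothetical counts, not from this PR) of why the two bookkeeping schemes only agree when the nulls fraction is folded in:

public class NullAccountingDemo
{
    public static void main(String[] args)
    {
        // Hypothetical partition: 10 rows, 4 of them null, 60 bytes of non-null data
        long rowCount = 10;
        long nullCount = 4;
        double dataSize = 60.0;

        // Scheme A (this PR): count every row; nulls contribute length 0,
        // so scaling by total row count needs no separate nulls adjustment
        double avgPerRow = dataSize / rowCount;                               // 6.0 bytes/row
        double estimateA = avgPerRow * rowCount;                              // 60.0

        // Scheme B: count only non-null rows; scaling by total rows must
        // then also fold in the nulls fraction, or it overestimates
        double avgPerNonNull = dataSize / (rowCount - nullCount);             // 10.0 bytes/value
        double nullsFraction = (double) nullCount / rowCount;                 // 0.4
        double estimateB = avgPerNonNull * rowCount * (1 - nullsFraction);    // 60.0

        System.out.println(estimateA + " == " + estimateB);
    }
}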

Member

nit: you could make it Map<String, Type> columnTypes to make it even nicer

Member

Now I'm confused whether you have seen this method. It is totally fine if you don't want to implement it for VARBINARY. I just want to make sure you didn't do it by mistake.

@findepi findepi force-pushed the epic/cbo/pr/data-size branch from 2b278ca to f653640 on July 26, 2018 14:14
@findepi (Contributor, Author) commented Jul 26, 2018

> No. For ASCII text the stat is going to be exactly the same, since ASCII characters take 1 byte per symbol.

Oh, I thought they were counting char elements and multiplying by 2 (as if calculating the memory consumption of a Java String). Thanks for explaining.

@arhimondr (Member)

Yeah, that's why I said:

> Unfortunately there is no reasonable translation that can be made.

I'm just storing the number of bytes. That seems to be the most reasonable thing we can do.

@findepi (Contributor, Author) commented Jul 26, 2018

> Now I'm confused whether you have seen this method. It is totally fine if you don't want to implement it for VARBINARY. I just want to make sure you didn't do it by mistake.

I think it was omitted to simplify things; binary partition keys are pretty unusual.
For now I added a TODO, but will try to update this tomorrow.

@arhimondr (Member)

Once you merge this I will move forward with https://github.com/arhimondr/presto/tree/collect-colunm-data-size

Estimate nullsFraction,
OptionalDouble rowsPerPartition)
{
if (rowCount.isValueUnknown() || !rowsPerPartition.isPresent()) {
Contributor

What's the usefulness of rowsPerPartition? From my perspective it doesn't seem that useful.

  1. It adds complexity to the method (there's this extra calculation, the ugly long->double orElse, etc.)
  2. It seems unlikely that we'd have a partition that has dataSize stats but not a number of rows.
  3. It'll give weird results if you have very variable partitions.

I think we'd be better off only using averageColumnLength where rowCount.isPresent() && averageColumnSize.isPresent() (and possibly nullCount.isPresent() too)

Contributor

This is what I had in the latest version from my PR. I think it's simpler.

private Estimate calculateDataSize(Map<String, PartitionStatistics> statisticsByPartitionName, String column, int numberOfPartitions)
{
    List<Double> knownPartitionDataSizes = statisticsByPartitionName.values().stream()
            .filter(stats -> stats.getBasicStatistics().getRowCount().isPresent()
                    && stats.getColumnStatistics().containsKey(column)
                    && stats.getColumnStatistics().get(column).getAverageColumnLength().isPresent()
                    && stats.getColumnStatistics().get(column).getNullsCount().isPresent())
            .map(stats -> {
                double averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength().getAsDouble();
                long rowCount = stats.getBasicStatistics().getRowCount().getAsLong();
                long nullsCount = stats.getColumnStatistics().get(column).getNullsCount().getAsLong();
                long nonNullsCount = rowCount - nullsCount;
                return averageColumnLength * nonNullsCount;
            })
            .collect(toImmutableList());

    double knownPartitionDataSizesSum = knownPartitionDataSizes.stream().mapToDouble(a -> a).sum();
    long partitionsWithStatsCount = knownPartitionDataSizes.size();

    if (partitionsWithStatsCount == 0) {
        return Estimate.unknownValue();
    }
    return new Estimate(knownPartitionDataSizesSum * numberOfPartitions / partitionsWithStatsCount);
}

Contributor Author

> What's the usefulness of rowsPerPartition? From my perspective it doesn't seem that useful.

This was extracted opportunistically -- the value was used in 2 existing places and I added 2 new usages.

> It'll give weird results if you have very variable partitions.

There is no ultimate cure for that, is there?

> This is what I had in the latest version from my PR. I think it's simpler.

Well, this PR was initially a lot simpler, but grew complicated along the way. If you don't feel strongly about this, I will keep it as it is.

Contributor

To clarify-- my question wasn't about why pass in rowsPerPartition vs. compute it. It was why is rowsPerPartition (which means something like "average number of rows per-partition for the partitions that we know how many rows they have") needed in this method at all. IMO it complicates the method without providing a whole lot of value. I don't think it would be very common for the value to actually be used (because generally we either won't know the averageColumnLength or we'll also know the number of rows or possibly we won't know the number of rows for any partitions). For the rare case where the value would get used, something weird is probably going on with the partitions, and so the estimate it provides probably isn't that good. Please let me know if I'm misunderstanding something

I don't care about using my code vs. a version of this. That was to illustrate how some of the complexity could be trimmed.

Contributor

@arhimondr says his upcoming pr is going to change this code anyway, so I'll approve.

@arhimondr (Member)

Test failures fixed, rebased: #11176

@arhimondr arhimondr closed this Aug 2, 2018
@sopel39 (Contributor) left a comment

some retrospective comments


return new Estimate(totalNullsCount.getValue() / totalRowsCount.getValue());
double nonNullCount = rowCount.getValue() * (1 - (nullsFraction.isValueUnknown() ? 0 : nullsFraction.getValue()));
return new Estimate(knownDataSize / knownNonNullRowCount * nonNullCount);
Contributor

Is it so much more complex than https://github.com/starburstdata/presto-private/blob/0a47179b9d0df73724863fba226372a31c9355f5/presto-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java#L381 ?
Why is that the case? Are there any practical examples where the code here is better/more accurate?

Is there an advantage in computing knownNonNullRowCount and knownDataSize vs using the totalRowsCount and nullsFraction estimates?
Keep in mind that we are still using rowCount and nullsFraction at the end, so we still risk measurement errors.

@sopel39 (Contributor) commented Aug 7, 2018

I see that we want to do a weighted average across partitions. This could still be done in a more streaming fashion.
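
One way to read the streaming suggestion -- a sketch using a hypothetical PartitionStats record in place of the real Presto types:

import java.util.Map;

public class StreamingWeightedAverage
{
    // Hypothetical stand-in for the per-partition statistics discussed above
    record PartitionStats(long rowCount, double averageColumnLength) {}

    static double weightedAverageLength(Map<String, PartitionStats> statistics)
    {
        long totalRows = 0;
        double totalBytes = 0;
        // Single pass: accumulate running totals instead of materializing an intermediate list
        for (PartitionStats stats : statistics.values()) {
            totalRows += stats.rowCount();
            totalBytes += stats.averageColumnLength() * stats.rowCount();
        }
        return totalRows == 0 ? Double.NaN : totalBytes / totalRows;
    }
}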

Member

This method was simplified as part of #11185. Instead of having avgColumnLength in HiveColumnStatistics we now have totalSizeInBytes. We convert avgColumnLength to totalSizeInBytes when pulling the statistics from the Hive metastore.
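
A sketch of what such a conversion could look like; the method name, parameters, and the non-null handling are assumptions here, not the actual #11185 code:

import java.util.OptionalDouble;
import java.util.OptionalLong;

public class TotalSizeConversion
{
    // Derive a total size in bytes from Hive's per-value average length
    // and the partition's row and null counts
    static OptionalLong totalSizeInBytes(OptionalDouble averageColumnLength, OptionalLong rowCount, OptionalLong nullsCount)
    {
        if (averageColumnLength.isPresent() && rowCount.isPresent() && nullsCount.isPresent()) {
            long nonNulls = Math.max(0, rowCount.getAsLong() - nullsCount.getAsLong());
            return OptionalLong.of(Math.round(averageColumnLength.getAsDouble() * nonNulls));
        }
        return OptionalLong.empty();
    }
}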

continue;
}
double partitionNonNullCount = partitionRowCount - partitionColumnStatistics.getNullsCount().orElse(0);
if (partitionNonNullCount < 0) {
Contributor

Probably you should cap it at 0 in such a case; otherwise you might end up with negative total data size estimates.
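
In code, the cap could be as simple as the following (a sketch reusing the names from the context above):

// Clamp at zero so inconsistent metastore stats (nullsCount > rowCount) cannot
// produce a negative non-null count and, downstream, a negative data size
double partitionNonNullCount = Math.max(0, partitionRowCount - partitionColumnStatistics.getNullsCount().orElse(0));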

Member

This code is no longer there

@findepi findepi deleted the epic/cbo/pr/data-size branch August 11, 2018 14:17