Compute dataSize for hive statistics #11043
rschlussel wants to merge 1 commit into prestodb:master
Conversation
double nonNulls = rowCount.getValue() - nullCount.orElse(0); ?
more importantly, you seem to subtract per-partition value (hiveColumnStatistics.getNullsCount()) from per-table value (rowCount). Please explain.
Did not realize that. Thanks, I'll fix it.
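A minimal sketch of the fix being discussed, with illustrative names (the point is that both values must come from the same partition's statistics):

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

final class PerPartitionNonNulls
{
    // Subtract a partition's nulls count from that same partition's row count;
    // mixing a table-level row count with a per-partition nulls count yields
    // a meaningless difference.
    static OptionalDouble nonNullsCount(OptionalLong partitionRowCount, OptionalLong partitionNullsCount)
    {
        if (!partitionRowCount.isPresent() || !partitionNullsCount.isPresent()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(partitionRowCount.getAsLong() - partitionNullsCount.getAsLong());
    }
}
```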
I would suggest changing the summarizePartitionStatistics signature. Currently it accepts a valueExtractFunction that takes HiveColumnStatistics and produces a Double. I would change it to take PartitionStatistics, so you can easily get the row count for a given partition.
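A sketch of the suggested signature change; the nested interfaces are simplified stand-ins for the Presto classes named above, not their real definitions:

```java
import java.util.Collection;
import java.util.OptionalDouble;
import java.util.OptionalLong;
import java.util.function.Function;

interface SummarizerSketch
{
    // Simplified stand-ins for the classes mentioned in the discussion.
    interface HiveBasicStatistics { OptionalLong getRowCount(); }
    interface PartitionStatistics { HiveBasicStatistics getBasicStatistics(); }

    // The extractor now receives the whole PartitionStatistics, so it can read
    // the per-partition row count alongside the column-level statistics.
    OptionalDouble summarizePartitionStatistics(
            Collection<PartitionStatistics> statistics,
            Function<PartitionStatistics, OptionalDouble> valueExtractFunction);
}
```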
It's easier to do without using summarizePartitionStatistics, because that has a second mapping step after the sum. I modeled it after rowCount instead.
DoubleStream::sum ?
Even then, partitions without stats for the given column (skipped in summarizePartitionStatistics), or partitions without rowCount and/or averageColumnLength (skipped here), will not be part of the computed stat value. So, probably, DoubleStream::average and then * number of all partitions?
I would suggest taking the sum and passing the total selected partition count. Then you can count the partitions with statistics and project the sum to the total number of partitions. We do that when estimating the row_count statistic.
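A minimal sketch of that sum-and-extrapolate step, assuming the per-partition data sizes have already been computed (names are illustrative):

```java
import java.util.List;
import java.util.OptionalDouble;

final class DataSizeExtrapolation
{
    // Sum over the partitions that have statistics, then project to the total
    // number of queried partitions, mirroring the row_count estimate.
    static OptionalDouble extrapolate(List<Double> knownPartitionDataSizes, long queriedPartitionsCount)
    {
        long partitionsWithStatsCount = knownPartitionDataSizes.size();
        if (partitionsWithStatsCount == 0) {
            return OptionalDouble.empty();
        }
        double knownSum = knownPartitionDataSizes.stream().mapToDouble(Double::doubleValue).sum();
        return OptionalDouble.of(knownSum / partitionsWithStatsCount * queriedPartitionsCount);
    }
}
```

Dividing the sum by the count of partitions with statistics gives an average per-partition size that is then scaled up, which is equivalent to the DoubleStream::average suggestion above.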
What about adding rangeStatistics.setDataSize(...) for partition keys?
Do we have column stats for the partitioning key? I left it out because I didn't think we have it.
We compute column stats for the partitioning keys -- since we know all the keys, we can do that.
I don't think we need rowCount (the total row count) here. We need it for the number of nulls, because there we are computing a fraction (totalNullsCount / totalRowsCount). Total data size is a sum. The formula should be sum(row_count_partition_N * avg_column_length_partition_N) over the selected partitions.
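A sketch of that formula, assuming every selected partition has both statistics (handling of partitions with missing values is discussed further down):

```java
final class DataSizeFormula
{
    // dataSize = sum over selected partitions p of rowCount_p * avgColumnLength_p
    static double dataSize(long[] rowCounts, double[] averageColumnLengths)
    {
        double sum = 0;
        for (int p = 0; p < rowCounts.length; p++) {
            sum += rowCounts[p] * averageColumnLengths[p];
        }
        return sum;
    }
}
```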
Force-pushed from 7429b9c to f428657
Compute the datasize for hive statistics when averageColumnLength is set.
Force-pushed from f428657 to 4cb3549
updated
```java
Estimate nullsFraction;
if (hiveColumnHandle.isPartitionKey()) {
    rangeStatistics.setDistinctValuesCount(countDistinctPartitionKeys(hiveColumnHandle, hivePartitions));
    rangeStatistics.setDataSize(calculateDataSize(partitionStatistics, columnName));
```
This is not correct. You have to compute it based on HivePartition#keys. partitionStatistics doesn't contain any information about the partition columns; only data columns are there.
I see. I misunderstood what @findepi meant.
@rschlussel2 my bad, I totally forgot. I had a PR in the queue for this. Let me rebase this code after the recent changes here and see how it goes.
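An illustrative sketch of computing a partition column's data size from the key values themselves; the helper and its inputs are hypothetical, not actual Presto code:

```java
import java.util.List;
import java.util.OptionalDouble;

final class PartitionKeyDataSize
{
    // A partition column holds one constant value per partition, so its data
    // size can be derived from the keys: sum over partitions of
    // rowCount_p * length(keyValue_p). Purely illustrative.
    static OptionalDouble dataSize(List<String> partitionKeyValues, long[] partitionRowCounts)
    {
        if (partitionKeyValues.size() != partitionRowCounts.length) {
            return OptionalDouble.empty();
        }
        double sum = 0;
        for (int p = 0; p < partitionKeyValues.size(); p++) {
            sum += partitionRowCounts[p] * partitionKeyValues.get(p).length();
        }
        return OptionalDouble.of(sum);
    }
}
```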
```java
{
    List<Double> knownPartitionDataSizes = statisticsByPartitionName.values().stream()
            .map(stats -> {
                OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength();
```
Statistics are optional. Some of the partitions may be missing the statistics for this column.
```java
            .map(stats -> {
                OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength();
                OptionalLong rowCount = stats.getBasicStatistics().getRowCount();
                OptionalLong nullsCount = stats.getColumnStatistics().get(column).getNullsCount();
```
Ditto. Just add a filter(stats -> stats.getColumnStatistics().containsKey(column)).
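A sketch of that guard, assuming getColumnStatistics() returns a Map keyed by column name (the stand-in types below are simplified, not the real Presto classes):

```java
import java.util.Map;
import java.util.OptionalDouble;
import java.util.stream.Stream;

final class MissingColumnStatsFilter
{
    interface ColumnStats { OptionalDouble getAverageColumnLength(); }
    interface PartitionStats { Map<String, ColumnStats> getColumnStatistics(); }

    // Drop partitions with no entry for the column before dereferencing it,
    // so the later get(column) call cannot return null.
    static Stream<PartitionStats> withStatsFor(Stream<PartitionStats> partitions, String column)
    {
        return partitions.filter(stats -> stats.getColumnStatistics().containsKey(column));
    }
}
```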
```java
                    return OptionalDouble.empty();
                }

                long nonNullsCount = rowCount.getAsLong() - nullsCount.orElse(0);
```
nullsCount.orElse(0) - this is not correct. If there is no information about the number of nulls, we should return empty.
I'm not sure. I guess it's a question of whether heuristically it's better to assume a random default data size, or to assume no nulls and just multiply the average column length by the number of rows. I also think it's super unlikely that there will be a situation where you have the average column length but not the number of nulls, but that's beside the point.
From my perspective, if you don't know the number of nulls, it is very similar to the situation when you don't know the number of rows. The null fraction can be pretty high, so the size estimate may be way off. I don't have a strong opinion though.
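A sketch of the conservative option, with illustrative per-partition inputs:

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

final class ConservativeDataSize
{
    // Treat an unknown nulls count like an unknown row count: return no
    // estimate instead of assuming zero nulls and inflating the data size.
    static OptionalDouble partitionDataSize(OptionalLong rowCount, OptionalLong nullsCount, OptionalDouble averageColumnLength)
    {
        if (!rowCount.isPresent() || !nullsCount.isPresent() || !averageColumnLength.isPresent()) {
            return OptionalDouble.empty();
        }
        long nonNullsCount = rowCount.getAsLong() - nullsCount.getAsLong();
        return OptionalDouble.of(nonNullsCount * averageColumnLength.getAsDouble());
    }
}
```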
```java
        double knownPartitionDataSizesSum = knownPartitionDataSizes.stream().mapToDouble(a -> a).sum();
        long partitionsWithStatsCount = knownPartitionDataSizes.size();
        long allPartitionsCount = statisticsByPartitionName.size();
```
This is not necessarily correct. Statistics for some partitions might be missing. Pass hivePartitions.size() as the queriedPartitionsCount instead.
Good point. I just copied this from the row count calculations, so it looks like we do that wrong too. I'll add a fix.
```java
        if (partitionsWithStatsCount == 0) {
            return Estimate.unknownValue();
        }
        return new Estimate(knownPartitionDataSizesSum / partitionsWithStatsCount * allPartitionsCount);
```
nit: (knownPartitionDataSizesSum * allPartitionsCount) / partitionsWithStatsCount in case knownPartitionDataSizesSum is very small and partitionsWithStatsCount is very big
Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired. Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
#11107 was merged instead