Compute dataSize for hive statistics #11043
rschlussel wants to merge 1 commit into prestodb:master
Conversation
double nonNulls = rowCount.getValue() - nullCount.orElse(0); ?
more importantly, you seem to subtract per-partition value (hiveColumnStatistics.getNullsCount()) from per-table value (rowCount). Please explain.
Did not realize that. Thanks, I'll fix it.
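A minimal sketch of the fix being discussed, with illustrative names (the point is that both values must come from the same partition's statistics):

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

final class PerPartitionNonNulls
{
    // Subtract a partition's nulls count from that same partition's row count;
    // mixing a table-level row count with a per-partition nulls count yields
    // a meaningless difference.
    static OptionalDouble nonNullsCount(OptionalLong partitionRowCount, OptionalLong partitionNullsCount)
    {
        if (!partitionRowCount.isPresent() || !partitionNullsCount.isPresent()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(partitionRowCount.getAsLong() - partitionNullsCount.getAsLong());
    }
}
```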
I would suggest changing the summarizePartitionStatistics signature. Currently it accepts a valueExtractFunction that takes HiveColumnStatistics and produces a Double. I would change it to take PartitionStatistics, so you can easily get the row count for a given partition.
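A sketch of the suggested signature change; the nested interfaces are simplified stand-ins for the Presto classes named above, not their real definitions:

```java
import java.util.Collection;
import java.util.OptionalDouble;
import java.util.OptionalLong;
import java.util.function.Function;

interface SummarizerSketch
{
    // Simplified stand-ins for the classes mentioned in the discussion.
    interface HiveBasicStatistics { OptionalLong getRowCount(); }
    interface PartitionStatistics { HiveBasicStatistics getBasicStatistics(); }

    // The extractor now receives the whole PartitionStatistics, so it can read
    // the per-partition row count alongside the column-level statistics.
    OptionalDouble summarizePartitionStatistics(
            Collection<PartitionStatistics> statistics,
            Function<PartitionStatistics, OptionalDouble> valueExtractFunction);
}
```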
It's easier to do without using summarizePartitionStatistics, because that has a second mapping step after the sum. I modeled it after rowCount instead.
DoubleStream::sum ?
Even then, partitions without stats for the given column (skipped in summarizePartitionStatistics), or partitions without rowCount and/or averageColumnLength (skipped here), will not be part of the computed stat value. So, probably, DoubleStream::average and then * number of all partitions?
I would suggest taking the sum and passing the total selected partition count. Then you can count the partitions with statistics and project the sum to the total number of partitions. We do that when estimating the row_count statistic.
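A minimal sketch of that sum-and-extrapolate step, assuming the per-partition data sizes have already been computed (names are illustrative):

```java
import java.util.List;
import java.util.OptionalDouble;

final class DataSizeExtrapolation
{
    // Sum over the partitions that have statistics, then project to the total
    // number of queried partitions, mirroring the row_count estimate.
    static OptionalDouble extrapolate(List<Double> knownPartitionDataSizes, long queriedPartitionsCount)
    {
        long partitionsWithStatsCount = knownPartitionDataSizes.size();
        if (partitionsWithStatsCount == 0) {
            return OptionalDouble.empty();
        }
        double knownSum = knownPartitionDataSizes.stream().mapToDouble(Double::doubleValue).sum();
        return OptionalDouble.of(knownSum / partitionsWithStatsCount * queriedPartitionsCount);
    }
}
```

Dividing the sum by the count of partitions with statistics gives an average per-partition size that is then scaled up, which is equivalent to the DoubleStream::average suggestion above.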
What about adding rangeStatistics.setDataSize(...) for partition keys?
Do we have column stats for the partitioning key? I left it out because I didn't think we have it.
We compute column stats for the partitioning keys -- since we know all the keys, we can do that.
I don't think we need rowCount (the total row count) here. We need it for the number of nulls, because there we are computing a fraction (totalNullsCount / totalRowsCount). Total data size is a sum. The formula should be sum(row_count_partition_N * avg_column_length_partition_N) over the selected partitions.
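A sketch of that formula, assuming every selected partition has both statistics (handling of partitions with missing values is discussed further down):

```java
final class DataSizeFormula
{
    // dataSize = sum over selected partitions p of rowCount_p * avgColumnLength_p
    static double dataSize(long[] rowCounts, double[] averageColumnLengths)
    {
        double sum = 0;
        for (int p = 0; p < rowCounts.length; p++) {
            sum += rowCounts[p] * averageColumnLengths[p];
        }
        return sum;
    }
}
```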
Force-pushed from 7429b9c to f428657
Compute the datasize for hive statistics when averageColumnLength is set.
Force-pushed from f428657 to 4cb3549
updated
```java
Estimate nullsFraction;
if (hiveColumnHandle.isPartitionKey()) {
    rangeStatistics.setDistinctValuesCount(countDistinctPartitionKeys(hiveColumnHandle, hivePartitions));
    rangeStatistics.setDataSize(calculateDataSize(partitionStatistics, columnName));
```
This is not correct. You have to compute it based on HivePartition#keys. partitionStatistics doesn't contain any information about the partition columns; only data columns are there.
I see. I misunderstood what @findepi meant.
@rschlussel2 my bad, I totally forgot. I had a PR in the queue for this. Let me rebase this code after the recent changes here and see how it goes.
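An illustrative sketch of computing a partition column's data size from the key values themselves; the helper and its inputs are hypothetical, not actual Presto code:

```java
import java.util.List;
import java.util.OptionalDouble;

final class PartitionKeyDataSize
{
    // A partition column holds one constant value per partition, so its data
    // size can be derived from the keys: sum over partitions of
    // rowCount_p * length(keyValue_p). Purely illustrative.
    static OptionalDouble dataSize(List<String> partitionKeyValues, long[] partitionRowCounts)
    {
        if (partitionKeyValues.size() != partitionRowCounts.length) {
            return OptionalDouble.empty();
        }
        double sum = 0;
        for (int p = 0; p < partitionKeyValues.size(); p++) {
            sum += partitionRowCounts[p] * partitionKeyValues.get(p).length();
        }
        return OptionalDouble.of(sum);
    }
}
```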
```java
{
    List<Double> knownPartitionDataSizes = statisticsByPartitionName.values().stream()
            .map(stats -> {
                OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength();
```
Statistics are optional. Some of the partitions may be missing the statistics for this column.
```java
            .map(stats -> {
                OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength();
                OptionalLong rowCount = stats.getBasicStatistics().getRowCount();
                OptionalLong nullsCount = stats.getColumnStatistics().get(column).getNullsCount();
```
Ditto. Just add a filter(stats -> stats.getColumnStatistics().containsKey(column)).
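A sketch of that guard, assuming getColumnStatistics() returns a Map keyed by column name (the stand-in types below are simplified, not the real Presto classes):

```java
import java.util.Map;
import java.util.OptionalDouble;
import java.util.stream.Stream;

final class MissingColumnStatsFilter
{
    interface ColumnStats { OptionalDouble getAverageColumnLength(); }
    interface PartitionStats { Map<String, ColumnStats> getColumnStatistics(); }

    // Drop partitions with no entry for the column before dereferencing it,
    // so the later get(column) call cannot return null.
    static Stream<PartitionStats> withStatsFor(Stream<PartitionStats> partitions, String column)
    {
        return partitions.filter(stats -> stats.getColumnStatistics().containsKey(column));
    }
}
```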
```java
                    return OptionalDouble.empty();
                }

                long nonNullsCount = rowCount.getAsLong() - nullsCount.orElse(0);
```
nullsCount.orElse(0) - this is not correct. If there is no information about the number of nulls, we should return empty.
I'm not sure. I guess it's a question of whether heuristically it's better to assume a random default data size, or to assume no nulls and just multiply the average column length by the number of rows. I also think it's super unlikely that there will be a situation where you have the average column length but not the number of nulls, but that's beside the point.
From my perspective, if you don't know the number of nulls, it is very similar to the situation when you don't know the number of rows. The null fraction can be pretty high, so the size estimate may be way off. I don't have a strong opinion though.
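A sketch of the conservative option, with illustrative per-partition inputs:

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

final class ConservativeDataSize
{
    // Treat an unknown nulls count like an unknown row count: return no
    // estimate instead of assuming zero nulls and inflating the data size.
    static OptionalDouble partitionDataSize(OptionalLong rowCount, OptionalLong nullsCount, OptionalDouble averageColumnLength)
    {
        if (!rowCount.isPresent() || !nullsCount.isPresent() || !averageColumnLength.isPresent()) {
            return OptionalDouble.empty();
        }
        long nonNullsCount = rowCount.getAsLong() - nullsCount.getAsLong();
        return OptionalDouble.of(nonNullsCount * averageColumnLength.getAsDouble());
    }
}
```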
```java
        double knownPartitionDataSizesSum = knownPartitionDataSizes.stream().mapToDouble(a -> a).sum();
        long partitionsWithStatsCount = knownPartitionDataSizes.size();
        long allPartitionsCount = statisticsByPartitionName.size();
```
This is not necessarily correct. Statistics for some partitions might be missing. Pass hivePartitions.size() as the queriedPartitionsCount instead.
Good point. I just copied this from the row count calculations, so it looks like we do that wrong too. I'll add a fix.
```java
        if (partitionsWithStatsCount == 0) {
            return Estimate.unknownValue();
        }
        return new Estimate(knownPartitionDataSizesSum / partitionsWithStatsCount * allPartitionsCount);
```
nit: (knownPartitionDataSizesSum * allPartitionsCount) / partitionsWithStatsCount in case knownPartitionDataSizesSum is very small and partitionsWithStatsCount is very big
Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired. Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
#11107 was merged instead