Compute dataSize for hive statistics #11043
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -101,6 +101,7 @@ public TableStatistics getTableStatistics(ConnectorSession session, ConnectorTab | |
| Estimate nullsFraction; | ||
| if (hiveColumnHandle.isPartitionKey()) { | ||
| rangeStatistics.setDistinctValuesCount(countDistinctPartitionKeys(hiveColumnHandle, hivePartitions)); | ||
| rangeStatistics.setDataSize(calculateDataSize(partitionStatistics, columnName)); | ||
| nullsFraction = calculateNullsFractionForPartitioningKey(hiveColumnHandle, hivePartitions, partitionStatistics); | ||
| if (isLowHighSupportedForType(prestoType)) { | ||
| lowValueCandidates = hivePartitions.stream() | ||
|
|
@@ -114,6 +115,7 @@ public TableStatistics getTableStatistics(ConnectorSession session, ConnectorTab | |
| } | ||
| else { | ||
| rangeStatistics.setDistinctValuesCount(calculateDistinctValuesCount(partitionStatistics, columnName)); | ||
|
||
| rangeStatistics.setDataSize(calculateDataSize(partitionStatistics, columnName)); | ||
| nullsFraction = calculateNullsFraction(partitionStatistics, columnName, rowCount); | ||
|
|
||
| if (isLowHighSupportedForType(prestoType)) { | ||
|
|
@@ -212,6 +214,34 @@ private Estimate calculateDistinctValuesCount(Map<String, PartitionStatistics> s | |
| DoubleStream::max); | ||
| } | ||
|
|
||
| private Estimate calculateDataSize(Map<String, PartitionStatistics> statisticsByPartitionName, String column) | ||
| { | ||
| List<Double> knownPartitionDataSizes = statisticsByPartitionName.values().stream() | ||
| .map(stats -> { | ||
| OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength(); | ||
|
Member
Statistics are optional. Some partitions may be missing the statistics for this column. |
||
| OptionalLong rowCount = stats.getBasicStatistics().getRowCount(); | ||
| OptionalLong nullsCount = stats.getColumnStatistics().get(column).getNullsCount(); | ||
|
Member
Ditto. Just add a |
||
| if (!averageColumnLength.isPresent() || !rowCount.isPresent()) { | ||
| return OptionalDouble.empty(); | ||
| } | ||
|
|
||
| long nonNullsCount = rowCount.getAsLong() - nullsCount.orElse(0); | ||
|
Member
Contributor
Author
I'm not sure. I guess the question is whether, heuristically, it is better to assume an arbitrary default data size, or to assume no nulls and just multiply the average column length by the number of rows. I also think it's very unlikely that we would have the average column length but not the number of nulls, but that's beside the point.
Member
From my perspective, not knowing the number of nulls is very similar to not knowing the number of rows. The null fraction can be pretty high, so the size estimate may be way off. I don't have a strong opinion though. |
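To make the trade-off in this thread concrete, here is a minimal sketch of the two options for a single partition. The class and method names are hypothetical and are not part of this PR; `lenientDataSize` mirrors the `nullsCount.orElse(0)` behavior in the diff, while `strictDataSize` treats a missing nulls count the same way as a missing row count.

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

// Hypothetical illustration only; none of these names exist in the PR.
final class PartitionDataSizeSketch
{
    private PartitionDataSizeSketch() {}

    // Strict option: the estimate is unknown unless all three inputs are known.
    static OptionalDouble strictDataSize(OptionalDouble averageColumnLength, OptionalLong rowCount, OptionalLong nullsCount)
    {
        if (!averageColumnLength.isPresent() || !rowCount.isPresent() || !nullsCount.isPresent()) {
            return OptionalDouble.empty();
        }
        long nonNullsCount = rowCount.getAsLong() - nullsCount.getAsLong();
        return OptionalDouble.of(averageColumnLength.getAsDouble() * nonNullsCount);
    }

    // Lenient option: a missing nulls count is treated as zero nulls,
    // which can overestimate the size when the null fraction is high.
    static OptionalDouble lenientDataSize(OptionalDouble averageColumnLength, OptionalLong rowCount, OptionalLong nullsCount)
    {
        return strictDataSize(averageColumnLength, rowCount, OptionalLong.of(nullsCount.orElse(0)));
    }
}
```

With an average column length of 4.0, 100 rows, and an unknown nulls count, the lenient variant estimates 400 bytes and the strict variant reports no estimate.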
||
| return OptionalDouble.of(averageColumnLength.getAsDouble() * nonNullsCount); | ||
| }) | ||
| .filter(OptionalDouble::isPresent) | ||
| .map(OptionalDouble::getAsDouble) | ||
| .collect(toImmutableList()); | ||
|
|
||
| double knownPartitionDataSizesSum = knownPartitionDataSizes.stream().mapToDouble(a -> a).sum(); | ||
| long partitionsWithStatsCount = knownPartitionDataSizes.size(); | ||
| long allPartitionsCount = statisticsByPartitionName.size(); | ||
|
Member
This is not necessarily correct. Statistics for some partitions might be missing. Pass a
Contributor
Author
Good point. I just copied this from the row count calculations, so it looks like we do that wrong too. I'll add a fix. |
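As a rough illustration of the concern in this thread, the sketch below extrapolates from the partitions that have statistics to the full partition set, assuming the caller supplies the total number of partitions explicitly instead of deriving it from the size of the statistics map. The class name, method name, and `totalPartitionsCount` parameter are hypothetical.

```java
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical illustration only; none of these names exist in the PR.
final class DataSizeExtrapolationSketch
{
    private DataSizeExtrapolationSketch() {}

    // Scales the average known per-partition data size up to all partitions.
    // Returns empty when no partition has a known data size.
    static OptionalDouble extrapolate(List<Double> knownPartitionDataSizes, int totalPartitionsCount)
    {
        if (knownPartitionDataSizes.isEmpty()) {
            return OptionalDouble.empty();
        }
        if (totalPartitionsCount < knownPartitionDataSizes.size()) {
            throw new IllegalArgumentException("totalPartitionsCount is smaller than the number of partitions with statistics");
        }
        double knownSum = knownPartitionDataSizes.stream().mapToDouble(Double::doubleValue).sum();
        double averagePerPartition = knownSum / knownPartitionDataSizes.size();
        return OptionalDouble.of(averagePerPartition * totalPartitionsCount);
    }
}
```

For example, two known partition sizes of 100 and 300 bytes out of four partitions in total give an estimate of 800 bytes. If the total were taken from a map that only holds those two partitions, the estimate would stay at 400 bytes.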
||
|
|
||
| if (partitionsWithStatsCount == 0) { | ||
| return Estimate.unknownValue(); | ||
| } | ||
| return new Estimate(knownPartitionDataSizesSum / partitionsWithStatsCount * allPartitionsCount); | ||
|
Member
nit: |
||
| } | ||
|
|
||
| private Estimate calculateNullsFraction(Map<String, PartitionStatistics> statisticsByPartitionName, String column, Estimate totalRowsCount) | ||
| { | ||
| Estimate totalNullsCount = summarizePartitionStatistics( | ||
|
|
||
This is not correct. You have to compute it based on HivePartition#keys. partitionStatistics doesn't contain any information about the partition columns; only data columns are there.
I see. I misunderstood what @findepi meant.
@rschlussel2 my bad, I totally forgot. I had a PR in the queue for this. Let me rebase that code after the recent changes here and see how it goes.
#11107
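For readers following the partition key discussion, here is one possible way to derive a data size from the partition key values themselves rather than from column statistics. It is a sketch under stated assumptions and does not claim to match what #11107 does: `PartitionKey` is a hypothetical stand-in for a partition's key value plus its row count, the value is measured as UTF-8 bytes, and NULL keys are assumed to contribute no bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.OptionalDouble;
import java.util.OptionalLong;

// Hypothetical illustration only; none of these names exist in the PR.
final class PartitionKeyDataSizeSketch
{
    private PartitionKeyDataSizeSketch() {}

    // Stand-in for one partition: its key value (null for a NULL key) and its row count, if known.
    static final class PartitionKey
    {
        final String value;
        final OptionalLong rowCount;

        PartitionKey(String value, OptionalLong rowCount)
        {
            this.value = value;
            this.rowCount = rowCount;
        }
    }

    // Sums (key length in bytes * row count) over all partitions.
    // Returns empty if a non-NULL key belongs to a partition with an unknown row count.
    static OptionalDouble dataSize(List<PartitionKey> partitions)
    {
        double total = 0;
        for (PartitionKey partition : partitions) {
            if (partition.value == null) {
                continue; // assumption: NULL keys add no bytes
            }
            if (!partition.rowCount.isPresent()) {
                return OptionalDouble.empty();
            }
            int keyLength = partition.value.getBytes(StandardCharsets.UTF_8).length;
            total += (double) keyLength * partition.rowCount.getAsLong();
        }
        return OptionalDouble.of(total);
    }
}
```

Every row in a partition repeats the same key value, so the key's byte length times the partition's row count gives that partition's contribution without needing any column statistics.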