@@ -101,6 +101,7 @@ public TableStatistics getTableStatistics(ConnectorSession session, ConnectorTab
Estimate nullsFraction;
if (hiveColumnHandle.isPartitionKey()) {
rangeStatistics.setDistinctValuesCount(countDistinctPartitionKeys(hiveColumnHandle, hivePartitions));
rangeStatistics.setDataSize(calculateDataSize(partitionStatistics, columnName));
Member: This is not correct. You have to compute it based on HivePartition#keys. partitionStatistics don't contain any information about the partition columns. Only data columns are there.
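To illustrate the reviewer's point, here is a minimal sketch of deriving a partition column's data size from the partition key values themselves rather than from partitionStatistics. The types and names here are hypothetical simplifications, not the Presto SPI: each partition contributes its key's length times that partition's row count.

```java
import java.util.List;
import java.util.OptionalDouble;

public class PartitionKeyDataSize {
    // keys.get(i) is the partition key value of partition i (known from the
    // partition list itself); rowCounts.get(i) is that partition's row count.
    // Hypothetical helper, not part of the Presto codebase.
    public static OptionalDouble estimate(List<String> keys, List<Long> rowCounts) {
        if (keys.isEmpty() || keys.size() != rowCounts.size()) {
            return OptionalDouble.empty();
        }
        double total = 0;
        for (int i = 0; i < keys.size(); i++) {
            // Key value length serves as the per-row data size of the partition column.
            total += (double) keys.get(i).length() * rowCounts.get(i);
        }
        return OptionalDouble.of(total);
    }
}
```

The point is that no per-column statistics are needed: the key values are exact, so only row counts have to come from the metastore.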

Contributor Author: I see. I misunderstood what @findepi meant.

Contributor: @rschlussel2 my bad, I totally forgot. I had a PR in the queue for this. Let me rebase this code after the recent changes here and see how it goes.


nullsFraction = calculateNullsFractionForPartitioningKey(hiveColumnHandle, hivePartitions, partitionStatistics);
if (isLowHighSupportedForType(prestoType)) {
lowValueCandidates = hivePartitions.stream()
@@ -114,6 +115,7 @@ public TableStatistics getTableStatistics(ConnectorSession session, ConnectorTab
}
else {
rangeStatistics.setDistinctValuesCount(calculateDistinctValuesCount(partitionStatistics, columnName));
Contributor: What about adding rangeStatistics.setDataSize(...) for partition keys?

Contributor Author: Do we have column stats for the partitioning key? I left it out because I didn't think we had it.

Contributor: We compute column stats for the partitioning keys -- since we know all the keys, we can do that.
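A short sketch of why partition-key statistics are exact, as the comment above notes: every key value is known from the partition list, so the distinct-value count is just a distinct count over those values. This is a simplified illustration, not the actual countDistinctPartitionKeys implementation.

```java
import java.util.List;

public class DistinctPartitionKeys {
    // Since the queried partition list enumerates every partition key value,
    // the distinct-value count of a partition column needs no metastore stats.
    public static long countDistinct(List<String> partitionKeyValues) {
        return partitionKeyValues.stream().distinct().count();
    }
}
```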

rangeStatistics.setDataSize(calculateDataSize(partitionStatistics, columnName));
nullsFraction = calculateNullsFraction(partitionStatistics, columnName, rowCount);

if (isLowHighSupportedForType(prestoType)) {
@@ -212,6 +214,34 @@ private Estimate calculateDistinctValuesCount(Map<String, PartitionStatistics> s
DoubleStream::max);
}

private Estimate calculateDataSize(Map<String, PartitionStatistics> statisticsByPartitionName, String column)
{
List<Double> knownPartitionDataSizes = statisticsByPartitionName.values().stream()
.map(stats -> {
OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength();
Member: Statistics are optional. Some of the partitions may have the statistics for this column missing.

OptionalLong rowCount = stats.getBasicStatistics().getRowCount();
OptionalLong nullsCount = stats.getColumnStatistics().get(column).getNullsCount();
Member: Ditto. Just add a filter(stats -> stats.getColumnStatistics().containsKey(column)).
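The suggested filter can be sketched as follows. The Map<String, Long> here is a stand-in for the real per-partition column statistics; the point is only the shape of the stream pipeline: skip partitions that have no entry for the column before touching their stats.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterMissingStats {
    // Simplified stand-in: maps column name -> some per-partition statistic.
    public static List<Long> sizesFor(String column, List<Map<String, Long>> statsByPartition) {
        return statsByPartition.stream()
                // Drop partitions with no statistics entry for this column,
                // instead of calling get(column) and risking a null.
                .filter(stats -> stats.containsKey(column))
                .map(stats -> stats.get(column))
                .collect(Collectors.toList());
    }
}
```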

if (!averageColumnLength.isPresent() || !rowCount.isPresent()) {
return OptionalDouble.empty();
}

long nonNullsCount = rowCount.getAsLong() - nullsCount.orElse(0);
Member: nullsCount.orElse(0) -- this is not correct. If there is no information about the number of nulls, we should return empty.

Contributor Author: I'm not sure. I guess it's a question of whether, heuristically, it's better to assume a random default data size, or to assume no nulls and just multiply the average column length by the number of nulls. I also think it's super unlikely that there will be a situation where you have the average column length but not the number of nulls, but that's beside the point.

Member: From my perspective, if you don't know the number of nulls it is very similar to the situation when you don't know the number of rows. Null fraction can be pretty high, so the size estimate may be way off. I don't have a strong opinion though.
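The stricter interpretation argued for above can be sketched like this: treat a missing nulls count the same as a missing row count and return empty, rather than defaulting it to zero. This is an illustrative helper with simplified inputs, not the code under review.

```java
import java.util.OptionalDouble;
import java.util.OptionalLong;

public class StrictDataSize {
    // Per-partition data size estimate that requires ALL three inputs to be
    // present, per the reviewer's stricter reading; any gap yields empty
    // instead of a guess.
    public static OptionalDouble dataSize(OptionalDouble avgLen, OptionalLong rows, OptionalLong nulls) {
        if (!avgLen.isPresent() || !rows.isPresent() || !nulls.isPresent()) {
            return OptionalDouble.empty();
        }
        long nonNulls = rows.getAsLong() - nulls.getAsLong();
        return OptionalDouble.of(avgLen.getAsDouble() * nonNulls);
    }
}
```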

return OptionalDouble.of(averageColumnLength.getAsDouble() * nonNullsCount);
})
.filter(OptionalDouble::isPresent)
.map(OptionalDouble::getAsDouble)
.collect(toImmutableList());

double knownPartitionDataSizesSum = knownPartitionDataSizes.stream().mapToDouble(a -> a).sum();
long partitionsWithStatsCount = knownPartitionDataSizes.size();
long allPartitionsCount = statisticsByPartitionName.size();
Member: This is not necessarily correct. Statistics for some partitions might be missing. Pass a hivePartitions.size() as a queriedPartitionsCount instead.

Contributor Author: Good point. I just copied this from the row count calculations, so it looks like we do that wrong too. I'll add a fix.


if (partitionsWithStatsCount == 0) {
return Estimate.unknownValue();
}
return new Estimate(knownPartitionDataSizesSum / partitionsWithStatsCount * allPartitionsCount);
Member: nit: (knownPartitionDataSizesSum * allPartitionsCount) / partitionsWithStatsCount in case knownPartitionDataSizesSum is very small and partitionsWithStatsCount is very big
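The extrapolation the nit refers to can be isolated as a one-liner: scale the sum over partitions that have statistics up to all queried partitions, multiplying before dividing as the reviewer suggests. A hypothetical helper, shown only to make the formula concrete.

```java
public class Extrapolate {
    // Scales the data-size sum of the partitions that have statistics up to
    // the full set of queried partitions. Multiplying first, per the review
    // nit, keeps precision when knownSum is tiny and withStatsCount is large.
    public static double toAllPartitions(double knownSum, long withStatsCount, long allCount) {
        return (knownSum * allCount) / withStatsCount;
    }
}
```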

}

private Estimate calculateNullsFraction(Map<String, PartitionStatistics> statisticsByPartitionName, String column, Estimate totalRowsCount)
{
Estimate totalNullsCount = summarizePartitionStatistics(
Expand Down