Compute dataSize for hive statistics #11043

Closed

rschlussel wants to merge 1 commit into prestodb:master from rschlussel:data-size-stats

Conversation

@rschlussel
Contributor

Compute the datasize for hive statistics when averageColumnLength is
set.

Contributor

double nonNulls = rowCount.getValue() - nullCount.orElse(0); ?

More importantly, you seem to subtract a per-partition value (hiveColumnStatistics.getNullsCount()) from a per-table value (rowCount). Please explain.

Contributor Author

did not realize that. Thanks, I'll fix it.

Member

I would suggest changing the summarizePartitionStatistics signature. Currently it accepts a valueExtractFunction that takes HiveColumnStatistics and produces a Double. I would change it to take PartitionStatistics, so you can easily get the row count for a given partition.

Contributor Author

It's easier to do without summarizePartitionStatistics, because that has a second mapping step after the sum. I modeled this after rowCount instead.

Contributor

DoubleStream::sum?
Even then, partitions without stats for the given column (skipped in summarizePartitionStatistics) or partitions without rowCount and/or averageColumnLength (skipped here) will not be part of the computed stat value. So, probably, DoubleStream::average and then multiply by the number of all partitions?

Member

I would suggest taking the sum and passing the total selected partition count. Then you can count the partitions that have statistics and project the sum to the total number of partitions. We do that when estimating the row_count statistic.

Contributor

what about adding rangeStatistics.setDataSize(...) for partition keys?

Contributor Author

Do we have column stats for the partitioning key? I left it out because I didn't think we have it.

Contributor

We compute column stats for the partitioning keys -- since we know all the keys, we can do that.

Member

I don't think we need rowCount (the total row count) here. We need it for the number of nulls, because there we are computing a fraction (totalNullsCount / totalRowsCount). Total data size is a sum. The formula should be sum(row_count_partition_N * avg_column_length_partition_N) over the selected partitions.
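The formula in this comment can be sketched as follows. This is a minimal illustration with hypothetical class and method names, not the actual Presto code; partitions missing either the row count or the average column length are simply skipped:

```java
import java.util.List;
import java.util.OptionalDouble;
import java.util.OptionalLong;

// Hypothetical per-partition stats holder, for illustration only.
class PartitionColumnStats {
    final OptionalLong rowCount;
    final OptionalDouble averageColumnLength;

    PartitionColumnStats(OptionalLong rowCount, OptionalDouble averageColumnLength) {
        this.rowCount = rowCount;
        this.averageColumnLength = averageColumnLength;
    }
}

public class DataSizeSketch {
    // sum(row_count_partition_N * avg_column_length_partition_N) over the
    // selected partitions that have both values.
    static double totalDataSize(List<PartitionColumnStats> partitions) {
        return partitions.stream()
                .filter(p -> p.rowCount.isPresent() && p.averageColumnLength.isPresent())
                .mapToDouble(p -> p.rowCount.getAsLong() * p.averageColumnLength.getAsDouble())
                .sum();
    }

    public static void main(String[] args) {
        List<PartitionColumnStats> partitions = List.of(
                new PartitionColumnStats(OptionalLong.of(100), OptionalDouble.of(8.0)),
                new PartitionColumnStats(OptionalLong.of(50), OptionalDouble.of(4.0)),
                // Missing row count: this partition is excluded from the sum.
                new PartitionColumnStats(OptionalLong.empty(), OptionalDouble.of(4.0)));
        System.out.println(totalDataSize(partitions)); // 100*8.0 + 50*4.0 = 1000.0
    }
}
```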



Compute the datasize for hive statistics when averageColumnLength is
set.
@rschlussel
Contributor Author

updated

Estimate nullsFraction;
if (hiveColumnHandle.isPartitionKey()) {
rangeStatistics.setDistinctValuesCount(countDistinctPartitionKeys(hiveColumnHandle, hivePartitions));
rangeStatistics.setDataSize(calculateDataSize(partitionStatistics, columnName));
Member

This is not correct. You have to compute it based on HivePartition#keys. partitionStatistics doesn't contain any information about the partition columns; only the data columns are there.

Contributor Author

I see. I misunderstood what @findepi meant.

Contributor

@rschlussel2 my bad, I totally forgot. I had a PR in the queue for this. Let me rebase this code after the recent changes here and see how it goes.

Contributor

{
List<Double> knownPartitionDataSizes = statisticsByPartitionName.values().stream()
.map(stats -> {
OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength();
Member

Statistics are optional. Some of the partitions may have the statistics for this column missing.

.map(stats -> {
OptionalDouble averageColumnLength = stats.getColumnStatistics().get(column).getAverageColumnLength();
OptionalLong rowCount = stats.getBasicStatistics().getRowCount();
OptionalLong nullsCount = stats.getColumnStatistics().get(column).getNullsCount();
Member

Ditto. Just add a filter(stats -> stats.getColumnStatistics().containsKey(column)).
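The suggested filter could look roughly like this. This is a self-contained sketch where a plain Map of column name to average length stands in for the real column-statistics object (an assumption for illustration):

```java
import java.util.Map;
import java.util.stream.Stream;

public class FilterSketch {
    // Keep only partitions whose column-statistics map actually contains an
    // entry for the column, so a missing entry never causes a null lookup
    // in the subsequent map step.
    static Stream<Map<String, Double>> withColumn(
            Stream<Map<String, Double>> perPartitionColumnStats, String column) {
        return perPartitionColumnStats.filter(stats -> stats.containsKey(column));
    }

    public static void main(String[] args) {
        long count = withColumn(
                Stream.of(Map.of("a", 1.0), Map.of("b", 2.0)), "a").count();
        System.out.println(count); // only the first partition has stats for "a": 1
    }
}
```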

return OptionalDouble.empty();
}

long nonNullsCount = rowCount.getAsLong() - nullsCount.orElse(0);
Member

nullsCount.orElse(0) - this is not correct. If there is no information about the number of nulls, we should return empty.

Contributor Author

I'm not sure. I guess it's a question of whether it's heuristically better to assume a random default data size, or to assume no nulls and just multiply the average column length by the row count. I also think it's super unlikely that there would be a situation where you have the average column length but not the number of nulls, but that's beside the point.

Member

From my perspective if you don't know the number of nulls it is very similar to the situation when you don't know the number of rows. Null fraction can be pretty high, so the size estimate may be way off. I don't have a strong opinion though.


double knownPartitionDataSizesSum = knownPartitionDataSizes.stream().mapToDouble(a -> a).sum();
long partitionsWithStatsCount = knownPartitionDataSizes.size();
long allPartitionsCount = statisticsByPartitionName.size();
Member

This is not necessarily correct. Statistics for some partitions might be missing. Pass hivePartitions.size() as the queriedPartitionsCount instead.

Contributor Author

Good point. I just copied this from the row count calculations, so it looks like we do that wrong too. I'll add a fix.

if (partitionsWithStatsCount == 0) {
return Estimate.unknownValue();
}
return new Estimate(knownPartitionDataSizesSum / partitionsWithStatsCount * allPartitionsCount);
Member

nit: (knownPartitionDataSizesSum * allPartitionsCount) / partitionsWithStatsCount in case knownPartitionDataSizesSum is very small and partitionsWithStatsCount is very big
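The extrapolation being discussed, with the multiplication performed before the division as the nit recommends, might be sketched as follows (illustrative names only, not the actual Presto code):

```java
import java.util.List;
import java.util.OptionalDouble;

public class ExtrapolationSketch {
    // Project the data size observed for partitions that have statistics onto
    // all queried partitions. Multiplying before dividing avoids losing
    // precision when the sum is tiny and the partition count is huge.
    static OptionalDouble extrapolate(List<Double> knownPartitionDataSizes, long queriedPartitionsCount) {
        long partitionsWithStatsCount = knownPartitionDataSizes.size();
        if (partitionsWithStatsCount == 0) {
            return OptionalDouble.empty();
        }
        double sum = knownPartitionDataSizes.stream().mapToDouble(Double::doubleValue).sum();
        return OptionalDouble.of(sum * queriedPartitionsCount / partitionsWithStatsCount);
    }

    public static void main(String[] args) {
        // 2 partitions with stats totaling 300 bytes, 4 partitions queried.
        System.out.println(extrapolate(List.of(100.0, 200.0), 4)); // OptionalDouble[600.0]
    }
}
```

Passing the queried partition count explicitly (rather than statisticsByPartitionName.size()) matches the earlier review comment about partitions whose statistics are missing entirely.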

@facebook-github-bot
Collaborator

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.

Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status.

@facebook-github-bot
Collaborator

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@rschlussel
Contributor Author

#11107 was merged instead

@rschlussel rschlussel closed this Aug 10, 2018