Skip to content

Fix ANALYZE when Hive partition has non-canonical value#24973

Merged
hantangwangd merged 1 commit intoprestodb:masterfrom
denodo-research-labs:analyze_hive_partitions
Jun 6, 2025
Merged

Fix ANALYZE when Hive partition has non-canonical value#24973
hantangwangd merged 1 commit intoprestodb:masterfrom
denodo-research-labs:analyze_hive_partitions

Conversation

@denodo-research-labs
Copy link
Contributor

@denodo-research-labs denodo-research-labs commented Apr 24, 2025

Description

Extracted from trinodb/trino#15995

In Hive it may well happen that a partition value is written by the writer process as a string,
e.g. : month=02, even though the column is registered in Hive as an integer.

When updating the table or when doing ANALYZE, the output in Presto of the statistics computation though for the partition
from the example above will be though 2, ending in the the following error: All computed statistics must be used.

Motivation and Context

While performing ANALYZE on the following partitioned dataset:
store_sales/d_year=2025/d_month=01/d_day=10/d_hour=00

the following exception occurs:

com.google.common.base.VerifyException: All computed statistics must be used
at com.google.common.base.Verify.verify(Verify.java:126)
at com.facebook.presto.hive.HiveMetadata.finishStatisticsCollection(HiveMetadata.java:1552)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.finishStatisticsCollection(ClassLoaderSafeConnectorMetadata.java:210)
at com.facebook.presto.metadata.MetadataManager.finishStatisticsCollection(MetadataManager.java:761)

This PR addresses the above mentioned issue by parsing the partition values to Presto values in order to avoid ignoring computed statistics.

Test Plan

Added test method testAnalyzePartitionedTableWithNonCanonicalValues

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Fix incorrectly ignoring computed table statistics in `ANALYZE`


@denodo-research-labs denodo-research-labs requested a review from a team as a code owner April 24, 2025 11:45
@denodo-research-labs denodo-research-labs force-pushed the analyze_hive_partitions branch 3 times, most recently from ac05613 to e7968cc Compare April 24, 2025 17:05
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix. Found an issue still pending, please LMK if I got anything wrong.

@tdcmeehan tdcmeehan self-assigned this May 19, 2025
@denodo-research-labs denodo-research-labs force-pushed the analyze_hive_partitions branch 5 times, most recently from d5ead55 to cca4ce2 Compare June 5, 2025 10:50
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and the newly added test case, mostly looks good to me, just one little thing.

Comment on lines 1598 to 1599
verify(usedComputedStatistics == computedStatistics.size(),
"There are multiple variants of the same partition, e.g. p=1, p=01, p=001. All partitions must follow the same key=value representation");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
verify(usedComputedStatistics == computedStatistics.size(),
"There are multiple variants of the same partition, e.g. p=1, p=01, p=001. All partitions must follow the same key=value representation");
verify(usedComputedStatistics == computedStatistics.size(),
usedComputedStatistics > computedStatistics.size() ?
"There are multiple variants of the same partition, e.g. p=1, p=01, p=001. All partitions must follow the same key=value representation" :
"All computed statistics must be used");

Do you think it makes sense to throw the new exception message only if when usedComputedStatistics > computedStatistics.size()? Since we're not entirely sure whether there are other scenarios that could cause partition value mismatches, so IMO maybe it's better to still throw the original exception message in other cases. What's your opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Do you agree with the content of the new error message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks fine to me.

@hantangwangd hantangwangd merged commit f3e770a into prestodb:master Jun 6, 2025
162 of 164 checks passed
@prestodb-ci prestodb-ci mentioned this pull request Jul 28, 2025
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants