Fix ANALYZE when Hive partition has non-canonical value #24973
hantangwangd merged 1 commit into prestodb:master
ac05613 to e7968cc
hantangwangd left a comment:
Thanks for this fix. Found an issue still pending, please LMK if I got anything wrong.
d5ead55 to cca4ce2
hantangwangd left a comment:
Thanks for the fix and the newly added test case, mostly looks good to me, just one little thing.
On this code:

    verify(usedComputedStatistics == computedStatistics.size(),
            "There are multiple variants of the same partition, e.g. p=1, p=01, p=001. All partitions must follow the same key=value representation");

Suggested change:

    verify(usedComputedStatistics == computedStatistics.size(),
            usedComputedStatistics > computedStatistics.size() ?
                    "There are multiple variants of the same partition, e.g. p=1, p=01, p=001. All partitions must follow the same key=value representation" :
                    "All computed statistics must be used");
Do you think it makes sense to throw the new exception message only when usedComputedStatistics > computedStatistics.size()? We're not entirely sure whether other scenarios could cause partition value mismatches, so IMO it may be better to keep throwing the original exception message in the other cases. What's your opinion?
Sounds good. Do you agree with the content of the new error message?
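To make the two failure cases in this thread concrete, here is a minimal, self-contained sketch in plain Java. It is not the code from this PR: the class StatisticsUsageCheck and the methods countUsed and failureMessage are hypothetical names, and the matching logic (looking up each partition's canonical value in a map of computed statistics) is an assumption made for illustration. Only the two message strings are taken from the suggestion above.

```java
import java.util.List;
import java.util.Map;

public class StatisticsUsageCheck {
    // Hypothetical: counts how many partitions found a matching entry in the
    // computed statistics, keyed here by the canonical partition value.
    static int countUsed(List<String> canonicalPartitionValues, Map<String, Long> computedStatistics) {
        int used = 0;
        for (String value : canonicalPartitionValues) {
            if (computedStatistics.containsKey(value)) {
                used++;
            }
        }
        return used;
    }

    // Returns null when the counts agree, otherwise the message for the failing case.
    static String failureMessage(int used, int computed) {
        if (used == computed) {
            return null;
        }
        return used > computed
                ? "There are multiple variants of the same partition, e.g. p=1, p=01, p=001. "
                        + "All partitions must follow the same key=value representation"
                : "All computed statistics must be used";
    }
}
```

Under these assumptions, three partitions p=1, p=01 and p=001 all canonicalize to 1 and all match the single computed entry, so the used count (3) exceeds the number of computed statistics (1) and the new message applies. A used count below the computed count keeps the original message.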
cca4ce2 to 3532c15
Description
Extracted from trinodb/trino#15995
In Hive, a partition value may be written by the writer process as a string, e.g. month=02, even though the column is registered in Hive as an integer.
When updating the table or running ANALYZE, Presto's statistics computation produces the value 2 for the partition in the example above, which results in the following error: All computed statistics must be used.
Motivation and Context
While performing ANALYZE on the following partitioned dataset:
store_sales/d_year=2025/d_month=01/d_day=10/d_hour=00
the following exception occurs:
This PR addresses the above-mentioned issue by parsing the partition values into Presto values, so that computed statistics are no longer ignored.
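The idea of parsing partition values into their typed form can be sketched as follows. This is an illustrative, standalone example rather than the PR's implementation: the class PartitionCanonicalizer, the method canonicalize, and the use of a plain set of integer column names are all assumptions, and only integer-typed keys are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PartitionCanonicalizer {
    // Hypothetical helper: rewrites a "key=value/key=value" partition name so that
    // values of integer-typed columns use their canonical form (e.g. "02" becomes "2").
    static String canonicalize(String partitionName, Set<String> integerColumns) {
        List<String> segments = new ArrayList<>();
        for (String segment : partitionName.split("/")) {
            int eq = segment.indexOf('=');
            if (eq < 0) {
                // Not a key=value segment (e.g. the table directory); keep it unchanged.
                segments.add(segment);
                continue;
            }
            String key = segment.substring(0, eq);
            String value = segment.substring(eq + 1);
            if (integerColumns.contains(key)) {
                // Parsing and re-printing drops leading zeros.
                value = Long.toString(Long.parseLong(value));
            }
            segments.add(key + "=" + value);
        }
        return String.join("/", segments);
    }
}
```

With d_year, d_month, d_day and d_hour treated as integer columns, the path from the example above becomes store_sales/d_year=2025/d_month=1/d_day=10/d_hour=0, which is the form the computed statistics would be keyed by under this assumption.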
Test Plan
Added test method testAnalyzePartitionedTableWithNonCanonicalValues
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.