Fix ANALYZE when Hive partition has non-canonical value #15995
ebyhr merged 2 commits into trinodb:master
Change to something like "Fix ANALYZE when Hive partition has non-canonical value"
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Do we need that in finishCreateTable flow?
I think all partition names and value lists are canonical (because we created them).
Maybe we don't need this, but then it's worth adding a code comment explaining why.
Good point. I assigned the values to a canonicalPartitionValues variable to underline the fact that the partition values are already canonical.
plugin/trino-hive/src/test/java/io/trino/plugin/hive/BaseTestHiveOnDataLake.java
We're relying on sync_partition_metadata behavior, which is implicit and subject to change.
In fact, sync_partition_metadata could canonicalize the values inferred from the storage, why not?
There should be some onHive way to get into this non-canonical partition state.
Or we could use the HMS client directly (product tests can do that).
I rewrote the test to the form:
@Test
public void testAnalyzePartitionedTableWithCanonicalization()
{
    String tableName = "test_analyze_table_canonicalization_" + randomNameSuffix();
    assertUpdate("CREATE TABLE %s (a_varchar varchar, month integer) WITH (partitioned_by = ARRAY['month'])".formatted(getFullyQualifiedTestTableName(tableName)));
    assertUpdate("INSERT INTO " + getFullyQualifiedTestTableName(tableName) + " VALUES ('A', 1), ('AA', 1), ('B', 2), ('M', 10)", 4);
    // Simulate the fact that the Hive table has non-canonical partition values
    hiveMinioDataLake.getHiveHadoop().runOnMetastore(
            """
            UPDATE PARTITIONS
            SET part_name='month=01'
            WHERE
                TBL_ID IN (SELECT tbl_id FROM TBLS t INNER JOIN DBS db ON t.db_id = db.db_id WHERE db.name = '%s' and t.tbl_name = '%s') AND
                PART_NAME='month=1'
            """.formatted(HIVE_TEST_SCHEMA, tableName));
    assertQuery("SELECT * FROM " + HIVE_TEST_SCHEMA + ".\"" + tableName + "$partitions\"", "VALUES 1, 2, 10");
    assertUpdate("ANALYZE " + getFullyQualifiedTestTableName(tableName), 4);
    assertQuery("SHOW STATS FOR " + getFullyQualifiedTestTableName(tableName),
            """
            VALUES
                ('a_varchar', 5.0, 2.0, 0.0, null, null, null),
                ('month', null, 3.0, 0.0, null, 1, 10),
                (null, null, null, null, 4.0, null, null)
            """);
    assertUpdate("INSERT INTO " + getFullyQualifiedTestTableName(tableName) + " VALUES ('C', 3)", 1);
    hiveMinioDataLake.getHiveHadoop().runOnMetastore(
            """
            UPDATE PARTITIONS
            SET part_name='month=03'
            WHERE
                TBL_ID IN (SELECT tbl_id FROM TBLS t INNER JOIN DBS db ON t.db_id = db.db_id WHERE db.name = '%s' and t.tbl_name = '%s') AND
                PART_NAME='month=3'
            """.formatted(HIVE_TEST_SCHEMA, tableName));
    assertUpdate("ANALYZE " + getFullyQualifiedTestTableName(tableName) + " WITH (partitions = ARRAY[ARRAY['03']])", 1);
    assertQuery("SHOW STATS FOR " + getFullyQualifiedTestTableName(tableName),
            """
            VALUES
                ('a_varchar', 6.0, 2.0, 0.0, null, null, null),
                ('month', null, 4.0, 0.0, null, 1, 10),
                (null, null, null, null, 5.0, null, null)
            """);
    // TODO (https://github.com/trinodb/trino/issues/15998) fix selective ANALYZE for table with non-canonical partition values
    assertQueryFails("ANALYZE " + getFullyQualifiedTestTableName(tableName) + " WITH (partitions = ARRAY[ARRAY['3']])", "Partition no longer exists: month=3");
    assertUpdate("DROP TABLE " + getFullyQualifiedTestTableName(tableName));
}
but unfortunately we stumble on
Caused by: io.trino.spi.TrinoException: Partition no longer exists: month=01
at io.trino.plugin.hive.HiveSplitManager.lambda$getPartitionMetadata$3(HiveSplitManager.java:323)
at com.google.common.collect.Iterators$6.transform(Iterators.java:829)
Probably related to #15998
We'd probably need the partitionTypes here to do something similar to io.trino.plugin.hive.HiveMetadata#canonicalizePartitionValues:
"month=01" would give us the raw partition value "01", which would then need to be parsed to "1".
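The canonicalization step suggested in the comment above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Trino's actual `canonicalizePartitionValues` implementation: the class `PartitionValueCanonicalizer`, the `ColumnType` enum, and both helper methods are hypothetical names introduced here, and the sketch assumes a single-column, unescaped partition name.

```java
public class PartitionValueCanonicalizer
{
    // Hypothetical, simplified stand-in for the partition column types.
    public enum ColumnType { INTEGER, VARCHAR }

    // Extracts the raw value from a single-column partition name, e.g. "month=01" -> "01".
    // Assumption: real Hive partition names may be multi-column and escaped; this sketch ignores that.
    public static String rawValue(String partitionName)
    {
        int separator = partitionName.indexOf('=');
        if (separator < 0) {
            throw new IllegalArgumentException("Not a partition name: " + partitionName);
        }
        return partitionName.substring(separator + 1);
    }

    // Parses the raw value according to the column type and renders it back in canonical form.
    public static String canonicalize(String rawValue, ColumnType type)
    {
        switch (type) {
            case INTEGER:
                // "01" parses to 1, which renders as "1"
                return Long.toString(Long.parseLong(rawValue));
            case VARCHAR:
                // strings are already in their canonical form
                return rawValue;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    public static void main(String[] args)
    {
        String raw = rawValue("month=01");
        // prints "01 -> 1"
        System.out.println(raw + " -> " + canonicalize(raw, ColumnType.INTEGER));
    }
}
```

The point of the sketch is only that the column type is required: without knowing that `month` is an integer, "01" and "1" cannot be recognized as the same partition value.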
Changed the code to avoid depending on the system.sync_partition_metadata procedure call; the test now adds the partitions of the table manually through the metastore client.
Description
In Hive, a writer process may well write a partition value in a non-canonical string form, e.g. `month=02`, even though the column is registered in Hive as an integer. When the table is updated or `ANALYZE` is run, however, the statistics computation in Trino outputs `2` for the partition from the example above.
While creating or changing table statistics, the partition values are parsed to Trino `NullableValue`s so that they match the partition column grouping values from the output of the statistics computation, independently of canonicalization.
Additional context and related issues
Assume the following scenario:
- partition values smaller than `10` are prefixed with `0`
- the partition column `month` has `integer` type
Running `ANALYZE` on such a table fails with an exception.
This PR addresses the above-mentioned issue by parsing the partition values to Trino values, so that computed statistics are no longer ignored.
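To illustrate why matching on parsed values helps, here is a small, self-contained sketch. It is not the code from this PR: `PartitionStatisticsMatcher` and its methods are hypothetical, a plain `Long` stands in for Trino's `NullableValue`, and the row counts are made-up illustration data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionStatisticsMatcher
{
    // Matches by comparing the metastore string with the canonical string form of the computed value.
    // A non-canonical metastore value such as "01" finds no match here.
    public static Map<String, Long> matchByString(List<String> metastoreValues, Map<Long, Long> computedRowCounts)
    {
        Map<String, Long> byCanonicalString = new LinkedHashMap<>();
        computedRowCounts.forEach((value, rowCount) -> byCanonicalString.put(Long.toString(value), rowCount));

        Map<String, Long> result = new LinkedHashMap<>();
        for (String metastoreValue : metastoreValues) {
            Long rowCount = byCanonicalString.get(metastoreValue);
            if (rowCount != null) {
                result.put(metastoreValue, rowCount);
            }
        }
        return result;
    }

    // Matches by parsing the metastore string to a typed value first,
    // so "01" and "1" both resolve to the same key.
    public static Map<String, Long> matchByParsedValue(List<String> metastoreValues, Map<Long, Long> computedRowCounts)
    {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String metastoreValue : metastoreValues) {
            Long rowCount = computedRowCounts.get(Long.parseLong(metastoreValue));
            if (rowCount != null) {
                result.put(metastoreValue, rowCount);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Partition values as stored in the metastore; "01" is non-canonical.
        List<String> metastoreValues = List.of("01", "2", "10");
        // Row counts keyed by the typed partition value (illustration data only).
        Map<Long, Long> computedRowCounts = Map.of(1L, 2L, 2L, 1L, 10L, 1L);

        // String matching drops the statistics for partition "01".
        System.out.println(matchByString(metastoreValues, computedRowCounts));
        // Matching on parsed values keeps statistics for all three partitions.
        System.out.println(matchByParsedValue(metastoreValues, computedRowCounts));
    }
}
```

In this sketch the string-based match silently loses the statistics for the `01` partition, which mirrors the "ignored computed statistics" problem described above.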
Release notes
( ) This is not user-visible or docs only and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: