Skip to content

Fix ANALYZE when Hive partition has non-canonical value#15995

Merged
ebyhr merged 2 commits intotrinodb:masterfrom
findinpath:findinpath/hive-analyze-canonicalization
Feb 27, 2023
Merged

Fix ANALYZE when Hive partition has non-canonical value#15995
ebyhr merged 2 commits intotrinodb:masterfrom
findinpath:findinpath/hive-analyze-canonicalization

Conversation

@findinpath
Copy link
Contributor

Description

In Hive it may well happen that a partition value
is written by the writer process as a string,
e.g. : month=02, even though the column is registered in Hive as an integer.
When updating the table or when doing ANALYZE, the output in Trino of the statistics computation though for the partition from the example above will be though 2.

While creating/changing table statistics, the partition values are parsed to Trino NullableValues in order to match, independently of canonicalization, the partition column grouping values from the output of the statistics computation.

Additional context and related issues

Assume the following scenario:

  • a Spark job outputs content grouped by month on the storage - the months smaller than 10 are prefixed with 0
  • an external table is created in hive using the integer type for the partition column month
  • while performing ANALYZE the following exception occurs:
com.google.common.base.VerifyException: All computed statistics must be used
at com.google.common.base.Verify.verify(Verify.java:126)
at io.trino.plugin.hive.HiveMetadata.finishStatisticsCollection(HiveMetadata.java:1307)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.finishStatisticsCollection(ClassLoaderSafeConnectorMetadata.java:158)
at io.trino.metadata.MetadataManager.finishStatisticsCollection(MetadataManager.java:854)

This PR addresses the above mentioned issue by parsing the partition values to Trino values in order to avoid ignoring computed statistics.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Hive
* Avoid ignoring computed table statistics in `ANALYZE` 

@cla-bot cla-bot bot added the cla-signed label Feb 6, 2023
@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch from d083fdb to 24eec9b Compare February 6, 2023 23:46
@findinpath findinpath marked this pull request as ready for review February 6, 2023 23:47
@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch 2 times, most recently from e7941f6 to 2f162b5 Compare February 7, 2023 11:25
@findinpath findinpath changed the title Parse partition values to Trino values while upserting table statistics Canonicalize partition values to Trino values while upserting table statistics Feb 7, 2023
@findinpath findinpath changed the title Canonicalize partition values to Trino values while upserting table statistics Canonicalize partition values while upserting table statistics Feb 7, 2023
@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch from 2f162b5 to 1f88aee Compare February 7, 2023 11:26
@findepi
Copy link
Member

findepi commented Feb 7, 2023

Canonicalize partition values while upserting table statistics

Change to something like "Fix ANALYZE when Hive partition has non-canonical value"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need that in finishCreateTable flow?

I think all partition names and value lists are canonical (because we created them).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we dont' need this, but then worth adding a code comment why we don't need this

Copy link
Contributor Author

@findinpath findinpath Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I used an assignment to a canonicalPartitionValues to underline the fact that the partition values are already canonical.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're relying on sync_partition_metadata behavior, which is implicit & subject to change
in fact, sync_partition_metadata could canonicalize the values inferred from the storage, why not

there should be some onHive way to enter into this non-canonical partitions state.
Or, we could use HMS client directly (product tests can do that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the test to the form:

    @Test
    public void testAnalyzePartitionedTableWithCanonicalization()
    {
        String tableName = "test_analyze_table_canonicalization_" + randomNameSuffix();
        assertUpdate("CREATE TABLE %s (a_varchar varchar, month integer) WITH (partitioned_by = ARRAY['month'])".formatted(getFullyQualifiedTestTableName(tableName)));

        assertUpdate("INSERT INTO " + getFullyQualifiedTestTableName(tableName) + " VALUES ('A', 1), ('AA', 1), ('B', 2), ('M', 10)", 4);

        // Simulate the fact that the Hive table has non-canonical partition values
        hiveMinioDataLake.getHiveHadoop().runOnMetastore(
                """
                UPDATE PARTITIONS
                SET part_name='month=01'
                WHERE
                   TBL_ID IN (SELECT tbl_id FROM TBLS t INNER JOIN DBS db ON t.db_id = db.db_id WHERE db.name = '%s' and t.tbl_name = '%s') AND
                   PART_NAME='month=1'
                """.formatted(HIVE_TEST_SCHEMA, tableName));
        assertQuery("SELECT * FROM " + HIVE_TEST_SCHEMA + ".\"" + tableName + "$partitions\"", "VALUES 1, 2, 10");
        assertUpdate("ANALYZE " + getFullyQualifiedTestTableName(tableName), 4);
        assertQuery("SHOW STATS FOR " + getFullyQualifiedTestTableName(tableName),
                """
                        VALUES
                            ('a_varchar', 5.0, 2.0, 0.0, null, null, null),
                            ('month', null, 3.0, 0.0, null, 1, 10),
                            (null, null, null, null, 4.0, null, null)
                        """);

        assertUpdate("INSERT INTO " + getFullyQualifiedTestTableName(tableName) + " VALUES ('C', 3)", 1);
        hiveMinioDataLake.getHiveHadoop().runOnMetastore(
                """
                UPDATE PARTITIONS
                SET part_name='month=03'
                WHERE
                   TBL_ID IN (SELECT tbl_id FROM TBLS t INNER JOIN DBS db ON t.db_id = db.db_id WHERE db.name = '%s' and t.tbl_name = '%s') AND
                   PART_NAME='month=3'
                """.formatted(HIVE_TEST_SCHEMA, tableName));

        assertUpdate("ANALYZE " + getFullyQualifiedTestTableName(tableName) + " WITH (partitions = ARRAY[ARRAY['03']])", 1);
        assertQuery("SHOW STATS FOR " + getFullyQualifiedTestTableName(tableName),
                """
                        VALUES
                            ('a_varchar', 6.0, 2.0, 0.0, null, null, null),
                            ('month', null, 4.0, 0.0, null, 1, 10),
                            (null, null, null, null, 5.0, null, null)
                        """);
        // TODO (https://github.com/trinodb/trino/issues/15998) fix selective ANALYZE for table with non-canonical partition values
        assertQueryFails("ANALYZE " + getFullyQualifiedTestTableName(tableName) + " WITH (partitions = ARRAY[ARRAY['3']])", "Partition no longer exists: month=3");

        assertUpdate("DROP TABLE " + getFullyQualifiedTestTableName(tableName));
    }

but unfortunately we stumble on

Caused by: io.trino.spi.TrinoException: Partition no longer exists: month=01
	at io.trino.plugin.hive.HiveSplitManager.lambda$getPartitionMetadata$3(HiveSplitManager.java:323)
	at com.google.common.collect.Iterators$6.transform(Iterators.java:829)

Probably related to #15998

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map<String, List<String>> partitionNameToPartitionValuesMap = partitionNames.stream()
.collect(Collectors.toMap(identity(), HiveUtil::toPartitionValues));

we'd probably need here the partitionTypes to do something similar as in io.trino.plugin.hive.HiveMetadata#canonicalizePartitionValues

"month=01" -> would get us "01" to the raw partition value -> which would need to be parsed to "1"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the code to avoid depending on system.sync_partition_metadata procedure call and instead add the partitions of the table manually through the metastore client.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

( #15998 )

@findinpath findinpath changed the title Canonicalize partition values while upserting table statistics Fix ANALYZE when Hive partition has non-canonical value Feb 8, 2023
@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch 3 times, most recently from 3c366c5 to 0d60b26 Compare February 10, 2023 16:49
@findinpath findinpath requested a review from ebyhr February 10, 2023 16:50
Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost good to me.

@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch from 0d60b26 to 9ad2b2b Compare February 15, 2023 13:28
@findinpath findinpath self-assigned this Feb 15, 2023
@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch from 9ad2b2b to 28d01aa Compare February 16, 2023 05:34
@findinpath findinpath requested a review from ebyhr February 16, 2023 05:38
@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch 2 times, most recently from 37e156a to 561f588 Compare February 22, 2023 12:29
@findinpath findinpath force-pushed the findinpath/hive-analyze-canonicalization branch 2 times, most recently from 5d93025 to 5618afb Compare February 22, 2023 14:57
In Hive it may well happen that a partition value
is written by the writer process as a string,
e.g. : `month=02`, even though the column is registered
in Hive as an integer.
When updating the table or when doing `ANALYZE`, the output
in Trino of the statistics computation though for the partition
from the example above will be though `2`.

While creating/changing table statistics, the partition values are
parsed to Trino `NullableValue`s in order to match, independently of
canonicalization, the partition column grouping values from the output
of the statistics computation.
@ebyhr ebyhr force-pushed the findinpath/hive-analyze-canonicalization branch from 5618afb to cafe1e4 Compare February 24, 2023 06:49
@findinpath findinpath requested a review from ebyhr February 27, 2023 06:25
@ebyhr ebyhr merged commit ba259ae into trinodb:master Feb 27, 2023
@github-actions github-actions bot added this to the 409 milestone Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants