
feat(connector): Add support for AWS Glue Table and Column Statistics#26297

Closed
imjalpreet wants to merge 1 commit into prestodb:master from imjalpreet:glueStatistics

Conversation

@imjalpreet
Member

@imjalpreet imjalpreet commented Oct 13, 2025

Description

Add support for AWS Glue Table and Column Statistics

Motivation and Context

Based on trinodb/trino@f1bcfa7

Impact

Users will be able to utilize statistics and enable CBO when using AWS Glue as a metastore.

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Add support for AWS Glue Table and Column Statistics

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Oct 13, 2025
@sourcery-ai
Contributor

sourcery-ai bot commented Oct 13, 2025

Reviewer's Guide

This PR adds AWS Glue table and column statistics support. It introduces a pluggable GlueColumnStatisticsProvider abstraction with enabled and disabled implementations, wires in new executors and configuration flags, and extends GlueHiveMetastore to delegate statistics reads and writes to the provider. It also improves the batch partition fetch and update logic, extends the updatePartitionStatistics API across the metastore implementations, and adds converter utilities and the necessary test updates.

Sequence diagram for updating partition statistics with column statistics in GlueHiveMetastore

sequenceDiagram
participant "Caller"
participant "GlueHiveMetastore"
participant "DefaultGlueColumnStatisticsProvider"
participant "AWSGlueAsync"
"Caller"->>"GlueHiveMetastore": updatePartitionStatistics(...)
"GlueHiveMetastore"->>"DefaultGlueColumnStatisticsProvider": getPartitionColumnStatistics(partitions)
"DefaultGlueColumnStatisticsProvider"->>"AWSGlueAsync": GetColumnStatisticsForPartition
"AWSGlueAsync"-->>"DefaultGlueColumnStatisticsProvider": Partition column stats
"GlueHiveMetastore"->>"AWSGlueAsync": batchUpdatePartitionAsync
"GlueHiveMetastore"->>"DefaultGlueColumnStatisticsProvider": updatePartitionStatistics(updates)
"DefaultGlueColumnStatisticsProvider"->>"AWSGlueAsync": UpdateColumnStatisticsForPartition
"DefaultGlueColumnStatisticsProvider"->>"AWSGlueAsync": DeleteColumnStatisticsForPartition (if needed)
"AWSGlueAsync"-->>"DefaultGlueColumnStatisticsProvider": Update/Delete result
"GlueHiveMetastore"-->>"Caller": Done

ER diagram for GlueHiveMetastoreConfig statistics-related properties

erDiagram
GLUE_HIVE_METASTORE_CONFIG {
  bool columnStatisticsEnabled
  int readStatisticsThreads
  int writeStatisticsThreads
}
GLUE_HIVE_METASTORE_CONFIG ||--o| GLUE_HIVE_METASTORE : "configures"
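
As a rough illustration of the three properties in the diagram, the sketch below reads them from a `Properties` object and rejects thread counts below 1. It is not the PR's GlueHiveMetastoreConfig: the class name, the `hive.metastore.glue.` prefix on the two thread-count keys, and every default value are assumptions made for this example.

```java
import java.util.Properties;

// Hypothetical stand-in for GlueHiveMetastoreConfig; not the class from this PR.
final class GlueStatisticsConfigSketch
{
    // The first key is quoted in the reviewer's guide; the prefix on the other two is assumed.
    static final String ENABLED_KEY = "hive.metastore.glue.column-statistics-enabled";
    static final String READ_THREADS_KEY = "hive.metastore.glue.read-statistics-threads";
    static final String WRITE_THREADS_KEY = "hive.metastore.glue.write-statistics-threads";

    final boolean columnStatisticsEnabled;
    final int readStatisticsThreads;
    final int writeStatisticsThreads;

    private GlueStatisticsConfigSketch(boolean columnStatisticsEnabled, int readStatisticsThreads, int writeStatisticsThreads)
    {
        this.columnStatisticsEnabled = columnStatisticsEnabled;
        this.readStatisticsThreads = readStatisticsThreads;
        this.writeStatisticsThreads = writeStatisticsThreads;
    }

    static GlueStatisticsConfigSketch fromProperties(Properties properties)
    {
        // All defaults below are placeholders, not the values chosen in the PR.
        boolean enabled = Boolean.parseBoolean(properties.getProperty(ENABLED_KEY, "false"));
        return new GlueStatisticsConfigSketch(
                enabled,
                readPositiveInt(properties, READ_THREADS_KEY, 1),
                readPositiveInt(properties, WRITE_THREADS_KEY, 1));
    }

    // Mirrors the "with validation" note in the guide: a thread pool needs at least one thread.
    private static int readPositiveInt(Properties properties, String key, int placeholderDefault)
    {
        int value = Integer.parseInt(properties.getProperty(key, String.valueOf(placeholderDefault)));
        if (value < 1) {
            throw new IllegalArgumentException(key + " must be at least 1, but was " + value);
        }
        return value;
    }
}
```

Separate read and write thread counts let the two executors be sized independently, which matches the two executor factory methods listed for GlueMetastoreModule.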

Class diagram for new and updated Glue column statistics support

classDiagram
class GlueHiveMetastore {
  - GlueColumnStatisticsProvider columnStatisticsProvider
  - boolean enableColumnStatistics
  + getSupportedColumnStatistics(type)
  + getTableStatistics(...)
  + getPartitionStatistics(...)
  + updateTableStatistics(...)
  + updatePartitionStatistics(...)
}
class GlueColumnStatisticsProvider {
  <<interface>>
  + getSupportedColumnStatistics(type)
  + getTableColumnStatistics(table)
  + getPartitionColumnStatistics(partitions)
  + updateTableColumnStatistics(table, columnStatistics)
  + updatePartitionStatistics(partitionStatisticsUpdates)
}
class DefaultGlueColumnStatisticsProvider {
  + getSupportedColumnStatistics(type)
  + getTableColumnStatistics(table)
  + getPartitionColumnStatistics(partitions)
  + updateTableColumnStatistics(table, columnStatistics)
  + updatePartitionStatistics(partitionStatisticsUpdates)
}
class DisabledGlueColumnStatisticsProvider {
  + getSupportedColumnStatistics(type)
  + getTableColumnStatistics(table)
  + getPartitionColumnStatistics(partitions)
  + updateTableColumnStatistics(table, columnStatistics)
  + updatePartitionStatistics(partitionStatisticsUpdates)
}
GlueHiveMetastore --> GlueColumnStatisticsProvider
GlueColumnStatisticsProvider <|.. DefaultGlueColumnStatisticsProvider
GlueColumnStatisticsProvider <|.. DisabledGlueColumnStatisticsProvider
class GlueHiveMetastoreConfig {
  + boolean columnStatisticsEnabled
  + int readStatisticsThreads
  + int writeStatisticsThreads
  + setColumnStatisticsEnabled(...)
  + setReadStatisticsThreads(...)
  + setWriteStatisticsThreads(...)
}
class GlueMetastoreModule {
  + createStatisticsReadExecutor(...)
  + createStatisticsWriteExecutor(...)
}
class ForGlueColumnStatisticsRead {
  <<annotation>>
}
class ForGlueColumnStatisticsWrite {
  <<annotation>>
}
GlueMetastoreModule --> ForGlueColumnStatisticsRead
GlueMetastoreModule --> ForGlueColumnStatisticsWrite
DefaultGlueColumnStatisticsProvider --> GlueMetastoreStats
class GlueMetastoreStats {
  + getGetColumnStatisticsForTable()
  + getGetColumnStatisticsForPartition()
  + getUpdateColumnStatisticsForTable()
  + getDeleteColumnStatisticsForTable()
  + getUpdateColumnStatisticsForPartition()
  + getDeleteColumnStatisticsForPartition()
}
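
The provider relationship in the class diagram can be sketched in a few lines. This is a simplified illustration, not the PR's code: `ColumnStats`, `StatisticsProvider`, and `InMemoryProvider` are hypothetical stand-ins, only the table-level methods are shown, and the behavior of the disabled variant (empty reads, rejected non-empty writes) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class ColumnStatisticsProviderSketch
{
    // Simplified stand-in for HiveColumnStatistics.
    static final class ColumnStats
    {
        final long nullsCount;
        final long distinctValuesCount;

        ColumnStats(long nullsCount, long distinctValuesCount)
        {
            this.nullsCount = nullsCount;
            this.distinctValuesCount = distinctValuesCount;
        }
    }

    // Reduced version of the provider interface: table-level read and write only.
    interface StatisticsProvider
    {
        Map<String, ColumnStats> getTableColumnStatistics(String tableName);

        void updateTableColumnStatistics(String tableName, Map<String, ColumnStats> columnStatistics);
    }

    // Disabled variant: reads return nothing, non-empty writes are rejected (assumed behavior).
    static final class DisabledProvider
            implements StatisticsProvider
    {
        @Override
        public Map<String, ColumnStats> getTableColumnStatistics(String tableName)
        {
            return Map.of();
        }

        @Override
        public void updateTableColumnStatistics(String tableName, Map<String, ColumnStats> columnStatistics)
        {
            if (!columnStatistics.isEmpty()) {
                throw new UnsupportedOperationException("Column level statistics are disabled");
            }
        }
    }

    // Enabled variant, backed by a map here instead of the Glue API.
    static final class InMemoryProvider
            implements StatisticsProvider
    {
        private final Map<String, Map<String, ColumnStats>> statisticsByTable = new HashMap<>();

        @Override
        public Map<String, ColumnStats> getTableColumnStatistics(String tableName)
        {
            return statisticsByTable.getOrDefault(tableName, Map.of());
        }

        @Override
        public void updateTableColumnStatistics(String tableName, Map<String, ColumnStats> columnStatistics)
        {
            statisticsByTable.put(tableName, Map.copyOf(columnStatistics));
        }
    }

    // The implementation is picked once, based on the configuration flag.
    static StatisticsProvider create(boolean columnStatisticsEnabled)
    {
        return columnStatisticsEnabled ? new InMemoryProvider() : new DisabledProvider();
    }
}
```

With this shape the metastore calls the same interface in both modes and never has to branch on the flag at each call site.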

File-Level Changes

Change Details Files
Introduce column statistics provider abstraction and DI
  • Define GlueColumnStatisticsProvider interface with methods for table and partition stats
  • Implement DefaultGlueColumnStatisticsProvider and DisabledGlueColumnStatisticsProvider
  • Add @ForGlueColumnStatisticsRead and @ForGlueColumnStatisticsWrite qualifiers
  • Bind the provider in GlueMetastoreModule for dependency injection
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/GlueColumnStatisticsProvider.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/DefaultGlueColumnStatisticsProvider.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/DisabledGlueColumnStatisticsProvider.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/ForGlueColumnStatisticsRead.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/ForGlueColumnStatisticsWrite.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/GlueMetastoreModule.java
Enhance GlueHiveMetastore to integrate column statistics and improve batch partition operations
  • Add columnStatisticsProvider field, enableColumnStatistics flag, and inject separate executors
  • Rename generic executor to partitionsReadExecutor and provide dedicated stats executors
  • Override getSupportedColumnStatistics, getTableStatistics, getPartitionStatistics to fetch column stats
  • Update methods updateTableStatistics, updatePartitionStatistics, addPartitions, alterPartition to delegate column stats writes
  • Implement batchUpdatePartitionStatisticsBatch with BATCH_UPDATE_PARTITION_MAX_PAGE_SIZE
  • Refine batchGetPartition to handle unprocessed keys in a loop and reuse Immutable converters
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/GlueHiveMetastore.java
Add configuration options and metrics for column statistics
  • Introduce hive.metastore.glue.column-statistics-enabled, read-statistics-threads, write-statistics-threads in GlueHiveMetastoreConfig with validation
  • Extend GlueMetastoreStats with get/update/delete metrics for table and partition statistics
  • Update TestGlueHiveMetastoreConfig to assert new default and explicit mappings
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/GlueHiveMetastoreConfig.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/GlueMetastoreStats.java
presto-hive-metastore/src/test/java/com/facebook/presto/hive/metastore/glue/TestGlueHiveMetastoreConfig.java
Extend updatePartitionStatistics API and propagate signature change
  • Change ExtendedHiveMetastore.updatePartitionStatistics to accept Map<String,Function> for bulk updates
  • Provide default overload for single-partition updates
  • Adapt FileHiveMetastore, InMemoryCachingHiveMetastore, BridgingHiveMetastore, RecordingHiveMetastore, UnimplementedHiveMetastore to new signature
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/ExtendedHiveMetastore.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file/FileHiveMetastore.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/InMemoryCachingHiveMetastore.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/thrift/BridgingHiveMetastore.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/RecordingHiveMetastore.java
presto-hive-metastore/src/test/java/com/facebook/presto/hive/metastore/UnimplementedHiveMetastore.java
Add converter utilities for Glue column statistics and utility enhancements
  • Introduce GlueStatConverter for mapping HiveColumnStatistics to AWS ColumnStatisticsData and back
  • Add MetastoreUtil.makePartitionName helper
  • Expose toMetastoreDistinctValuesCount and getAverageColumnLength in ThriftMetastoreUtil
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/converter/GlueStatConverter.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/MetastoreUtil.java
presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java
Adjust tests for new statistics support and behavior
  • Enable column-statistics-enabled in TestHiveClientGlueMetastore setup
  • Update AbstractTestHiveClient expected statistics values
  • Remove or skip tests for unsupported column-level behavior
presto-hive/src/test/java/com/facebook/presto/hive/metastore/glue/TestHiveClientGlueMetastore.java
presto-hive/src/test/java/com/facebook/presto/hive/AbstractTestHiveClient.java
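
The updatePartitionStatistics change listed above (a bulk method taking a map, plus a default single-partition overload) might look roughly like the following. This is a sketch under assumptions: the `Metastore` interface, the `InMemoryMetastore` class, and the use of a plain `Long` row count in place of the real statistics type are all hypothetical, and the generic parameters of the map are guessed because the guide only shows `Map<String,Function>`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class PartitionStatisticsApiSketch
{
    // Hypothetical reduction of ExtendedHiveMetastore; statistics are modeled as a row count.
    interface Metastore
    {
        // Bulk form: partition name -> function deriving the new statistics from the current ones.
        void updatePartitionStatistics(String databaseName, String tableName, Map<String, Function<Long, Long>> updates);

        // Single-partition convenience overload that delegates to the bulk form,
        // so implementations only have to provide the method above.
        default void updatePartitionStatistics(String databaseName, String tableName, String partitionName, Function<Long, Long> update)
        {
            updatePartitionStatistics(databaseName, tableName, Map.of(partitionName, update));
        }
    }

    static final class InMemoryMetastore
            implements Metastore
    {
        final Map<String, Long> rowCounts = new HashMap<>();

        @Override
        public void updatePartitionStatistics(String databaseName, String tableName, Map<String, Function<Long, Long>> updates)
        {
            // Apply each update function to the currently stored value (0 if absent).
            updates.forEach((partitionName, update) ->
                    rowCounts.put(partitionName, update.apply(rowCounts.getOrDefault(partitionName, 0L))));
        }
    }
}
```

Existing callers that update one partition at a time keep compiling against the four-argument overload, while a Glue-backed implementation can batch the entries of the map.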


Co-authored-by: Deepak Majeti <majeti.deepak@gmail.com>
Co-authored-by: George Wang <fgwang7w@gmail.com>
@yingsu00
Contributor

@imjalpreet Thank you for the PR! I'll review it tomorrow. @agrawalreetika Will you be able to review it first?

@yingsu00 yingsu00 self-assigned this Oct 24, 2025
public static List<ColumnStatistics> toGlueColumnStatistics(
        Partition partition,
        Map<String, HiveColumnStatistics> trinoColumnStats,
Contributor

Remove "trino". The same applies to other places and files too.
Is this file ported over from Trino? If yes, please add a Co-authored-by section to the PR and the commit message.

Member Author

Thank you, I missed this. Yes, this class and a subset of these changes are part of a Trino PR. Additionally, we have modified the implementation to have a more optimized version. I will also add the PR details.

@yingsu00 yingsu00 marked this pull request as ready for review October 25, 2025 23:02
@yingsu00 yingsu00 requested a review from a team as a code owner October 25, 2025 23:02
@prestodb-ci prestodb-ci requested review from a team and ScrapCodes and removed request for a team October 25, 2025 23:02
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In batchGetPartition the while‐loop can spin indefinitely if Glue keeps returning unprocessed keys but no partitions; consider adding a max retry or bail-out condition to avoid infinite loops.
  • The hard-coded batch sizes (BATCH_GET_PARTITION_MAX_PAGE_SIZE, BATCH_UPDATE_PARTITION_MAX_PAGE_SIZE) may not match Glue’s actual limits or customer workloads—consider making them configurable or aligning them precisely with AWS Glue API docs.
  • Verify that the new default implementation of updatePartitionStatistics(single-partition) correctly delegates to the multi-partition overload so existing code paths and third-party metastore implementations remain fully compatible.

## Individual Comments

### Comment 1
<location> `presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/glue/GlueHiveMetastore.java:437-438` </location>
<code_context>
-        if (!updatedStatistics.getColumnStatistics().isEmpty()) {
-            throw new PrestoException(NOT_SUPPORTED, "Glue metastore does not support column level statistics");
-        }
+        Iterables.partition(updates.entrySet(), BATCH_CREATE_PARTITION_MAX_PAGE_SIZE).forEach(partitionUpdates ->
+                updatePartitionStatisticsBatch(metastoreContext, databaseName, tableName, partitionUpdates.stream().collect(toImmutableMap(Entry::getKey, Entry::getValue))));
+    }
</code_context>

<issue_to_address>
**suggestion:** Partition statistics update batching uses BATCH_CREATE_PARTITION_MAX_PAGE_SIZE for both create and update.

Use BATCH_UPDATE_PARTITION_MAX_PAGE_SIZE for update batching to match the update API's limits.

```suggestion
        Iterables.partition(updates.entrySet(), BATCH_UPDATE_PARTITION_MAX_PAGE_SIZE).forEach(partitionUpdates ->
                updatePartitionStatisticsBatch(metastoreContext, databaseName, tableName, partitionUpdates.stream().collect(toImmutableMap(Entry::getKey, Entry::getValue))));
```
</issue_to_address>

### Comment 2
<location> `presto-hive/src/test/java/com/facebook/presto/hive/metastore/glue/TestHiveClientGlueMetastore.java:178-181` </location>
<code_context>
         HdfsConfiguration hdfsConfiguration = new HiveHdfsConfiguration(new HdfsConfigurationInitializer(hiveClientConfig, metastoreClientConfig), ImmutableSet.of(), hiveClientConfig);
         HdfsEnvironment hdfsEnvironment = new HdfsEnvironment(hdfsConfiguration, metastoreClientConfig, new NoHdfsAuthentication());
-        GlueHiveMetastoreConfig glueConfig = new GlueHiveMetastoreConfig();
+        GlueHiveMetastoreConfig glueConfig = new GlueHiveMetastoreConfig().setColumnStatisticsEnabled(true);
         glueConfig.setDefaultWarehouseDir(tempDir.toURI().toString());

-        return new GlueHiveMetastore(hdfsEnvironment, glueConfig, executor);
+        return new GlueHiveMetastore(hdfsEnvironment, glueConfig, executor, executor, executor);
     }

</code_context>

<issue_to_address>
**suggestion (testing):** Test setup enables column statistics, but lacks direct tests for Glue column statistics behavior.

Please add or update tests to directly verify Glue column statistics functionality, including reading, writing, error handling, and propagation through the metastore.

Suggested implementation:

```java
    @Test
    public void testGlueColumnStatisticsWriteAndRead() throws Exception {
        // Setup: create table and column statistics
        String databaseName = "test_db";
        String tableName = "test_table";
        String columnName = "test_column";
        HiveColumnStatistics stats = HiveColumnStatistics.builder()
                .setNullsCount(1L)
                .setDistinctValuesCount(2L)
                .setMaxValue(Optional.of("z"))
                .setMinValue(Optional.of("a"))
                .build();

        // Create database and table
        glueMetastore.createDatabase(new HiveDatabase(databaseName, "owner", "location", ImmutableMap.of()));
        glueMetastore.createTable(new HiveTable(databaseName, tableName, "owner", "location", ImmutableList.of(), ImmutableMap.of()));

        // Write column statistics
        glueMetastore.updateTableColumnStatistics(databaseName, tableName, columnName, stats);

        // Read column statistics
        Optional<HiveColumnStatistics> readStats = glueMetastore.getTableColumnStatistics(databaseName, tableName, ImmutableList.of(columnName)).get(columnName);

        assertTrue(readStats.isPresent(), "Column statistics should be present");
        assertEquals(readStats.get().getNullsCount(), stats.getNullsCount(), "Nulls count should match");
        assertEquals(readStats.get().getDistinctValuesCount(), stats.getDistinctValuesCount(), "Distinct values count should match");
        assertEquals(readStats.get().getMaxValue(), stats.getMaxValue(), "Max value should match");
        assertEquals(readStats.get().getMinValue(), stats.getMinValue(), "Min value should match");
    }

    @Test
    public void testGlueColumnStatisticsErrorHandling() {
        String databaseName = "nonexistent_db";
        String tableName = "nonexistent_table";
        String columnName = "nonexistent_column";

        // Attempt to read statistics for non-existent table/column
        try {
            glueMetastore.getTableColumnStatistics(databaseName, tableName, ImmutableList.of(columnName));
            fail("Expected exception for non-existent table/column");
        } catch (Exception e) {
            // Expected: verify error propagation
            assertTrue(e.getMessage().contains("not found") || e.getMessage().contains("does not exist"));
        }
    }

    @Test
    public void testGlueColumnStatisticsPropagation() throws Exception {
        String databaseName = "propagation_db";
        String tableName = "propagation_table";
        String columnName = "propagation_column";
        HiveColumnStatistics stats = HiveColumnStatistics.builder()
                .setNullsCount(5L)
                .setDistinctValuesCount(10L)
                .build();

        glueMetastore.createDatabase(new HiveDatabase(databaseName, "owner", "location", ImmutableMap.of()));
        glueMetastore.createTable(new HiveTable(databaseName, tableName, "owner", "location", ImmutableList.of(), ImmutableMap.of()));
        glueMetastore.updateTableColumnStatistics(databaseName, tableName, columnName, stats);

        // Simulate propagation: update stats and verify new value
        HiveColumnStatistics updatedStats = HiveColumnStatistics.builder()
                .setNullsCount(7L)
                .setDistinctValuesCount(12L)
                .build();
        glueMetastore.updateTableColumnStatistics(databaseName, tableName, columnName, updatedStats);

        Optional<HiveColumnStatistics> readStats = glueMetastore.getTableColumnStatistics(databaseName, tableName, ImmutableList.of(columnName)).get(columnName);
        assertTrue(readStats.isPresent(), "Column statistics should be present after update");
        assertEquals(readStats.get().getNullsCount(), updatedStats.getNullsCount(), "Updated nulls count should match");
        assertEquals(readStats.get().getDistinctValuesCount(), updatedStats.getDistinctValuesCount(), "Updated distinct values count should match");
    }

```

- Ensure that the `glueMetastore` instance is properly initialized and available in the test class.
- If the test database/table creation or statistics update methods differ in your codebase, adjust the method calls accordingly.
- You may need to import relevant classes such as `HiveColumnStatistics`, `ImmutableList`, and assertion methods.
- If you use a different assertion library, replace `assertTrue`, `assertEquals`, and `fail` with your project's equivalents.
</issue_to_address>

### Comment 3
<location> `presto-hive/src/test/java/com/facebook/presto/hive/metastore/glue/TestHiveClientGlueMetastore.java:201-204` </location>
<code_context>
-
-    @Override
-    public void testUpdatePartitionColumnStatistics()
+    public void testUpdateTableColumnStatisticsEmptyOptionalFields() throws Exception
     {
-        // column statistics are not supported by Glue
+        // this test expects consistency between written and read stats but this is not provided by glue at the moment
+        // when writing empty min/max statistics glue will return 0 to the readers
+        // in order to avoid incorrect data we skip writes for statistics with min/max = null
     }
</code_context>

<issue_to_address>
**suggestion (testing):** Edge case for empty min/max statistics is acknowledged but not tested.

Consider adding a test that verifies Glue's handling of empty min/max statistics, ensuring the system responds correctly and helping to catch future regressions if Glue's behavior changes.

Suggested implementation:

```java
    @Test
    public void testGlueReturnsZeroForEmptyMinMaxStatistics() throws Exception
    {
        // Setup: create a table and column statistics with null min/max
        String databaseName = "test_db";
        String tableName = "test_table";
        String columnName = "test_column";
        // Create table and column if necessary (assume helper methods exist)
        createTestTable(databaseName, tableName, columnName);

        // Write column statistics with null min/max
        HiveColumnStatistics statsWithNullMinMax = HiveColumnStatistics.builder()
                .setMin(null)
                .setMax(null)
                .setNullsCount(0)
                .setDistinctValuesCount(0)
                .build();
        glueMetastore.updateTableColumnStatistics(databaseName, tableName, columnName, statsWithNullMinMax);

        // Read back statistics
        Optional<HiveColumnStatistics> readStats = glueMetastore.getTableColumnStatistics(databaseName, tableName, columnName);

        // Assert that Glue returns 0 for min/max
        assertTrue(readStats.isPresent(), "Statistics should be present");
        assertEquals(readStats.get().getMin(), 0, "Glue should return 0 for min when written as null");
        assertEquals(readStats.get().getMax(), 0, "Glue should return 0 for max when written as null");
    }


```

- You may need to implement or adjust helper methods like `createTestTable` and ensure `glueMetastore` is properly initialized for the test.
- Adjust the builder and assertion logic to match your actual `HiveColumnStatistics` API and types.
- If your statistics type is not integer, update the expected value and type accordingly.
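
The workaround described in the code comment above (skip the write when min/max are empty, because Glue would hand back 0 to readers) could be expressed as a small guard. This is only a sketch: `IntegerStats` and `shouldWriteToGlue` are hypothetical names, and treating a single missing bound as enough to skip the write is an assumption.

```java
import java.util.OptionalLong;

final class SkipEmptyMinMaxSketch
{
    // Simplified stand-in for integer column statistics; not a Presto class.
    static final class IntegerStats
    {
        final OptionalLong min;
        final OptionalLong max;

        IntegerStats(OptionalLong min, OptionalLong max)
        {
            this.min = min;
            this.max = max;
        }
    }

    // Only write when both bounds are known, so a reader never sees a 0 that was really "unknown".
    static boolean shouldWriteToGlue(IntegerStats statistics)
    {
        return statistics.min.isPresent() && statistics.max.isPresent();
    }
}
```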
</issue_to_address>


Comment on lines +437 to +438
Iterables.partition(updates.entrySet(), BATCH_CREATE_PARTITION_MAX_PAGE_SIZE).forEach(partitionUpdates ->
updatePartitionStatisticsBatch(metastoreContext, databaseName, tableName, partitionUpdates.stream().collect(toImmutableMap(Entry::getKey, Entry::getValue))));
Contributor

suggestion: Partition statistics update batching uses BATCH_CREATE_PARTITION_MAX_PAGE_SIZE for both create and update.

Use BATCH_UPDATE_PARTITION_MAX_PAGE_SIZE for update batching to match the update API's limits.

Suggested change
- Iterables.partition(updates.entrySet(), BATCH_CREATE_PARTITION_MAX_PAGE_SIZE).forEach(partitionUpdates ->
-         updatePartitionStatisticsBatch(metastoreContext, databaseName, tableName, partitionUpdates.stream().collect(toImmutableMap(Entry::getKey, Entry::getValue))));
+ Iterables.partition(updates.entrySet(), BATCH_UPDATE_PARTITION_MAX_PAGE_SIZE).forEach(partitionUpdates ->
+         updatePartitionStatisticsBatch(metastoreContext, databaseName, tableName, partitionUpdates.stream().collect(toImmutableMap(Entry::getKey, Entry::getValue))));
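
For readers unfamiliar with the batching pattern in this suggestion, here is a dependency-free sketch of splitting a map of updates into size-limited batches. It uses plain JDK collections instead of Guava's `Iterables.partition`; the `toBatches` helper is hypothetical, and the constant's value is a placeholder that should be checked against the AWS Glue API documentation rather than taken from here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UpdateBatchingSketch
{
    // Placeholder value for illustration; verify the real limit in the AWS Glue API docs.
    static final int BATCH_UPDATE_PARTITION_MAX_PAGE_SIZE = 100;

    // Splits the updates into consecutive maps of at most maxBatchSize entries each.
    static <K, V> List<Map<K, V>> toBatches(Map<K, V> updates, int maxBatchSize)
    {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        List<Map<K, V>> batches = new ArrayList<>();
        Map<K, V> current = new LinkedHashMap<>();
        for (Map.Entry<K, V> entry : updates.entrySet()) {
            current.put(entry.getKey(), entry.getValue());
            if (current.size() == maxBatchSize) {
                batches.add(current);
                current = new LinkedHashMap<>();
            }
        }
        // The last batch is usually smaller than the limit.
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Each returned map would then be passed to one batch call, which is why the page size has to match the limit of the specific API being called (update, not create).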

@agrawalreetika
Member

@imjalpreet could you please add Trino cherry-pick commits as well, whichever is relevant?

}
}

private Optional<Map<String, HiveColumnStatistics>> getPartitionColumnStatisticsIfPresent(Partition partition)
Member

@agrawalreetika agrawalreetika Oct 29, 2025

unused

@imjalpreet imjalpreet changed the title feat(aws-glue): Add support for AWS Glue Table and Column Statistics feat(connector): Add support for AWS Glue Table and Column Statistics Oct 29, 2025
@imjalpreet
Member Author

This PR needs some implementation changes to work with AWS SDK v2, as we are working on the upgrade: #26670. I will re-raise this feature as a separate PR with the updated implementation once the upgrade PR is merged.

@imjalpreet imjalpreet closed this Dec 8, 2025

Labels

from:IBM PR from IBM


4 participants