fix(plugin-iceberg): Disable metadata deletion on varbinary columns #27050

Merged
hantangwangd merged 1 commit into prestodb:master from hantangwangd:disable_metadata_deletion_on_varbinary
Jan 30, 2026

Conversation

@hantangwangd (Member) commented Jan 29, 2026

Description

Due to Iceberg issue apache/iceberg#15128, using a binary type as a partition column may cause incorrect calculation of partition bounds in the generated manifest files when deleting data files. This can lead to incorrect results in subsequent queries.

Therefore, we temporarily disable metadata deletion and full filter pushdown for varbinary columns. This restriction can be lifted once the Iceberg issue is resolved.

Motivation and Context

Fix the bug that occurs when varbinary columns are used as partition columns in Iceberg tables.

Impact

This change is not visible to users.

Test Plan

  • New parameterized test cases in IcebergDistributedTestBase.testPartitionedByVarbinaryType, supplied through a @DataProvider, which fail without this fix.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

Summary by Sourcery

Guard Iceberg plan optimization from enforcing metadata constraints on VARBINARY-partitioned columns and strengthen test coverage for varbinary partitioning behavior.

Bug Fixes:

  • Avoid pushing down column constraints into Iceberg partition specs for VARBINARY columns to prevent incorrect metadata-based deletions and query results when varbinary is used as a partition key.

Tests:

  • Extend the varbinary partitioning integration test to cover multiple insert value orderings and updated expected partition counts via a TestNG data provider.

@sourcery-ai bot (Contributor) commented Jan 29, 2026

Reviewer's Guide

Disables Iceberg column constraint enforcement for VARBINARY partition columns to avoid incorrect partition metadata, and strengthens the varbinary-partitioned table test to cover different insert orders and updated partition expectations.

Class diagram for IcebergPlanOptimizer varbinary constraint guard and varbinary partition tests

classDiagram
    class IcebergPlanOptimizer {
        +static boolean canEnforceColumnConstraintInSpecs(IcebergColumnHandle columnHandle, Set~Integer~ partitionSpecIds, IcebergTable table, Domain domain, ConnectorSession session)
    }

    class IcebergColumnHandle {
        +Type getType()
    }

    class IcebergTable {
        +Map~Integer, IcebergPartitionSpec~ specs()
    }

    class IcebergPartitionSpec {
        +int specId()
    }

    class Domain
    class ConnectorSession
    class Type

    IcebergPlanOptimizer --> IcebergColumnHandle : uses
    IcebergPlanOptimizer --> IcebergTable : uses
    IcebergPlanOptimizer --> Domain : uses
    IcebergPlanOptimizer --> ConnectorSession : uses
    IcebergColumnHandle --> Type : uses

    note for IcebergPlanOptimizer "New behavior: canEnforceColumnConstraintInSpecs returns false when columnHandle.getType() is VARBINARY before checking specs()"

    class IcebergDistributedTestBase {
        +void testPartitionedByVarbinaryType(String insertOrder)
        +Object[][] dataProviderForPartitionedByVarbinaryType()
    }

    IcebergDistributedTestBase ..> IcebergTable : creates and verifies
    IcebergDistributedTestBase ..> IcebergPlanOptimizer : indirectly exercises through query planning

File-Level Changes

Change: Guard Iceberg partition constraint enforcement to skip VARBINARY partition columns.
Details:
  • Update canEnforceColumnConstraintInSpecs to immediately return false when the column type is VARBINARY
  • Rely on existing per-spec constraint logic for all non-VARBINARY types as before
Files: presto-iceberg/src/main/java/com/facebook/presto/iceberg/optimizer/IcebergPlanOptimizer.java

Change: Broaden varbinary partitioning test coverage for Iceberg tables with multiple insert orders and adjusted partition expectations.
Details:
  • Introduce a DataProvider that supplies different value insertion orders for the varbinary-partitioned test table
  • Parameterize testPartitionedByVarbinaryType to use the provided insert value permutations
  • Change the expected partition-count assertion from 1 to 2 to reflect separate partitions per distinct varbinary value while keeping existing value/row_count checks
Files: presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java
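
The early-return guard described for canEnforceColumnConstraintInSpecs can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: ColumnType and PartitionSpec are hypothetical stand-ins for the Presto and Iceberg types, the method signature is simplified, and the per-spec check is only a placeholder assumption for the existing logic. Only the shape of the guard (return false for VARBINARY before any spec is inspected) follows the change description.

```java
import java.util.List;
import java.util.Set;

// Illustration only: simplified stand-ins, NOT the real Presto or Iceberg classes.
final class VarbinaryGuardSketch
{
    // Hypothetical stand-in for the column's Presto type.
    enum ColumnType { BIGINT, VARCHAR, VARBINARY }

    // Hypothetical stand-in for a partition spec: its id and the columns it partitions by.
    static final class PartitionSpec
    {
        final int specId;
        final Set<String> partitionColumns;

        PartitionSpec(int specId, Set<String> partitionColumns)
        {
            this.specId = specId;
            this.partitionColumns = partitionColumns;
        }
    }

    private VarbinaryGuardSketch() {}

    static boolean canEnforceColumnConstraintInSpecs(
            String columnName,
            ColumnType columnType,
            Set<Integer> partitionSpecIds,
            List<PartitionSpec> specs)
    {
        // Workaround for apache/iceberg#15128: binary partition columns may get
        // incorrect partition bounds in manifests when data files are deleted.
        // Remove this guard once the upstream issue is resolved.
        if (columnType == ColumnType.VARBINARY) {
            return false;
        }

        // Placeholder for the existing per-spec logic (an assumption): the constraint
        // is enforceable only if every relevant spec partitions by this column.
        return specs.stream()
                .filter(spec -> partitionSpecIds.contains(spec.specId))
                .allMatch(spec -> spec.partitionColumns.contains(columnName));
    }
}
```

Placing the type check first means VARBINARY columns never reach the per-spec logic, so removing the workaround later amounts to deleting one block.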

@hantangwangd hantangwangd marked this pull request as ready for review January 29, 2026 14:28
@hantangwangd hantangwangd requested review from a team and ZacBlanco as code owners January 29, 2026 14:28
@sourcery-ai bot (Contributor) left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • In canEnforceColumnConstraintInSpecs, consider using a type helper (e.g., isVarbinaryType(columnHandle.getType())) or comparing against a more general binary category instead of a direct columnHandle.getType() == VARBINARY check, to make the guard resilient to future changes in type instances/mappings.
  • It would be helpful to add a brief code comment near the VARBINARY guard in IcebergPlanOptimizer referencing the upstream Iceberg issue and the behavior being worked around, so future changes know when and why this special case can be removed.

## Individual Comments

### Comment 1
<location> `presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java:872-873` </location>
<code_context>
         assertEquals(getQueryRunner().execute("select b FROM test_partition_columns_varbinary where b = X'e3bcd1'").getOnlyValue(),
                 new byte[] {(byte) 0xe3, (byte) 0xbc, (byte) 0xd1});
-        assertEquals(getQueryRunner().execute("select count(*) from \"test_partition_columns_varbinary$partitions\"").getOnlyValue(), 1L);
+        assertEquals(getQueryRunner().execute("select count(*) from \"test_partition_columns_varbinary$partitions\"").getOnlyValue(), 2L);
         assertEquals(getQueryRunner().execute("select row_count from \"test_partition_columns_varbinary$partitions\" where b = X'e3bcd1'").getOnlyValue(), 1L);

         assertQuerySucceeds("drop table test_partition_columns_varbinary");
</code_context>

<issue_to_address>
**suggestion (testing):** Assert partition metadata more fully (both partitions and their row counts)

The test now only validates `row_count` for the `X'e3bcd1'` partition. To better verify partition metadata, also assert that the `X'bcd1'` partition exists and that both partitions have the expected `row_count` (e.g., 1 row each). For example, query all partitions and assert the `{b -> row_count}` mapping or that `sum(row_count)` equals the table’s total row count.
</issue_to_address>
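
One way to check the `{b -> row_count}` mapping suggested above is sketched below. This is a hypothetical helper, not part of the PR: it assumes each `$partitions` row has already been materialized as `{byte[] partitionValue, Long rowCount}`, and it keys the map by a hex string because `byte[]` uses identity equality and cannot serve as a content-based map key. The names `PartitionRowCounts`, `toHex`, and `rowCountsByPartition` are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a hypothetical test helper, not code from this PR.
final class PartitionRowCounts
{
    private PartitionRowCounts() {}

    // Renders bytes as lowercase hex so partition values compare by content.
    static String toHex(byte[] bytes)
    {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Each row is assumed to be {byte[] partitionValue, Long rowCount}.
    // Rows sharing a partition value have their counts summed.
    static Map<String, Long> rowCountsByPartition(List<Object[]> rows)
    {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Object[] row : rows) {
            counts.merge(toHex((byte[]) row[0]), (Long) row[1], Long::sum);
        }
        return counts;
    }
}
```

A test could then compare the resulting map against an expected map containing both `bcd1` and `e3bcd1`, which covers the partition count and the per-partition row counts in a single assertion.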

### Comment 2
<location> `presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java:840-849` </location>
<code_context>
+                {"(2, X'e3bcd1'), (1, X'bcd1')"}};
+    }
+
+    @Test(dataProvider = "insertValues")
+    public void testPartitionedByVarbinaryType(String insertValues)
     {
</code_context>

<issue_to_address>
**issue (testing):** Add a test that exercises deletes/metadata deletion with varbinary partitions

This test currently only covers inserts/selects, but the bug and fix are about incorrect metadata and partition bounds when deleting from VARBINARY-partitioned tables. Please add or extend a test to:

- Execute a DELETE (e.g., `DELETE FROM test_partition_columns_varbinary WHERE b = X'...'`) that would previously hit the bad constraint enforcement.
- Optionally query `$partitions` afterward to confirm partition metadata and row counts match the table contents.

That will directly exercise the regression scenario for deletes on VARBINARY partitions.
</issue_to_address>


@tdcmeehan (Contributor) left a comment

Looks good, just one comment.

@hantangwangd hantangwangd force-pushed the disable_metadata_deletion_on_varbinary branch from 66858d9 to 6096a5d Compare January 30, 2026 01:24
@hantangwangd hantangwangd merged commit c012600 into prestodb:master Jan 30, 2026
142 of 144 checks passed
@hantangwangd hantangwangd deleted the disable_metadata_deletion_on_varbinary branch January 30, 2026 05:30