feat(iceberg): Add $snapshot_id as hidden column in iceberg table#26189

Draft
agrawalreetika wants to merge 1 commit into prestodb:master from agrawalreetika:iceberg-snapshotId

Conversation

@agrawalreetika
Member

Description

Add $snapshot_id as a hidden column in the iceberg table

Motivation and Context

Add $snapshot_id as a hidden column in the iceberg table
Addresses #26164

Impact

Test Plan

Integration test added

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add $snapshot_id as a hidden column in the iceberg table

@agrawalreetika agrawalreetika self-assigned this Sep 30, 2025
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Sep 30, 2025
@prestodb-ci prestodb-ci requested review from a team, pratyakshsharma and sh-shamsan and removed request for a team September 30, 2025 05:59
@sourcery-ai
Contributor

sourcery-ai bot commented Sep 30, 2025

Reviewer's Guide

This PR introduces $snapshot_id as a hidden metadata column by extending split representations and split sources to carry snapshot IDs, integrating the new column into column handles and table metadata, exposing it in page sources, and validating the feature via an integration test.

ER diagram for new $snapshot_id metadata column in table schema

erDiagram
ICEBERG_TABLE {
  VARCHAR $path
  BIGINT $data_sequence_number
  BOOLEAN $deleted
  VARCHAR $delete_file_path
  BIGINT $snapshot_id
}
ICEBERG_TABLE ||--o{ ICEBERG_SPLIT : contains
ICEBERG_SPLIT {
  BIGINT snapshotId
}

Class diagram for updated IcebergSplit and related classes

classDiagram
class IcebergSplit {
  - long dataSequenceNumber
  - long affinitySchedulingFileSectionSize
  - long affinitySchedulingFileSectionIndex
  + long snapshotId
  + getSnapshotId(): long
}
class ChangelogSplitSource {
  - long snapshotId
  + ChangelogSplitSource(..., long snapshotId)
}
class EqualityDeletesSplitSource {
  - long snapshotId
  + EqualityDeletesSplitSource(..., long snapshotId)
}
class IcebergSplitSource {
  - long snapshotId
  + IcebergSplitSource(...)
}
IcebergSplitSource --> IcebergSplit
ChangelogSplitSource --> IcebergSplit
EqualityDeletesSplitSource --> IcebergSplit

Class diagram for IcebergColumnHandle and IcebergMetadataColumn changes

classDiagram
class IcebergColumnHandle {
  + static SNAPSHOT_ID_COLUMN_HANDLE: IcebergColumnHandle
  + static SNAPSHOT_ID_COLUMN_METADATA: ColumnMetadata
  + isSnapshotId(): boolean
}
class IcebergMetadataColumn {
  + SNAPSHOT_ID
}
IcebergColumnHandle --> IcebergMetadataColumn

File-Level Changes

Change Details Files
Propagate snapshotId through split models and sources
  • Added snapshotId field, constructor parameter, and getter in IcebergSplit
  • Introduced snapshotId field and constructor parameter in ChangelogSplitSource and passed it to split creation
  • Introduced snapshotId field and constructor parameter in EqualityDeletesSplitSource and passed it to split creation
  • Added snapshotId field to IcebergSplitSource and initialized it from tableScan
  • Updated IcebergSplitManager to pass snapshotId when creating split sources
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSplit.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/changelog/ChangelogSplitSource.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/equalitydeletes/EqualityDeletesSplitSource.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSplitSource.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSplitManager.java
Define and register snapshot_id as a metadata column
  • Added SNAPSHOT_ID enum entry in IcebergMetadataColumn
  • Created SNAPSHOT_ID_COLUMN_HANDLE and SNAPSHOT_ID_COLUMN_METADATA and isSnapshotId() in IcebergColumnHandle
  • Included SNAPSHOT_ID_COLUMN_METADATA in metadata columns and mapping in IcebergAbstractMetadata
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergMetadataColumn.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergColumnHandle.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergAbstractMetadata.java
Expose snapshotId in page sources
  • Added handling for snapshot_id in metadataValues in IcebergPageSourceProvider
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergPageSourceProvider.java
Add integration test for snapshot_id hidden column
  • Introduced testSnapshotIdHiddenColumnSimple to verify distinct $snapshot_id count and current snapshot ID
presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java


Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Consider refactoring snapshotId propagation into a common base or builder to avoid repeating it across all split source constructors and reduce boilerplate.
  • Include snapshotId in the split info returned by IcebergSplit#getInfo() so that split logs or traces will clearly show which snapshot each split belongs to for easier debugging.
  • Add an integration test for point-in-time scans (using fromSnapshot/toSnapshot) to verify that $snapshot_id in query results matches the intended historical snapshot, not just the current one.
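
The second suggestion above could look roughly like the following. This is a minimal sketch, not the PR's code: `SplitInfoSketch` is a hypothetical stand-in for `IcebergSplit`, and the map keys are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a split; the real IcebergSplit carries more fields.
final class SplitInfoSketch
{
    private final String path;
    private final long start;
    private final long length;
    private final long snapshotId;

    SplitInfoSketch(String path, long start, long length, long snapshotId)
    {
        this.path = path;
        this.start = start;
        this.length = length;
        this.snapshotId = snapshotId;
    }

    // Returns debug-friendly properties of the split, including the snapshot it was planned against.
    Map<String, Object> getInfo()
    {
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("path", path);
        info.put("start", start);
        info.put("length", length);
        info.put("snapshotId", snapshotId); // assumed key name
        return info;
    }
}
```

With the snapshot ID in the info map, split-level logs would show which snapshot a split was planned against without any extra lookup.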

@agrawalreetika agrawalreetika requested a review from a team as a code owner September 30, 2025 12:16
@agrawalreetika agrawalreetika force-pushed the iceberg-snapshotId branch 3 times, most recently from 6d42a74 to 17c11a0 Compare October 1, 2025 03:39
@tdcmeehan
Contributor

In cases where it's unambiguous to do so, this should also push down into Iceberg via ConnectorMetadata#getLayoutForConstraint, and essentially turn the table handle itself into a time travel query.

Comment on lines -146 to +148
- affinitySchedulingFileSectionSize);
+ affinitySchedulingFileSectionSize,
+ snapshotId);
Member

I'm a bit concerned about the snapshotId selection here. It seems like we are using the table-level snapshotId taken when the entire table was scanned, but my understanding is that it should be the snapshotId calculated based on the corresponding data file and delete files, right?

Contributor

@hantangwangd this is an important observation. Selecting this column would be more useful if it returned the snapshot ID of the data file, i.e. which snapshot ID created the file. However, this column is primarily intended for filtering, as a way of altering the table handle to force a time travel on the table without introducing a new SPI or connector optimizer. Given this column will be hidden and not intended for direct use, I am comfortable with this being the snapshot ID of the scan, as that fulfills the intended purpose.

Member

@tdcmeehan thanks for the detailed explanation. Based on my understanding of PR #26164 and the comments here, the primary purpose of the $snapshot_id column is to enable predicate pushdown for the filter WHERE $snapshot_id > xxx, which is used to query incremental data since a specified snapshot. Therefore, $snapshot_id should not be allowed to be specified directly in a query, and it should not appear in any filter node that cannot be completely pushed down to the Iceberg connector. Is this correct?

@tdcmeehan
Contributor

One additional thing we should probably do is fail if any predicate compares the snapshot ID to a non-constant value, or if any less-than predicate is supplied, as they're just too dangerous. Only greater-than should be supported, since this will use the latest schema.
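
A rough sketch of the check described above, under stated assumptions: `SnapshotIdPredicateCheck` and its `Operator` enum are hypothetical names, the comparand is modelled as an `Optional<Long>` that is empty when it is not a constant, and `UnsupportedOperationException` stands in for whatever exception the connector would actually raise. The sketch accepts both `>` and `>=`, which is an assumption.

```java
import java.util.Optional;

// Hypothetical validator; it does not use the real Presto Domain/Range classes.
final class SnapshotIdPredicateCheck
{
    enum Operator { EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL }

    private SnapshotIdPredicateCheck() {}

    // Returns the lower bound to push into the table handle, or fails for unsupported predicates.
    // An empty 'constant' models a comparison against a non-constant expression.
    static long requireLowerBound(Operator operator, Optional<Long> constant)
    {
        if (!constant.isPresent()) {
            throw new UnsupportedOperationException("$snapshot_id can only be compared to a constant");
        }
        if (operator != Operator.GREATER_THAN && operator != Operator.GREATER_THAN_OR_EQUAL) {
            throw new UnsupportedOperationException("Unsupported predicate for $snapshot_id: " + operator);
        }
        return constant.get();
    }
}
```

Failing fast keeps unsupported predicates from being evaluated as an ordinary row filter.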

@agrawalreetika agrawalreetika changed the title Add $snapshot_id as hidden column in iceberg table feat(iceberg): Add $snapshot_id as hidden column in iceberg table Oct 6, 2025
// Only support >= X
Optional<Long> lower = Optional.of(((Number) range.getLowBoundedValue()).longValue());
handle = handle.withUpdatedIcebergTableName(
new IcebergTableName(name.getTableName(), name.getTableType(), lower, name.getChangelogEndSnapshot()));
Member Author

@tdcmeehan I wanted to confirm a point here.
Currently, I am updating IcebergTableHandle with the lower bound (X) for a >= X predicate. Shouldn't we instead use the latest available snapshot whose ID is >= X when the query predicate is $snapshot_id >= X? That would ensure we always read using the most recent snapshot schema, which avoids issues that can occur if older snapshots have outdated or incompatible schemas.

session.getRuntimeStats());

- return new EqualityDeletesSplitSource(session, icebergTable, deleteFiles);
+ return new EqualityDeletesSplitSource(session, icebergTable, deleteFiles, table.getIcebergTableName().getSnapshotId().get());
Contributor

getSnapshotId() returns Optional.
Need to check isPresent first before calling get.
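
One way to address this, sketched with hypothetical names: `SnapshotIdResolver` is not part of the PR, and `IllegalStateException` stands in for whatever exception type the connector would prefer. `Optional.orElseThrow` combines the presence check and the read, so an empty value fails with a descriptive message rather than a bare `NoSuchElementException`.

```java
import java.util.Optional;

// Hypothetical helper: unwraps the snapshot ID without calling Optional.get() on an empty value.
final class SnapshotIdResolver
{
    private SnapshotIdResolver() {}

    static long requireSnapshotId(Optional<Long> snapshotId, String tableName)
    {
        return snapshotId.orElseThrow(() ->
                new IllegalStateException("No snapshot ID resolved for table " + tableName));
    }
}
```

At the call site, the `getSnapshotId().get()` expression would then become something like `SnapshotIdResolver.requireSnapshotId(table.getIcebergTableName().getSnapshotId(), tableName)`, where `tableName` is whatever identifier is at hand.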

new IcebergTableName(name.getTableName(), name.getTableType(), lower, name.getChangelogEndSnapshot()));
}
else {
throw new PrestoException(NOT_SUPPORTED, "Unsupported predicate for $snapshot_id; only >= constant is allowed");
Contributor

Should we change the message to mention both >= and =, since both of them are supported?

});
}

@Test
Contributor

Could you also add a case where there are multiple snapshots?


if (domain.isSingleValue()) {
Optional<Long> snapshotId = Optional.of(((Number) domain.getSingleValue()).longValue());
handle = handle.withUpdatedIcebergTableName(
Contributor

What if we are querying time travel tables? Here we will always overwrite the snapshotId.
Probably we should add some check for the snapshot in predicate and the snapshot specified in time travel.
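
A minimal sketch of such a check, under stated assumptions: `SnapshotPredicateTimeTravelCheck` is a hypothetical name, the `Optional<Long>` argument is assumed to be present only when the query explicitly time-travels to a snapshot, and `IllegalArgumentException` stands in for the connector's own exception type.

```java
import java.util.Optional;

// Hypothetical guard: refuses to silently overwrite a snapshot chosen by time travel.
final class SnapshotPredicateTimeTravelCheck
{
    private SnapshotPredicateTimeTravelCheck() {}

    // Returns the snapshot ID to scan, failing when the predicate contradicts the time-travel snapshot.
    static long resolveSnapshotId(Optional<Long> timeTravelSnapshotId, long predicateSnapshotId)
    {
        if (timeTravelSnapshotId.isPresent() && timeTravelSnapshotId.get() != predicateSnapshotId) {
            throw new IllegalArgumentException(
                    "$snapshot_id predicate " + predicateSnapshotId
                            + " conflicts with time travel snapshot " + timeTravelSnapshotId.get());
        }
        return predicateSnapshotId;
    }
}
```

Whether a conflict should fail the query or simply return no rows is a design choice this sketch does not settle; it fails so the conflict is visible.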

@steveburnett
Contributor

Should $snapshot_id be added to the Iceberg doc about hidden columns?

@agrawalreetika
Member Author

@PingLiuPing Thanks for the review, but using $snapshot_id in a filter is problematic. Snapshot IDs are not incremental, so calculating the delta between two snapshots (with a query like WHERE $snapshot_id BETWEEN snap1 AND snap2) would return wrong results, since the comparison is not valid.

Could you please review #26408 which is based on $snapshot_sequence_number?
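
To illustrate the ordering problem, here is a small self-contained sketch. The snapshot IDs and sequence numbers are made-up values and the class is hypothetical. It only assumes what the comment states: snapshot IDs carry no commit order, while sequence numbers grow with each commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a table history, listed in commit order.
final class SnapshotOrderingDemo
{
    static final class Snapshot
    {
        final long snapshotId;
        final long sequenceNumber;

        Snapshot(long snapshotId, long sequenceNumber)
        {
            this.snapshotId = snapshotId;
            this.sequenceNumber = sequenceNumber;
        }
    }

    // Made-up history: IDs are unordered, sequence numbers increase with each commit.
    static final List<Snapshot> HISTORY = Arrays.asList(
            new Snapshot(9000L, 1L),
            new Snapshot(1200L, 2L),
            new Snapshot(5500L, 3L),
            new Snapshot(300L, 4L));

    private SnapshotOrderingDemo() {}

    // Emulates WHERE $snapshot_id BETWEEN low AND high.
    static List<Long> idsInIdRange(long low, long high)
    {
        List<Long> result = new ArrayList<>();
        for (Snapshot snapshot : HISTORY) {
            if (snapshot.snapshotId >= low && snapshot.snapshotId <= high) {
                result.add(snapshot.snapshotId);
            }
        }
        return result;
    }

    // Emulates a range filter on the sequence number instead.
    static List<Long> idsInSequenceRange(long low, long high)
    {
        List<Long> result = new ArrayList<>();
        for (Snapshot snapshot : HISTORY) {
            if (snapshot.sequenceNumber >= low && snapshot.sequenceNumber <= high) {
                result.add(snapshot.snapshotId);
            }
        }
        return result;
    }
}
```

For the delta between the second commit (ID 1200) and the fourth (ID 300), `idsInIdRange(1200L, 300L)` is empty, while `idsInSequenceRange(2L, 4L)` yields the three intended snapshots.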
