@danielcweeks
Contributor

@danielcweeks danielcweeks commented Nov 5, 2025

This is an alternative to #14503, which introduces validation at the TableMetadata level and gives a client the most information about the refreshed and updated state.

This PR exposes only the snapshot ancestry for the base and updated metadata, which fits more cleanly with the public API, but is more limited in terms of what can be validated.

@aiborodin
Contributor

aiborodin commented Nov 6, 2025

Thank you @danielcweeks for the change and @rdblue for the review.

I implemented an alternative API that works in Kafka Connect and reuses most of the existing validation code. It is in the following two PRs:

  1. The suggested new validation API: Implement Snapshot validation API for commits #14514.
  2. The fix for Kafka Connect: Validate parent snapshots in Kafka Connect #14515.

@rdblue @danielcweeks Could you please take a look and let me know what you think?

@danielcweeks danielcweeks force-pushed the snapshot-update-validation branch from 7ab8370 to c3d49e4 on November 6, 2025 22:38
* @return boolean for whether the update is valid
*/
@Override
Boolean apply(Iterable<Snapshot> baseSnapshots);
Contributor

@aiborodin aiborodin Nov 6, 2025

There's no need to pass the full Snapshot history to fix the Kafka Connect issue. This method can accept only new Snapshots as implemented in #14515.

Contributor Author

Unfortunately, I don't think what you have in #14515 is quite correct. There's no guarantee (and it's very commonly not the case) that the offsets will be in the prior commit.

Contributor

@aiborodin aiborodin Nov 7, 2025

> There's no guarantee (and it's very commonly not the case) that the offsets will be in the prior commit.

#14515 checks all new commits when configured with the starting snapshot id.
Wouldn't the current approach result in unnecessary checks of the whole Snapshot history on every validation run?
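
For illustration only, here is a minimal sketch of what scoping validation to a starting snapshot id could look like. It is not the code from #14515 and uses no Iceberg classes: `ScopedValidation` and `newerThan` are hypothetical names, and snapshots are reduced to plain ids listed newest first.

```java
import java.util.ArrayList;
import java.util.List;

final class ScopedValidation {
  // Hypothetical helper: 'ancestorIds' is ordered newest first.
  // Collects the ids committed after 'startingId' and stops as soon as
  // the starting snapshot is reached, so older history is never read.
  static List<Long> newerThan(Iterable<Long> ancestorIds, long startingId) {
    List<Long> result = new ArrayList<>();
    for (Long id : ancestorIds) {
      if (id == startingId) {
        break; // everything from here on was already validated
      }
      result.add(id);
    }
    return result;
  }

  public static void main(String[] args) {
    // Ancestry 5 -> 4 -> 3 -> 2 -> 1, with snapshot 3 as the starting point.
    System.out.println(newerThan(List.of(5L, 4L, 3L, 2L, 1L), 3L));
  }
}
```

If the starting id is not in the ancestry at all, this sketch returns every ancestor, which is the full-history case the question above is concerned about.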

Contributor

@aiborodin aiborodin Nov 7, 2025

I double-checked the implementation of SnapshotUtil::ancestorsOf(): the current approach is also efficient because it walks backwards from the latest snapshot, so a validator can stop early.
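
To make the early exit concrete, here is a minimal sketch in plain Java. It deliberately does not use Iceberg classes: `Snap`, `AncestryWalk`, `ancestorsOf`, `isValid`, and the `commit-id` summary key are all hypothetical stand-ins, and the only assumption carried over from the discussion is that ancestors are produced lazily, newest first.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a snapshot; NOT the Iceberg Snapshot interface.
final class Snap {
  final long id;
  final Long parentId; // null for the first snapshot
  final Map<String, String> summary;

  Snap(long id, Long parentId, Map<String, String> summary) {
    this.id = id;
    this.parentId = parentId;
    this.summary = summary;
  }
}

final class AncestryWalk {
  // Lazily follows parent pointers from startId, newest first.
  // 'visited' only counts how many snapshots were actually read.
  static Iterable<Snap> ancestorsOf(Long startId, Map<Long, Snap> byId, AtomicInteger visited) {
    return () -> new Iterator<Snap>() {
      private Snap upcoming = startId == null ? null : byId.get(startId);

      @Override
      public boolean hasNext() {
        return upcoming != null;
      }

      @Override
      public Snap next() {
        if (upcoming == null) {
          throw new NoSuchElementException();
        }
        Snap current = upcoming;
        visited.incrementAndGet();
        upcoming = current.parentId == null ? null : byId.get(current.parentId);
        return current;
      }
    };
  }

  // Returns false as soon as an ancestor already carries the marker,
  // so older snapshots are never read.
  static boolean isValid(Iterable<Snap> ancestors, String key, String value) {
    for (Snap s : ancestors) {
      if (value.equals(s.summary.get(key))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<Long, Snap> byId = new HashMap<>();
    byId.put(1L, new Snap(1L, null, Map.of()));
    byId.put(2L, new Snap(2L, 1L, Map.of()));
    byId.put(3L, new Snap(3L, 2L, Map.of("commit-id", "abc")));
    byId.put(4L, new Snap(4L, 3L, Map.of()));

    AtomicInteger visited = new AtomicInteger();
    boolean valid = isValid(ancestorsOf(4L, byId, visited), "commit-id", "abc");
    System.out.println(valid + " after reading " + visited.get() + " snapshots");
  }
}
```

In this sketch the marker sits on snapshot 3, so the walk from snapshot 4 reads two snapshots and never touches snapshots 2 and 1.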

Contributor

@aiborodin aiborodin left a comment

It would be nice to allow users to control the scope of validations as implemented in #14514. Users should not need to validate the whole Snapshot history every time.

Contributor

@rdblue rdblue left a comment

This looks like a good, reusable solution to me. Thanks, @danielcweeks! And thanks to @fqaiser94 for the original custom validator idea!

@danielcweeks danielcweeks force-pushed the snapshot-update-validation branch from 0ede749 to 3889ec8 on November 6, 2025 23:54
@aiborodin
Contributor

Thanks for the change @danielcweeks. It looks good and efficient, and we can now reuse this API in Flink as well.

@danielcweeks danielcweeks merged commit d2551a6 into apache:main Nov 7, 2025
43 checks passed
Contributor

@singhpk234 singhpk234 left a comment

LGTM thanks @danielcweeks

import javax.annotation.Nonnull;

/**
* Interface to support validating snapshot ancestry during the commit process.
Contributor

minor/optional: This won't be called for every commit, since not all updates produce a new snapshot. Do we want to say "commits that produce snapshots"?

* @return boolean for whether the update is valid
*/
@Override
Boolean apply(Iterable<Snapshot> baseSnapshots);
Contributor

nit: I wonder if the primitive type boolean would be fine? That way it will always be true or false.
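
A small sketch of the difference behind this nit, in plain Java with no Iceberg classes. `BoxedVsPrimitive` and `unboxingFails` are hypothetical names. The `@Override` in the snippet above suggests the method comes from a generic functional type, which would force the boxed `Boolean`; that is an inference from the diff, not something confirmed in this thread.

```java
import java.util.function.Function;
import java.util.function.Predicate;

final class BoxedVsPrimitive {
  // Reports whether using the boxed result in a condition throws.
  // A Function<T, Boolean> may legally return null, and auto-unboxing
  // null in an 'if' raises NullPointerException.
  static boolean unboxingFails(Function<String, Boolean> validator, String input) {
    try {
      if (validator.apply(input)) {
        return false;
      }
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Function<String, Boolean> boxed = s -> null; // compiles, fails at use
    System.out.println(unboxingFails(boxed, "x"));

    // Predicate.test returns the primitive boolean, so 's -> null'
    // would not compile here; the result is always true or false.
    Predicate<String> primitive = s -> !s.isEmpty();
    System.out.println(primitive.test("x"));
  }
}
```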

@danielcweeks
Contributor Author

Thanks for the reviews/comments @rdblue @aiborodin @bryanck

I'll rebase #14510 on this, but there are also some additional things that impact how we validate in Kafka Connect.

Also, thanks again to @fqaiser94!

@fqaiser94
Contributor

Ha, thanks for the ping. Glad to see this feature land in Iceberg; this will be huge for exactly-once applications!
