-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Add SnapshotUpdateValidator to validate snapshots on commit #14509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SnapshotUpdateValidator to validate snapshots on commit #14509
Conversation
api/src/main/java/org/apache/iceberg/SnapshotUpdateValidator.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/SnapshotUpdateValidator.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestSnapshotProducer.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestSnapshotProducer.java
Outdated
Show resolved
Hide resolved
|
Thank you @danielcweeks for the change and @rdblue for the review. I implemented an alternative API that works in Kafka Connect and reuses most of the existing validation code. It is in the following two PRs:
@rdblue @danielcweeks Could you please take a look and let me know what you think? |
4dec5aa to
9164f73
Compare
8196a92 to
7ab8370
Compare
7ab8370 to
c3d49e4
Compare
| * @return boolean for whether the update is valid | ||
| */ | ||
| @Override | ||
| Boolean apply(Iterable<Snapshot> baseSnapshots); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to pass the full Snapshot history to fix the Kafka Connect issue. This method can accept only new Snapshots as implemented in #14515.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I don't think what you have in #14515 is quite correct. There's no guarantee (and it's very commonly not the case) that the offsets will be in the prior commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no guarantee (and it's very commonly not the case) that the offsets will be in the prior commit.
#14515 checks all new commits when configured with the starting snapshot id.
Wouldn't the current approach result in unnecessary checks of the whole Snapshot history on every validation run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double-checked the implementation of SnapshotUtil::ancestorsOf() - the current approach is also efficient because it walks backwards from the latest snapshot, so we can short-cut early.
aiborodin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to allow users to control the scope of validations as implemented in #14514. Users should not need to validate the whole Snapshot history every time.
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good, reusable solution to me. Thanks, @danielcweeks! And thanks to @fqaiser94 for the original custom validator idea!
api/src/main/java/org/apache/iceberg/SnapshotAncestryValidator.java
Outdated
Show resolved
Hide resolved
0ede749 to
3889ec8
Compare
|
Thanks for the change @danielcweeks. It looks good and efficient, and we can now reuse this API in Flink as well. |
singhpk234
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks @danielcweeks
| import javax.annotation.Nonnull; | ||
|
|
||
| /** | ||
| * Interface to support validating snapshot ancestry during the commit process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor/optional : This won't be called per commit process ? as not all updates produce a new snapshot, do we wanna say commits that produce snapshots ?
| * @return boolean for whether the update is valid | ||
| */ | ||
| @Override | ||
| Boolean apply(Iterable<Snapshot> baseSnapshots); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wonder if just the primitive type boolean is fine ? this way it will always to true / false ?
|
Thanks for the reviews/comments @rdblue @aiborodin @bryanck I'll rebase #14510 on this, but there are also some additional things that impact how we validate in KC. Also, thanks again to @fqaiser94! |
|
ha, thanks for the ping, glad to see this feature land in iceberg, this will be huge for exactly-once applications! |
This is an alternative to #14503 which introduces validation at the
TableMetadataand provides the most information to a client as to what the refreshed and updated state is.This PR exposes only the snapshot ancestry for the base and updated metadata, which fits more cleanly with the public API, but is more limited in terms of what can be validated.