-
Notifications
You must be signed in to change notification settings - Fork 3k
API: Add an action to generate table change set #4708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| * | ||
| * @return this for method chaining | ||
| */ | ||
| GenerateChangeSet forCurrentSnapshot(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original discussion to remove this API is here, #4539 (comment). I think it is still useful as a shortcut of forSnapshot(table.currentSnapshot().snapshotId()), like a syntactic sugar of a programming language. It is not worth if it causes any confusion, which I don't think it will. I'm open to any suggestion though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how frequently folks will be calling this. Is there a particular use case? I thought folks would usually have a starting point and would be interested to see changes since that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not so frequent. I will remove this. We can always add it back if this is useful.
| GenerateChangeSet afterSnapshot(long fromSnapshotId); | ||
|
|
||
| /** | ||
| * Emit change data set from the start snapshot (exclusive) to the end snapshot (inclusive). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]
| * Emit change data set from the start snapshot (exclusive) to the end snapshot (inclusive). | |
| * Emit changed data set from the start snapshot (exclusive) to the end snapshot (inclusive). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @singhpk234, I meant to use change as a noun here. So it is a table change, which is presented as a data set, instead of a data set of the table which has been changed.
|
@flyrain, is this really an "Action"? What if we had a separate API for helpers like this that wasn't called "actions"? Then we wouldn't need to add and remove things from actions. I'd prefer to keep it separate, though I'm open to hearing the arguments for mixing this in. |
|
@rdblue, @aokolnychyi and me have planned a separated PR for the CDC scan API, which will be more general, can be used across engines.
Is the helper the CDC scan API, or an abstract(interface) of the CDC action and the CDC scan API? I think the later makes more sense, which I image is a pretty thin layer. Is it necessary? |
|
Let me take a look. Sorry for the delay on my side. |
|
|
||
| package org.apache.iceberg.actions; | ||
|
|
||
| public interface GenerateChangeSet extends Action<GenerateChangeSet, GenerateChangeSet.Result> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a few sentences about the purpose of this action similar to other actions?
| * | ||
| * @return this for method chaining | ||
| */ | ||
| GenerateChangeSet forCurrentSnapshot(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how frequently folks will be calling this. Is there a particular use case? I thought folks would usually have a starting point and would be interested to see changes since that.
| * @param fromSnapshotId id of the start snapshot | ||
| * @return this for method chaining | ||
| */ | ||
| GenerateChangeSet afterSnapshot(long fromSnapshotId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This naming seems good to me but I wonder whether we should follow whatever was recently added in #4580. One thing I like about that API is using explicit inclusive/exclusive words whenever configuring the start snapshot.
GenerateChangeSet fromSnapshotInclusive(long fromSnapshotId);
GenerateChangeSet fromSnapshotExclusive(long fromSnapshotId);
GenerateChangeSet toSnapshot(long toSnapshotId);
Thoughts, @flyrain @stevenzwu @szehon-ho @rdblue @RussellSpitzer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea to use the same name as #4580, will make the change if there is no objection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change in the new commit.
| /** | ||
| * The action result that contains a dataset of changed rows. | ||
| */ | ||
| interface Result<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally don't think the action API should be limited to table maintenance. When we added it, the idea was to provide solutions for common problems that require a query engine. Initially, those happened to be mostly related to table maintenance. However, I wouldn't mind actions for other purposes.
We all agree to share the planning part across engines via a dedicated scan API. However, we still need to build an engine-specific representation somewhere. Of course, we could have a utility but I am not a big fan of exposing utilities to users. The action API is much more user-facing and requires us to think about the proper interface and keep the compatibility. We had many issues with Spark utils exposed to the users in the past.
One thing about using the action API for CDC that concerns me is the need to parameterize the result. Our action API was engine agnostic so far but we need to return Spark Dataset here. I am not sure how big of a deal it is.
Thoughts, @flyrain @rdblue @RussellSpitzer @szehon-ho @stevenzwu?
@rdblue @stevenzwu, I think both of you had concerns about using an action. Could you elaborate a little bit? Given that we will have a common Scan API to share the planning logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative we discussed is using a new metadata table so that users can point to it to load changes. What is our long-term plan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aokolnychyi My question would be why not exposing the CDC read (change set) via a source (in Spark or other engines). I remember @rdblue mentioned some problem with Spark source for that purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenzwu, do you mean a custom source for CDC records or just extending the main data source integration? Spark, unlike Flink, does not have an API for CDC which can be used by data sources. We would want to add that to Spark eventually but it will be a major effort and will take tons of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx for explaining. I guess that is the problem Ryan mentioned. Spark source API doesn't support CDC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both of you had concerns about using an action. Could you elaborate a little bit?
I don't have a problem with using Actions for other tasks, but I think those other tasks should be actions taken on the table. It seems odd to me to have an "action" be something that essentially builds a view of a table that is evaluated lazily. In this case, the action doesn't really do anything. It is a convenience.
That's why the return value doesn't really fit. Normally, you get back a summary of what was done, but in this case nothing was done and you get back the resulting dataframe.
The other good point is what Steven raised. I would like to make this a source so that you can use it as natively in Spark as possible. We can use the same approach as time travel and metadata tables. If you load db.table.change_log, then you get the CDC view of the table. And you can use time travel selectors (from-snapshot-id, to-snapshot-id, etc) to select the range of time. I'd definitely prefer this more native approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata table came up a few times in the original discussion but we were worried about the extra complexity its would require and how much this would delay the CDC implementation. Let's estimate a little bit the amount of changes that will be required to support a metadata table approach. We can start with batch support only for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to chime in late, but I like metadata table way as well, as was discussing with @flyrain, though I might be missing some context.
To me it's a better UX (user can use SQL to query and do analytic on it, time-travel). And code wise it may make the integration with other engine like Trino easier as there's some common interfaces exposed across metadata tables they may take advantage of instead of a separate Action interface, though I admit that's a bit theoretical at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no doubt that metadata table is better on UX. We've planned the metadata table in the phase 2, check the milestone part in the design doc. The idea was to bring the basic CDC functionality to users sooner to unblock them. Some users really want it ASAP.
|
PR #4870 is one way to design the CDC scan. Let me know what everybody thinks. |
|
Close this. PR #4870 is the way to go. |
Extracted the interface change from the draft PR #4539. According the MVP design and recent community discussion, we are going to create an action interface first for Change Data Capture(CDC).
cc @aokolnychyi @RussellSpitzer @szehon-ho @jackye1995 @kbendick @karuppayya @chenjunjiedada @stevenzwu @rdblue @Reo-LEI @hameizi @singhpk234 @rajarshisarkar