Skip to content

Conversation

@flyrain
Copy link
Contributor

@flyrain flyrain commented May 5, 2022

Extracted the interface change from the draft PR #4539. According the MVP design and recent community discussion, we are going to create an action interface first for Change Data Capture(CDC).

cc @aokolnychyi @RussellSpitzer @szehon-ho @jackye1995 @kbendick @karuppayya @chenjunjiedada @stevenzwu @rdblue @Reo-LEI @hameizi @singhpk234 @rajarshisarkar

*
* @return this for method chaining
*/
GenerateChangeSet forCurrentSnapshot();
Copy link
Contributor Author

@flyrain flyrain May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original discussion to remove this API is here, #4539 (comment). I think it is still useful as a shortcut of forSnapshot(table.currentSnapshot().snapshotId()), like a syntactic sugar of a programming language. It is not worth if it causes any confusion, which I don't think it will. I'm open to any suggestion though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how frequently folks will be calling this. Is there a particular use case? I thought folks would usually have a starting point and would be interested to see changes since that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not so frequent. I will remove this. We can always add it back if this is useful.

GenerateChangeSet afterSnapshot(long fromSnapshotId);

/**
* Emit change data set from the start snapshot (exclusive) to the end snapshot (inclusive).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit]

Suggested change
* Emit change data set from the start snapshot (exclusive) to the end snapshot (inclusive).
* Emit changed data set from the start snapshot (exclusive) to the end snapshot (inclusive).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @singhpk234, I meant to use change as a noun here. So it is a table change, which is presented as a data set, instead of a data set of the table which has been changed.

@flyrain flyrain changed the title API: Add an action GenerateChangeSet API: Add an action to generate table change set May 9, 2022
@rdblue
Copy link
Contributor

rdblue commented May 10, 2022

@flyrain, is this really an "Action"? What if we had a separate API for helpers like this that wasn't called "actions"? Then we wouldn't need to add and remove things from actions. I'd prefer to keep it separate, though I'm open to hearing the arguments for mixing this in.

@flyrain
Copy link
Contributor Author

flyrain commented May 10, 2022

@rdblue, @aokolnychyi and me have planned a separated PR for the CDC scan API, which will be more general, can be used across engines.

What if we had a separate API for helpers like this that wasn't called "actions"?

Is the helper the CDC scan API, or an abstract(interface) of the CDC action and the CDC scan API? I think the later makes more sense, which I image is a pretty thin layer. Is it necessary?

@aokolnychyi
Copy link
Contributor

Let me take a look. Sorry for the delay on my side.


package org.apache.iceberg.actions;

public interface GenerateChangeSet extends Action<GenerateChangeSet, GenerateChangeSet.Result> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a few sentences about the purpose of this action similar to other actions?

*
* @return this for method chaining
*/
GenerateChangeSet forCurrentSnapshot();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how frequently folks will be calling this. Is there a particular use case? I thought folks would usually have a starting point and would be interested to see changes since that.

* @param fromSnapshotId id of the start snapshot
* @return this for method chaining
*/
GenerateChangeSet afterSnapshot(long fromSnapshotId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming seems good to me but I wonder whether we should follow whatever was recently added in #4580. One thing I like about that API is using explicit inclusive/exclusive words whenever configuring the start snapshot.

GenerateChangeSet fromSnapshotInclusive(long fromSnapshotId);
GenerateChangeSet fromSnapshotExclusive(long fromSnapshotId);
GenerateChangeSet toSnapshot(long toSnapshotId);

Thoughts, @flyrain @stevenzwu @szehon-ho @rdblue @RussellSpitzer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea to use the same name as #4580, will make the change if there is no objection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change in the new commit.

/**
* The action result that contains a dataset of changed rows.
*/
interface Result<T> {
Copy link
Contributor

@aokolnychyi aokolnychyi May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't think the action API should be limited to table maintenance. When we added it, the idea was to provide solutions for common problems that require a query engine. Initially, those happened to be mostly related to table maintenance. However, I wouldn't mind actions for other purposes.

We all agree to share the planning part across engines via a dedicated scan API. However, we still need to build an engine-specific representation somewhere. Of course, we could have a utility but I am not a big fan of exposing utilities to users. The action API is much more user-facing and requires us to think about the proper interface and keep the compatibility. We had many issues with Spark utils exposed to the users in the past.

One thing about using the action API for CDC that concerns me is the need to parameterize the result. Our action API was engine agnostic so far but we need to return Spark Dataset here. I am not sure how big of a deal it is.

Thoughts, @flyrain @rdblue @RussellSpitzer @szehon-ho @stevenzwu?

@rdblue @stevenzwu, I think both of you had concerns about using an action. Could you elaborate a little bit? Given that we will have a common Scan API to share the planning logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative we discussed is using a new metadata table so that users can point to it to load changes. What is our long-term plan?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aokolnychyi My question would be why not exposing the CDC read (change set) via a source (in Spark or other engines). I remember @rdblue mentioned some problem with Spark source for that purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu, do you mean a custom source for CDC records or just extending the main data source integration? Spark, unlike Flink, does not have an API for CDC which can be used by data sources. We would want to add that to Spark eventually but it will be a major effort and will take tons of time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for explaining. I guess that is the problem Ryan mentioned. Spark source API doesn't support CDC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both of you had concerns about using an action. Could you elaborate a little bit?

I don't have a problem with using Actions for other tasks, but I think those other tasks should be actions taken on the table. It seems odd to me to have an "action" be something that essentially builds a view of a table that is evaluated lazily. In this case, the action doesn't really do anything. It is a convenience.

That's why the return value doesn't really fit. Normally, you get back a summary of what was done, but in this case nothing was done and you get back the resulting dataframe.

The other good point is what Steven raised. I would like to make this a source so that you can use it as natively in Spark as possible. We can use the same approach as time travel and metadata tables. If you load db.table.change_log, then you get the CDC view of the table. And you can use time travel selectors (from-snapshot-id, to-snapshot-id, etc) to select the range of time. I'd definitely prefer this more native approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata table came up a few times in the original discussion but we were worried about the extra complexity its would require and how much this would delay the CDC implementation. Let's estimate a little bit the amount of changes that will be required to support a metadata table approach. We can start with batch support only for now.

Copy link
Member

@szehon-ho szehon-ho May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to chime in late, but I like metadata table way as well, as was discussing with @flyrain, though I might be missing some context.

To me it's a better UX (user can use SQL to query and do analytic on it, time-travel). And code wise it may make the integration with other engine like Trino easier as there's some common interfaces exposed across metadata tables they may take advantage of instead of a separate Action interface, though I admit that's a bit theoretical at this point.

Copy link
Contributor Author

@flyrain flyrain May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no doubt that metadata table is better on UX. We've planned the metadata table in the phase 2, check the milestone part in the design doc. The idea was to bring the basic CDC functionality to users sooner to unblock them. Some users really want it ASAP.

@aokolnychyi
Copy link
Contributor

PR #4870 is one way to design the CDC scan. Let me know what everybody thinks.

@flyrain
Copy link
Contributor Author

flyrain commented Jun 17, 2022

Close this. PR #4870 is the way to go.

@flyrain flyrain closed this Jun 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants