API: Add an action to generate table change set #4708

flyrain · 2022-05-05T21:58:46Z

Extracted the interface change from the draft PR #4539. According to the MVP design and recent community discussion, we are going to create an action interface first for Change Data Capture (CDC).

cc @aokolnychyi @RussellSpitzer @szehon-ho @jackye1995 @kbendick @karuppayya @chenjunjiedada @stevenzwu @rdblue @Reo-LEI @hameizi @singhpk234 @rajarshisarkar

flyrain · 2022-05-05T22:08:51Z

api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java

+   *
+   * @return this for method chaining
+   */
+  GenerateChangeSet forCurrentSnapshot();


The original discussion about removing this API is in #4539 (comment). I think it is still useful as a shortcut for forSnapshot(table.currentSnapshot().snapshotId()), a kind of syntactic sugar. It is not worth keeping if it causes confusion, which I don't think it will. I'm open to suggestions, though.

I am not sure how frequently folks will be calling this. Is there a particular use case? I thought folks would usually have a starting point and would be interested to see changes since that.

Probably not that frequently. I will remove this. We can always add it back if it turns out to be useful.
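For illustration only, the "syntactic sugar" relationship discussed in this thread can be sketched as a default method that delegates to forSnapshot. All names below (TableLike, ChangeSetBuilder, RecordingBuilder) are hypothetical stand-ins, not the interface from this PR; in particular, the sketch passes the table explicitly, whereas the proposed forCurrentSnapshot() took no argument.

```java
import java.util.ArrayList;
import java.util.List;

public class ForCurrentSnapshotSketch {

  /** Hypothetical stand-in for a table handle; it only exposes the current snapshot id. */
  interface TableLike {
    long currentSnapshotId();
  }

  /** Hypothetical builder, not the GenerateChangeSet interface from this PR. */
  interface ChangeSetBuilder {
    ChangeSetBuilder forSnapshot(long snapshotId);

    // The shortcut adds no new behavior: it only resolves the current snapshot id
    // and delegates to forSnapshot.
    default ChangeSetBuilder forCurrentSnapshot(TableLike table) {
      return forSnapshot(table.currentSnapshotId());
    }
  }

  /** Records every snapshot id it is configured with, so the delegation is observable. */
  static class RecordingBuilder implements ChangeSetBuilder {
    final List<Long> requested = new ArrayList<>();

    @Override
    public ChangeSetBuilder forSnapshot(long snapshotId) {
      requested.add(snapshotId);
      return this;
    }
  }

  public static void main(String[] args) {
    RecordingBuilder builder = new RecordingBuilder();
    TableLike table = () -> 42L; // a table whose current snapshot id is 42
    builder.forCurrentSnapshot(table);
    System.out.println(builder.requested); // prints [42]
  }
}
```

Because the shortcut is pure delegation, dropping it from the interface loses no capability; callers can always pass the current snapshot id themselves.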

singhpk234 · 2022-05-07T18:27:16Z

api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java

+  GenerateChangeSet afterSnapshot(long fromSnapshotId);
+
+  /**
+   * Emit change data set from the start snapshot (exclusive) to the end snapshot (inclusive).


[nit]

Suggested change

* Emit change data set from the start snapshot (exclusive) to the end snapshot (inclusive).

* Emit changed data set from the start snapshot (exclusive) to the end snapshot (inclusive).

Hi @singhpk234, I meant to use "change" as a noun here. It is a set of table changes presented as a data set, not a data set of the table that has been changed.

rdblue · 2022-05-10T02:32:07Z

@flyrain, is this really an "Action"? What if we had a separate API for helpers like this that wasn't called "actions"? Then we wouldn't need to add and remove things from actions. I'd prefer to keep it separate, though I'm open to hearing the arguments for mixing this in.

flyrain · 2022-05-10T17:48:44Z

@rdblue, @aokolnychyi and I have planned a separate PR for the CDC scan API, which will be more general and can be used across engines.

What if we had a separate API for helpers like this that wasn't called "actions"?

Is the helper the CDC scan API, or an abstraction (interface) over the CDC action and the CDC scan API? I think the latter makes more sense, and I imagine it would be a pretty thin layer. Is it necessary?

aokolnychyi · 2022-05-16T20:16:05Z

Let me take a look. Sorry for the delay on my side.

aokolnychyi · 2022-05-16T20:24:17Z

api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java

+
+package org.apache.iceberg.actions;
+
+public interface GenerateChangeSet extends Action<GenerateChangeSet, GenerateChangeSet.Result> {


Shall we add a few sentences about the purpose of this action similar to other actions?

aokolnychyi · 2022-05-16T20:27:57Z

api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java

+   *
+   * @return this for method chaining
+   */
+  GenerateChangeSet forCurrentSnapshot();


I am not sure how frequently folks will be calling this. Is there a particular use case? I thought folks would usually have a starting point and would be interested to see changes since that.

aokolnychyi · 2022-05-16T20:32:01Z

api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java

+   * @param fromSnapshotId id of the start snapshot
+   * @return this for method chaining
+   */
+  GenerateChangeSet afterSnapshot(long fromSnapshotId);


This naming seems good to me but I wonder whether we should follow whatever was recently added in #4580. One thing I like about that API is using explicit inclusive/exclusive words whenever configuring the start snapshot.

GenerateChangeSet fromSnapshotInclusive(long fromSnapshotId);
GenerateChangeSet fromSnapshotExclusive(long fromSnapshotId);
GenerateChangeSet toSnapshot(long toSnapshotId);

Thoughts, @flyrain @stevenzwu @szehon-ho @rdblue @RussellSpitzer?

I like the idea of using the same names as #4580 and will make the change if there is no objection.

Made the change in the new commit.
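A minimal sketch of how the explicit inclusive/exclusive naming could behave, assuming a simple linear snapshot history held in a list. The class SnapshotRangeSketch and its selectedSnapshots method are hypothetical and exist only to make the range semantics concrete; only the three method names come from the suggestion above.

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotRangeSketch {
  private final List<Long> lineage; // snapshot ids, oldest first (assumed linear history)
  private Long fromId;              // null means "start at the oldest snapshot"
  private boolean fromInclusive;
  private Long toId;                // null means "end at the newest snapshot"

  public SnapshotRangeSketch(List<Long> lineage) {
    this.lineage = lineage;
  }

  public SnapshotRangeSketch fromSnapshotInclusive(long fromSnapshotId) {
    this.fromId = fromSnapshotId;
    this.fromInclusive = true;
    return this;
  }

  public SnapshotRangeSketch fromSnapshotExclusive(long fromSnapshotId) {
    this.fromId = fromSnapshotId;
    this.fromInclusive = false;
    return this;
  }

  public SnapshotRangeSketch toSnapshot(long toSnapshotId) {
    this.toId = toSnapshotId;
    return this;
  }

  /** Returns the snapshot ids whose changes would be emitted; the end snapshot is always inclusive. */
  public List<Long> selectedSnapshots() {
    int start = 0;
    if (fromId != null) {
      int idx = lineage.indexOf(fromId);
      if (idx < 0) {
        throw new IllegalArgumentException("Unknown start snapshot: " + fromId);
      }
      start = fromInclusive ? idx : idx + 1;
    }
    int end = lineage.size(); // exclusive upper bound for subList
    if (toId != null) {
      int idx = lineage.indexOf(toId);
      if (idx < 0) {
        throw new IllegalArgumentException("Unknown end snapshot: " + toId);
      }
      end = idx + 1;
    }
    if (start > end) {
      throw new IllegalArgumentException("Start snapshot is after end snapshot");
    }
    return new ArrayList<>(lineage.subList(start, end));
  }

  public static void main(String[] args) {
    List<Long> history = List.of(10L, 20L, 30L, 40L);
    System.out.println(
        new SnapshotRangeSketch(history).fromSnapshotExclusive(10L).toSnapshot(30L).selectedSnapshots()); // [20, 30]
    System.out.println(
        new SnapshotRangeSketch(history).fromSnapshotInclusive(10L).toSnapshot(30L).selectedSnapshots()); // [10, 20, 30]
  }
}
```

With the boundary spelled out in the method name, the caller does not have to consult the Javadoc to know whether the start snapshot's own changes are part of the result.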

aokolnychyi · 2022-05-16T20:42:36Z

api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java

+  /**
+   * The action result that contains a dataset of changed rows.
+   */
+  interface Result<T> {


I personally don't think the action API should be limited to table maintenance. When we added it, the idea was to provide solutions for common problems that require a query engine. Initially, those happened to be mostly related to table maintenance. However, I wouldn't mind actions for other purposes.

We all agree to share the planning part across engines via a dedicated scan API. However, we still need to build an engine-specific representation somewhere. Of course, we could have a utility but I am not a big fan of exposing utilities to users. The action API is much more user-facing and requires us to think about the proper interface and keep the compatibility. We had many issues with Spark utils exposed to the users in the past.

One thing about using the action API for CDC that concerns me is the need to parameterize the result. Our action API has been engine-agnostic so far, but here we need to return a Spark Dataset. I am not sure how big of a deal it is.

Thoughts, @flyrain @rdblue @RussellSpitzer @szehon-ho @stevenzwu?
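To make the parameterization concern concrete, here is a rough sketch under stated assumptions: ChangeSetAction and ListBackedAction are hypothetical names, and List<String> merely stands in for an engine-specific type such as a Spark Dataset, so the example runs without Spark on the classpath.

```java
import java.util.List;

public class ParameterizedResultSketch {

  /** Engine-agnostic side: the API knows nothing about the container that holds the rows. */
  interface ChangeSetAction<T> {
    Result<T> execute();

    /** The result carries the changes themselves rather than a summary of work done. */
    interface Result<T> {
      T changeSet();
    }
  }

  /**
   * Engine-specific side: binds T to its own type.
   * List<String> is a placeholder for something like a Spark Dataset (an assumption for this sketch).
   */
  static class ListBackedAction implements ChangeSetAction<List<String>> {
    @Override
    public Result<List<String>> execute() {
      List<String> rows = List.of("INSERT,1,a", "DELETE,2,b");
      return () -> rows; // Result has a single abstract method, so a lambda suffices
    }
  }

  public static void main(String[] args) {
    ChangeSetAction.Result<List<String>> result = new ListBackedAction().execute();
    System.out.println(result.changeSet().size()); // prints 2
  }
}
```

The type parameter keeps the API module free of engine dependencies, at the cost of every caller having to name the engine type when declaring the action.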

@rdblue @stevenzwu, I think both of you had concerns about using an action. Could you elaborate a little bit? Given that we will have a common Scan API to share the planning logic.

Another alternative we discussed is using a new metadata table so that users can point to it to load changes. What is our long-term plan?

@aokolnychyi My question would be: why not expose the CDC read (change set) via a source (in Spark or other engines)? I remember @rdblue mentioned some problem with the Spark source for that purpose.

@stevenzwu, do you mean a custom source for CDC records or just extending the main data source integration? Spark, unlike Flink, does not have an API for CDC which can be used by data sources. We would want to add that to Spark eventually but it will be a major effort and will take tons of time.

Thanks for explaining. I guess that is the problem Ryan mentioned: the Spark source API doesn't support CDC.

I think both of you had concerns about using an action. Could you elaborate a little bit?

I don't have a problem with using Actions for other tasks, but I think those other tasks should be actions taken on the table. It seems odd to me to have an "action" be something that essentially builds a view of a table that is evaluated lazily. In this case, the action doesn't really do anything. It is a convenience.

That's why the return value doesn't really fit. Normally, you get back a summary of what was done, but in this case nothing was done and you get back the resulting dataframe.

The other good point is what Steven raised. I would like to make this a source so that you can use it as natively in Spark as possible. We can use the same approach as time travel and metadata tables. If you load db.table.change_log, then you get the CDC view of the table. And you can use time travel selectors (from-snapshot-id, to-snapshot-id, etc) to select the range of time. I'd definitely prefer this more native approach.

The metadata table came up a few times in the original discussion, but we were worried about the extra complexity it would require and how much this would delay the CDC implementation. Let's roughly estimate the amount of change required to support a metadata table approach. We can start with batch support only for now.

Sorry to chime in late, but I like the metadata table approach as well, as I was discussing with @flyrain, though I might be missing some context.

To me it's a better UX (users can use SQL to query it, run analytics on it, and time-travel). Code-wise, it may also make integration with other engines like Trino easier, since there are some common interfaces exposed across metadata tables that they could take advantage of instead of a separate Action interface, though I admit that's a bit theoretical at this point.

There is no doubt that the metadata table is better for UX. We've planned the metadata table for phase 2; see the milestones section of the design doc. The idea was to bring basic CDC functionality to users sooner to unblock them. Some users really want it ASAP.

aokolnychyi · 2022-05-25T15:54:52Z

PR #4870 is one way to design the CDC scan. Let me know what everybody thinks.

flyrain · 2022-06-17T21:35:16Z

Closing this. PR #4870 is the way to go.

API: Add action GenerateChangeSet

213c234

github-actions bot added API core labels May 5, 2022

flyrain commented May 5, 2022


singhpk234 reviewed May 9, 2022


flyrain changed the title ~~API: Add an action GenerateChangeSet~~ API: Add an action to generate table change set May 9, 2022

aokolnychyi reviewed May 16, 2022


Resolve comments.

0f25b01

flyrain closed this Jun 17, 2022
