
Conversation

@flyrain
Contributor

@flyrain flyrain commented Apr 11, 2023

It is common for users to want the net changes across multiple snapshots without caring about what happened in between. This PR provides an option to output net changes.
cc @aokolnychyi @szehon-ho @RussellSpitzer
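
For illustration only, a minimal usage sketch in Java. It assumes the new flag surfaces as a net_changes argument on the create_changelog_view procedure and that the resulting view keeps a default <table>_changes name; exact parameter and view names may differ from what is finally merged.

  import org.apache.spark.sql.SparkSession;

  public class NetChangesExample {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
          .appName("net-changes-example")
          .master("local[*]")
          .getOrCreate();

      // Hypothetical invocation: build a changelog view over a snapshot range and
      // request net changes only.
      spark.sql(
          "CALL spark_catalog.system.create_changelog_view("
              + " table => 'db.tbl',"
              + " options => map('start-snapshot-id', '1', 'end-snapshot-id', '4'),"
              + " net_changes => true)");

      // Changes that cancel out across the range are dropped; the view name is
      // assumed to default to <table>_changes.
      spark.sql("SELECT * FROM tbl_changes").show();
    }
  }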

@flyrain flyrain requested a review from aokolnychyi April 11, 2023 20:43
@github-actions github-actions bot added the spark label Apr 11, 2023
@anigos

anigos commented Apr 12, 2023

More importantly, users avoid additional compute to determine the intended change for the key(s). That logic can be compute-intensive for bigger dimension or fact tables. Thanks @flyrain for prioritising it.

COMPUTE_UPDATES_PARAM,
REMOVE_CARRYOVERS_PARAM,
IDENTIFIER_COLUMNS_PARAM,
NET_CHANGES,
Member

@RussellSpitzer RussellSpitzer Apr 12, 2023


Is netChanges incompatible with "remove_carryovers"? I believe the way it is set up in this PR, net changes means remove carryovers must be true.

Contributor Author


Here is a list of the options to generate a changelog before this PR:

  1. Keeping carry-over rows and not computing pre/post images. I doubt how useful this is, since users can directly query the change table with select * from table.changes.
  2. Removing carry-over rows, but still no pre/post images. This is the default behavior.
  3. Removing carry-over rows and computing pre/post images.

In this PR, we add a fourth option:

  4. Removing carry-over rows, computing pre/post images, and removing intermediate changes across multiple snapshots.

Does that make sense?

Contributor

@aokolnychyi aokolnychyi Apr 13, 2023


Option 1 can serve as a temporary workaround, as there is no way to pass options in SQL, so we can't configure snapshot ranges when using changes directly.

Russell does bring up a valid point that we can't set compute_updates or remove_carryovers to false if the new flag is true. Let me think.

Contributor


Maybe I read it wrong. Do we take net_changes into account only when computing updates?

Contributor Author

@flyrain flyrain Apr 13, 2023


I think it is nice to put net_changes under compute_updates, even though we could do net changes without computing updates. By combining them, we expose fewer options to users for a better UX. Users won't be left thinking, "Wow, there are so many knobs; which one or which combination should I use?"

I'm open to suggestions, though. Let me know how useful net changes would be without compute updates; we can provide that option if we have a strong reason.

Member


Would remove_carryover be better as a three-way enum: 'keep', 'remove' (per snapshot), 'remove_net'? Right now there are 3 knobs, so 8 combos vs. 6. Or did I misunderstand?

Contributor Author


Here are the combinations:

  1. no remove_carryovers
  2. remove_carryovers, single-snapshot
  3. remove_carryovers, net change/cross-snapshot
  4. compute_update with remove_carryovers, single-snapshot
  5. compute_update with remove_carryovers, net change/cross-snapshot

1, 2, and 4 have been released; 3 is this PR; 5 is TBD.

Member


@flyrain yeah, so my suggestion is to make RemoveCarryOverMode a three-way enum and deprecate the RemoveCarryOver boolean.

Currently, with three boolean configs, the user gets 8 possibilities. With a three-way enum (none, net-changes, per-snapshot), the user gets 6 possibilities. The extra one is compute_update with no remove_carryover, which we can validate against.

wdyt?
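
For reference, a rough Java sketch of the suggested three-way mode. It is purely hypothetical; as noted further down, the final change deprecated the boolean instead of introducing this enum, and the names here are illustrative.

  // Hypothetical enum, not part of the merged change; names are illustrative.
  enum RemoveCarryoverMode {
    NONE,          // keep carryover rows
    PER_SNAPSHOT,  // remove carryovers within each snapshot
    NET_CHANGES    // remove carryovers across the whole snapshot range
  }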

Contributor Author


Sure, let's discuss this offline a bit.

Contributor Author

@flyrain flyrain Jun 27, 2023


Deprecated the remove_carryovers option per the discussion; the procedure will always remove carryovers. With the deprecation, we now have 4 options:

  1. remove_carryovers, single-snapshot
  2. remove_carryovers, net change/cross-snapshot
  3. compute_update with remove_carryovers, single-snapshot
  4. compute_update with remove_carryovers, net change/cross-snapshot

@aokolnychyi
Contributor

Let me take a look today.

@flyrain flyrain force-pushed the cdc-across-snapshot branch from 675f38f to dd95e5f on May 19, 2023 00:13
* is 1-to-1 mapping to snapshot id.
* </ul>
*/
public class RemoveNetCarryoverIterator extends RemoveCarryoverIterator {
Contributor Author

@flyrain flyrain May 19, 2023


We've got this new iterator that can do everything RemoveCarryoverIterator can and more. I'm thinking we could merge them. It'd use more memory since it caches rows. If the memory increase isn't an issue, this could simplify our code. Thoughts?

Member


Why do we need to use the list here? Don't we only need to know the first row and the last?

Contributor Author


I looked a bit more. We should be able to achieve the same without a list. Let me post a new commit soon.

Contributor Author


I used the list since we can cache either a delete row or an insert row. But looking a bit more, it is either-or at any time: we will never have cached rows with mixed inserts/deletes, since we remove them as a pair in that case. For example, the cached rows could be

(1, 'a', delete)
(1, 'a', delete)
(1, 'a', delete)

or

(1, 'a', insert)
(1, 'a', insert)

But it will never be

(1, 'a', delete)
(1, 'a', insert)

With that, a cached row count is good enough. I removed the list in the new commit. This new iterator can do more than its parent at the same cost, so I think it is a good idea to keep just one. Thoughts?
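
As a standalone illustration of that invariant (not the Iceberg iterator itself; the record type and method names below are made up), pairing off opposite change types within a run of identical records leaves an all-delete or all-insert remainder, so a single cached row plus a count is enough:

  import java.util.ArrayList;
  import java.util.List;

  public class NetChangeCountSketch {
    // simplified stand-in for a changelog row: identity columns plus a change type
    record Change(int id, String value, String type) {}

    // collapses a run of identical records (already sorted together) into its net changes
    static List<Change> netChanges(List<Change> sortedRun) {
      Change cached = null;
      int cachedCount = 0;
      for (Change row : sortedRun) {
        if (cachedCount == 0) {
          cached = row;          // start a new cache
          cachedCount = 1;
        } else if (!row.type().equals(cached.type())) {
          cachedCount--;         // opposite change type cancels one cached row
        } else {
          cachedCount++;         // same change type: another potential net change
        }
      }
      List<Change> result = new ArrayList<>();
      for (int i = 0; i < cachedCount; i++) {
        result.add(cached);
      }
      return result;
    }

    public static void main(String[] args) {
      // deleted in snapshot 1, deleted again in snapshot 2, re-inserted in snapshot 3:
      // one delete and one insert cancel, leaving a single net DELETE
      List<Change> run = List.of(
          new Change(1, "a", "DELETE"),
          new Change(1, "a", "DELETE"),
          new Change(1, "a", "INSERT"));
      System.out.println(netChanges(run)); // [Change[id=1, value=a, type=DELETE]]
    }
  }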

Member


I think one iterator is fine

Contributor Author


Let me merge them.

Contributor Author


The code becomes entangled when I merge them; there are too many if-else clauses. We're better off leaving it as is.

@flyrain
Contributor Author

flyrain commented May 19, 2023

Thanks @RussellSpitzer @amogh-jahagirdar @anigos @aokolnychyi for the review. This is ready for another look.

@flyrain
Contributor Author

flyrain commented Jun 5, 2023

@RussellSpitzer , thanks a lot for the review. Ready for another look.

@flyrain
Contributor Author

flyrain commented Jun 15, 2023

cc @szehon-ho

}

private int[] generateIndicesToIdentifySameRow() {
int changeOrdinalIndex = rowType().fieldIndex(MetadataColumns.CHANGE_ORDINAL.name());
Member


I'm a bit lost on why this method is not the same as in RemoveCarryoverIterator. Are 'changeOrdinalIndex' and 'snapshotIdIndex' not in the rows of the other iterator?

Contributor Author

@flyrain flyrain Jun 22, 2023


The indices generated here are used to compare whether two records are the same. In RemoveCarryoverIterator, we consider two records different if their changeOrdinalIndex and/or snapshotIdIndex are not the same, while in RemoveNetCarryoverIterator we may consider two records the same even if these two columns differ. We need the snapshot boundary in RemoveCarryoverIterator since it works within a single snapshot only. For example, we cannot merge the following two rows in RemoveCarryoverIterator, while we can in RemoveNetCarryoverIterator:

(1, 'a', insert, 'snapshot-1')
(1, 'a', delete, 'snapshot-2')

Member


Can't we share the code by doing the same thing and making a set to identify the metadata column indices?

  private int[] generateIndicesToIdentifySameRow() {
    Set<Integer> metadataColumnIndices = Sets.newHashSet(
        rowType().fieldIndex(MetadataColumns.CHANGE_ORDINAL.name()),
        rowType().fieldIndex(MetadataColumns.COMMIT_SNAPSHOT_ID.name()),
        changeTypeIndex());
    return generateIndicesToIdentifySameRow(metadataColumnIndices);
  }

  private int[] generateIndicesToIdentifySameRow(Set<Integer> metadataColumnIndices) {
    int[] indices = new int[rowType().size() - metadataColumnIndices.size()];
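    // note: this indexing works because the metadata columns sit at the end of the
    // row schema, so the first (rowType().size() - metadataColumnIndices.size())
    // positions are all data columns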

    for (int i = 0, j = 0; i < indices.length; i++) {
      if (!metadataColumnIndices.contains(i)) {
        indices[j] = i;
        j++;
      }
    }
    return indices;
  }

From RemoveCarryoverIterator, the set would be only changeTypeIndex? Let me know if I missed something.

Contributor Author


Extracted the method out. Thanks for the suggestion.

@szehon-ho
Member

Also one more comment: we should make this change on 3.4 as well (the latest Spark version).

@flyrain
Contributor Author

flyrain commented Jun 22, 2023

Also one more comment: we should make this change on 3.4 as well (the latest Spark version).

Definitely, will add it to 3.4 right after this is merged.

@github-actions github-actions bot added the build label Jun 22, 2023
@flyrain
Contributor Author

flyrain commented Jun 22, 2023

Thanks a lot for the review, @szehon-ho! Resolved the comments, and it's ready for another look.

systemProp.defaultHiveVersions=2
systemProp.knownHiveVersions=2,3
systemProp.defaultSparkVersions=3.4
systemProp.defaultSparkVersions=3.4,3.3
Member


Unnecessary change.

Contributor Author


Will remove it.


nextRow = null;
} else {
// two rows with same change types means potential net changes
nextRow = null;
Member


I mean, let's move it out of the if/else. Even IntelliJ suggests "Common part can be extracted from 'if'".

return currentRow;
}

Row nextRow = rowIterator().next();
Member


Maybe my suggestion was not clear; here is what I mean:


  @Override
  public Row next() {
    // if there are cached rows, return one of them from the beginning
    if (cachedRowCount > 0) {
      cachedRowCount--;
      return cachedRow;
    }

    this.cachedRow = getCurrentRow();

    // return it directly if the current row is the last row
    if (!rowIterator().hasNext()) {
      return cachedRow;
    }

    this.cachedNextRow = rowIterator().next();
    cachedRowCount = 1;

    // pull rows from the iterator until two consecutive rows are different
    while (isSameRecord(cachedRow, cachedNextRow)) {
      if (oppositeChangeType(cachedRow, cachedNextRow)) {
        // two rows with opposite change types means no net changes, remove both
        cachedRowCount--;
      } else {
        // two rows with the same change type means potential net changes; count another cached row
        cachedRowCount++;
      }

      // stop pulling rows if the cache is empty or there are no more rows
      if (cachedRowCount <= 0 || !rowIterator().hasNext()) {
        this.cachedNextRow = null;
        break;
      }

      this.cachedNextRow = rowIterator().next();
    }

    return null;
  }

I think it will work to remove 'currentRow', but I am not sure whether I am being too aggressive in removing 'nextRow'. Please check if I made a mistake.


private boolean isSameRecord(Row currentRow, Row nextRow) {
for (int idx : indicesToIdentifySameRow) {
protected boolean isSameRecord(Row currentRow, Row nextRow) {
Member


Code-wise, I think it is a bit harder to read when RemoveNetCarryoverIterator extends RemoveCarryoverIterator. I was thinking we could have an extra base class instead; then it's easier to see which methods differ and which are shared. Not sure what you think. Right now it's a bit trickier to see that.
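
Roughly, the shape being suggested (illustrative only, with simplified types; not the Iceberg classes):

  import java.util.Iterator;

  // Both carryover iterators would extend a shared base rather than one another.
  // String stands in for Spark's Row to keep the sketch self-contained.
  abstract class BaseCarryoverIterator implements Iterator<String> {
    private final Iterator<String> rows;

    protected BaseCarryoverIterator(Iterator<String> rows) {
      this.rows = rows;
    }

    protected Iterator<String> rowIterator() {
      return rows;
    }

    // each subclass defines what "same record" means for its own scope
    protected abstract boolean isSameRecord(String left, String right);
  }

  // class RemoveCarryoverIterator extends BaseCarryoverIterator { ... }
  // class RemoveNetCarryoverIterator extends BaseCarryoverIterator { ... }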


@flyrain
Contributor Author

flyrain commented Jun 27, 2023

The test failure is unrelated. It failed in Spark 3.4, which is not touched by this PR:

TestStructuredStreamingRead3 > [catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] > testReadingStreamFromFutureTimetsamp[catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] FAILED
    org.opentest4j.AssertionFailedError: 
    Expecting value to be true but was false
        at [email protected]/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at [email protected]/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

@flyrain
Contributor Author

flyrain commented Jun 27, 2023

Thanks a lot for the review, @szehon-ho. Ready for another look.

Member

@szehon-ho szehon-ho left a comment


Hey looks good, had a few style nits

Row currentRow = currentRow();

if (currentRow.getString(changeTypeIndex()).equals(DELETE) && rowIterator().hasNext()) {
if (changeType(currentRow).equals(DELETE) && rowIterator().hasNext()) {
Member


Nit: can we reverse the order of equals to reduce the chance of an NPE? (And for the other places in the method.)

Contributor Author


That makes sense, but it sacrifices readability. I will add a null check here instead; we are going to fail if it is null.

  protected String changeType(Row row) {
    String changeType = row.getString(changeTypeIndex());
    Preconditions.checkNotNull(changeType, "Change type should not be null");
    return changeType;
  }

RemoveCarryoverIterator(Iterator<Row> rowIterator, StructType rowType) {
super(rowIterator, rowType);
this.indicesToIdentifySameRow = generateIndicesToIdentifySameRow(rowType.size());
this.rowType = rowType;
Member


Nit: if we pass the variable to the superclass, can we just store it in the superclass and get it via a super method? (I think it's cleaner if a subclass only has fields that only it knows about.)

Also, it looks like both subclasses do this.
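
A minimal sketch of the nit (not the actual Iceberg code): the base class owns the field and exposes a protected accessor, so subclasses don't need their own copy.

  import org.apache.spark.sql.types.StructType;

  abstract class BaseIteratorSketch {
    private final StructType rowType;

    protected BaseIteratorSketch(StructType rowType) {
      this.rowType = rowType;
    }

    // subclasses read the row type through this accessor instead of storing it again
    protected StructType rowType() {
      return rowType;
    }
  }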


// If the current row is a delete row, drain all identical delete rows
if (currentRow.getString(changeTypeIndex()).equals(DELETE) && rowIterator().hasNext()) {
if (changeType(currentRow).equals(DELETE) && rowIterator().hasNext()) {
Member


Same comment about changing the order (I guess it's from a previous change, but while we are here).

private final int[] indicesToIdentifySameRow;
private final StructType rowType;

private Row cachedNextRow = null;
Member


Nit: rely on the Java default.

@flyrain flyrain merged commit 1d88e80 into apache:master Jun 29, 2023
@flyrain
Contributor Author

flyrain commented Jun 29, 2023

Thanks @anigos @aokolnychyi @RussellSpitzer @szehon-ho for the review!

