
Spark: revert delete procedure #11084

Status: Open. Wants to merge 3 commits into base: main.
Conversation

bryanck (Contributor) commented Sep 5, 2024

This PR adds a new Spark procedure named revert_delete that adds back the files that were removed by a delete operation. It takes a table name and a snapshot ID as parameters. The snapshot ID must refer to a snapshot that was a delete operation, and the operation must not have already been reverted; otherwise an exception is thrown.

This procedure can be used to revert a delete operation that occurred in the past, after new snapshots have already been applied. In that situation a simple rollback would lose the data from the newer snapshots. For example, a maintenance job may delete records based on an invalid TTL setting, and the table owner might not discover this immediately, while new operations continue to be performed on the table after the delete.

Example Spark SQL syntax: call catalog.system.revert_delete('db.table', 123456)

One limitation is that merge-on-read deletes cannot be reverted. Another is that branches (other than main) are not currently supported, though this could be added easily in a follow-up if desired.
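As a sketch of the eligibility checks described above, the core validation might look like the following. This is illustrative only, not the PR's actual code: the class and method names are hypothetical, and the summary key mirrors Iceberg's "added-delete-files" snapshot summary property.

```java
import java.util.Map;

// Hypothetical sketch of the validation revert_delete performs before
// re-appending files. In the real procedure, the operation string and
// summary map would come from the Iceberg Snapshot API.
class RevertDeleteCheck {
    static void validate(String operation, Map<String, String> summary) {
        // Only snapshots produced by a delete operation can be reverted.
        if (!"delete".equals(operation)) {
            throw new IllegalArgumentException("Snapshot is not a delete operation");
        }
        // Merge-on-read deletes add delete files rather than removing data
        // files, so they cannot be reverted by re-appending data files.
        String addedDeleteFiles = summary.getOrDefault("added-delete-files", "0");
        if (!"0".equals(addedDeleteFiles)) {
            throw new IllegalArgumentException("Merge-on-read deletes cannot be reverted");
        }
    }
}
```

In the actual procedure these values would be read from the snapshot's operation() and summary() accessors, and the revert itself would append the removed files in a new snapshot.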

@github-actions github-actions bot added the spark label Sep 5, 2024
@bryanck bryanck force-pushed the revert-delete-proc branch 6 times, most recently from 58cb3b9 to 46436e0 Compare September 9, 2024 15:58
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** A procedure that adds back files to a table that were removed in a given delete operation. */
Review comment (Contributor):
I feel like we should have a more complete description here including the requirements (e.g. the snapshot must be a delete operation, it must only include data files, etc.) and how it works (appends the files to a new snapshot, which will have a new snapshot id).

table.refresh();
assertThat(spark.table(tableName).count()).isEqualTo(2);
assertThat(table.snapshots()).hasSize(4);
assertThat(table.currentSnapshot().operation()).isEqualTo(DataOperations.APPEND);
Review comment (Contributor):

For the data files that were re-added, does the sequence number stay the same? I think we should check that too so that dropping and adding files doesn't conflict with other positional deletes or equality deletes.
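The concern above follows from the Iceberg spec's sequence-number rules: an equality delete applies to data files with a strictly lower data sequence number, and a position delete applies to data files with a lower-or-equal data sequence number. A minimal stdlib-only illustration of those rules (not this PR's code):

```java
// Illustration of the Iceberg spec's delete applicability rules, showing
// why the sequence number of a re-added data file matters.
class DeleteApplicability {
    // Equality deletes apply to data files with a strictly lower
    // data sequence number.
    static boolean equalityDeleteApplies(long dataSeq, long deleteSeq) {
        return dataSeq < deleteSeq;
    }

    // Position deletes apply to data files with a lower-or-equal
    // data sequence number.
    static boolean positionDeleteApplies(long dataSeq, long deleteSeq) {
        return dataSeq <= deleteSeq;
    }
}
```

If a file originally written at sequence number 1 were re-appended at, say, 6, an equality delete committed at sequence number 3 would no longer apply to it, which could silently resurrect rows. That is why it is worth asserting that the re-added files keep their original sequence numbers.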


// Ensure we don't have deltas...
String addedDeleteFiles =
snapshot.summary().getOrDefault(SnapshotSummary.ADDED_DELETE_FILES_PROP, "0");
Review comment (Contributor):

Do we also want to check that there were no deleted positional/equality delete files? Allowing a restore in that case would possibly change the data of the snapshot, so probably best to exclude that type of delete snapshot as well.
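A sketch of the extra guard suggested here, following the same summary-lookup pattern as the snippet above. The "removed-delete-files" key is assumed to mirror Iceberg's SnapshotSummary removed-delete-files property; the helper itself is illustrative, not the PR's code.

```java
import java.util.Map;

// Illustrative guard: reject snapshots that removed positional/equality
// delete files, since re-appending only their data files could change
// the data of the snapshot being restored.
class RemovedDeleteFilesGuard {
    static boolean removedDeleteFiles(Map<String, String> summary) {
        // Assumed summary key, following Iceberg's snapshot summary naming.
        return !"0".equals(summary.getOrDefault("removed-delete-files", "0"));
    }
}
```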
