
Spark: revert delete procedure #11084

Status: Open. Wants to merge 3 commits into base: main.
Conversation

bryanck (Contributor) commented Sep 5, 2024

This PR adds a new Spark procedure named revert_delete that adds back the files that were removed by a delete operation. It takes a table name and a snapshot ID as parameters. The snapshot ID must refer to a snapshot that was a delete operation, and the operation must not have already been reverted; otherwise an exception is thrown.

This procedure can be used to revert a delete operation that occurred in the past, after new snapshots have already been applied. In that situation a simple rollback would lose the data from the newer snapshots. For example, a maintenance job may delete records based on an invalid TTL setting, and the table owner might not discover this immediately, while new operations continue to be performed on the table after the delete.

Example Spark SQL syntax: call catalog.system.revert_delete('db.table', 123456)

One limitation is that merge-on-read deletes cannot be reverted. Another is that branches (other than main) are not currently supported, though this could be added easily in a follow-up if desired.
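As a sketch of the eligibility checks described above, the core validation might look like the following. This is illustrative only, not the PR's actual code: the class and method names are hypothetical, and the summary key mirrors Iceberg's "added-delete-files" snapshot summary property.

```java
import java.util.Map;

// Hypothetical sketch of the validation revert_delete performs before
// re-appending files. In the real procedure, the operation string and
// summary map would come from the Iceberg Snapshot API.
class RevertDeleteCheck {
    static void validate(String operation, Map<String, String> summary) {
        // Only snapshots produced by a delete operation can be reverted.
        if (!"delete".equals(operation)) {
            throw new IllegalArgumentException("Snapshot is not a delete operation");
        }
        // Merge-on-read deletes add delete files rather than removing data
        // files, so they cannot be reverted by re-appending data files.
        String addedDeleteFiles = summary.getOrDefault("added-delete-files", "0");
        if (!"0".equals(addedDeleteFiles)) {
            throw new IllegalArgumentException("Merge-on-read deletes cannot be reverted");
        }
    }
}
```

In the actual procedure these values would be read from the snapshot's operation() and summary() accessors, and the revert itself would append the removed files in a new snapshot.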

@github-actions github-actions bot added the spark label Sep 5, 2024
@bryanck bryanck force-pushed the revert-delete-proc branch 6 times, most recently from 58cb3b9 to 46436e0 Compare September 9, 2024 15:58
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** A procedure that adds back files to a table that were removed in a given delete operation. */
Review comment (Contributor):
I feel like we should have a more complete description here including the requirements (e.g. the snapshot must be a delete operation, it must only include data files, etc.) and how it works (appends the files to a new snapshot, which will have a new snapshot id).

table.refresh();
assertThat(spark.table(tableName).count()).isEqualTo(2);
assertThat(table.snapshots()).hasSize(4);
assertThat(table.currentSnapshot().operation()).isEqualTo(DataOperations.APPEND);
Review comment (Contributor):

For the data files that were re-added, does the sequence number stay the same? I think we should check that too so that dropping and adding files doesn't conflict with other positional deletes or equality deletes.
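The concern above follows from the Iceberg spec's sequence-number rules: an equality delete applies to data files with a strictly lower data sequence number, and a position delete applies to data files with a lower-or-equal data sequence number. A minimal stdlib-only illustration of those rules (not this PR's code):

```java
// Illustration of the Iceberg spec's delete applicability rules, showing
// why the sequence number of a re-added data file matters.
class DeleteApplicability {
    // Equality deletes apply to data files with a strictly lower
    // data sequence number.
    static boolean equalityDeleteApplies(long dataSeq, long deleteSeq) {
        return dataSeq < deleteSeq;
    }

    // Position deletes apply to data files with a lower-or-equal
    // data sequence number.
    static boolean positionDeleteApplies(long dataSeq, long deleteSeq) {
        return dataSeq <= deleteSeq;
    }
}
```

If a file originally written at sequence number 1 were re-appended at, say, 6, an equality delete committed at sequence number 3 would no longer apply to it, which could silently resurrect rows. That is why it is worth asserting that the re-added files keep their original sequence numbers.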


// Ensure we don't have deltas...
String addedDeleteFiles =
snapshot.summary().getOrDefault(SnapshotSummary.ADDED_DELETE_FILES_PROP, "0");
Review comment (Contributor):

Do we also want to check that there were no deleted positional/equality delete files? Allowing a restore in that case would possibly change the data of the snapshot, so probably best to exclude that type of delete snapshot as well.
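A sketch of the extra guard suggested here, following the same summary-lookup pattern as the snippet above. The "removed-delete-files" key is assumed to mirror Iceberg's SnapshotSummary removed-delete-files property; the helper itself is illustrative, not the PR's code.

```java
import java.util.Map;

// Illustrative guard: reject snapshots that removed positional/equality
// delete files, since re-appending only their data files could change
// the data of the snapshot being restored.
class RemovedDeleteFilesGuard {
    static boolean removedDeleteFiles(Map<String, String> summary) {
        // Assumed summary key, following Iceberg's snapshot summary naming.
        return !"0".equals(summary.getOrDefault("removed-delete-files", "0"));
    }
}
```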
