Spark: revert delete procedure #11084
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** A procedure that adds back files to a table that were removed in a given delete operation. */
I feel like we should have a more complete description here including the requirements (e.g. the snapshot must be a delete operation, it must only include data files, etc.) and how it works (appends the files to a new snapshot, which will have a new snapshot id).
table.refresh();
assertThat(spark.table(tableName).count()).isEqualTo(2);
assertThat(table.snapshots()).hasSize(4);
assertThat(table.currentSnapshot().operation()).isEqualTo(DataOperations.APPEND);
For the data files that were re-added, does the sequence number stay the same? I think we should check that too so that dropping and adding files doesn't conflict with other positional deletes or equality deletes.

// Ensure we don't have deltas...
String addedDeleteFiles =
    snapshot.summary().getOrDefault(SnapshotSummary.ADDED_DELETE_FILES_PROP, "0");
Do we also want to check that there were no deleted positional/equality delete files? Allowing a restore in that case would possibly change the data of the snapshot, so probably best to exclude that type of delete snapshot as well.
This PR adds a new Spark procedure named revert_delete that can be used to add back files that were removed by a delete operation. It takes a table name and a snapshot ID as parameters. The snapshot ID must refer to a snapshot that was a delete operation; otherwise an exception is thrown. An exception is also thrown if the operation was previously reverted.
This procedure can be used to revert a delete operation that occurred in the past, after new snapshots have already been applied. In that situation, a simple rollback would lose the data from the newer snapshots. For example, a maintenance operation may delete records based on an invalid TTL setting, and the table owner might not discover this immediately, so new operations continue to be performed on the table after the delete.
Example Spark SQL syntax:
call catalog.system.revert_delete('db.table', 123456)
One limitation is that merge-on-read deletes cannot be reverted. Another is that branches (other than main) are not currently supported, though this could be added easily in a follow-up if desired.
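To make the eligibility rules discussed in this PR concrete, here is a minimal, self-contained sketch of the kind of check involved. It is not the PR's implementation: the class name `RevertDeleteCheck` and the method `rejectionReason` are hypothetical, the snapshot is modelled as a plain operation string plus a summary map rather than an Iceberg `Snapshot`, and the summary keys `added-delete-files` and `removed-delete-files` are assumed to match the values behind the `SnapshotSummary` constants.

```java
import java.util.Map;

// Hypothetical helper, not part of the PR: models the checks on a plain
// operation string and summary map instead of an Iceberg Snapshot.
class RevertDeleteCheck {
    // Assumed to match the Iceberg snapshot summary property names.
    static final String ADDED_DELETE_FILES = "added-delete-files";
    static final String REMOVED_DELETE_FILES = "removed-delete-files";

    /** Returns null if the snapshot looks revertable, otherwise a reason for rejecting it. */
    static String rejectionReason(String operation, Map<String, String> summary) {
        // Only snapshots produced by a delete operation qualify.
        if (!"delete".equals(operation)) {
            return "snapshot is not a delete operation";
        }
        // Merge-on-read deletes add delete files instead of removing data files.
        if (Long.parseLong(summary.getOrDefault(ADDED_DELETE_FILES, "0")) > 0) {
            return "snapshot added delete files";
        }
        // Removed delete files would change the data visible after a restore.
        if (Long.parseLong(summary.getOrDefault(REMOVED_DELETE_FILES, "0")) > 0) {
            return "snapshot removed delete files";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(rejectionReason("delete", Map.of("deleted-data-files", "3")));
        System.out.println(rejectionReason("delete", Map.of(ADDED_DELETE_FILES, "1")));
        System.out.println(rejectionReason("append", Map.of()));
    }
}
```

Returning a reason string rather than a boolean keeps the sketch close to the described behaviour, where an ineligible snapshot results in an exception with an explanatory message.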