[Spark] Streaming Source - Skip OVERWRITE snapshot operations when streaming-skip-delete-snapshots is true #3267
This PR adds skipping of `OVERWRITE` snapshots to the Spark streaming source when `streaming-skip-delete-snapshots` is set to true, along with a unit test (to show the behavior and pair with the unit test for the default case, which sets the flag to false). This closes #3265.
Detailed Explanation & a Motivating Use Case
I've recently started investigating what would be needed for a true streaming CDC source, possibly to be turned into a proposal to update the streaming sources to handle deletions and so on (which could arguably apply to the Flink streaming source as well). Depending on my upcoming workload, one of my colleagues at Apple may take it over, or I may collaborate with them on it later. I'm happy to share findings with anybody, though I would need access to some of the notes I wrote previously, as I've since moved on from Apple.
The Spark MicroBatch streaming source can currently produce a "correct" stream only from snapshots that do not mutate or delete any existing rows.
This means that it can presently handle two types of snapshots:
- `DataOperations.APPEND`: new data files are added to the table.
- `DataOperations.REPLACE`: files are replaced without changing the row-level data in the table (e.g., data file rewrites).

Users can choose to skip "delete"-type snapshots via the read option `streaming-skip-delete-snapshots`, which simply skips a given snapshot if it potentially contains a delete. This PR updates the streaming source to also allow skipping snapshots that contain any other kind of row-level data mutation: `OVERWRITE` snapshots are a form of delete (as well as insert), so one can argue they should be skippable if users choose to skip deletes.
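For illustration, here is how such a stream might be configured from the user's side. This is a minimal sketch; `db.events` is a placeholder table name, while the option itself is the existing Iceberg Spark read option that this PR extends to cover `OVERWRITE` snapshots:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();

// With this change, OVERWRITE snapshots are skipped alongside DELETE
// snapshots when the option below is set. "db.events" is a placeholder.
Dataset<Row> events = spark.readStream()
    .format("iceberg")
    .option("streaming-skip-delete-snapshots", "true")
    .load("db.events");
```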
Consider the case where users simply want Spark to give them an Iceberg stream of a table they treat as append-only (which is more or less what they can do now), but the data is then reprocessed for maintenance as part of a batch operation, for example to change its compression codec or file format. There would be no easy way to skip these out-of-band, one-time `OVERWRITE` operations committed to the table (which are more akin to `REPLACE` operations if only the compression or file format changes, in that no row-level data would actually change). Changing a table's compression for past data is not at all unheard of.

However, `OVERWRITE` snapshots can add data (as opposed to snapshots with a true `DELETE` operation). While it would potentially be possible to read only the added data files and still skip the deletes in a limited set of cases, given that users are already skipping deletes, it seems fair to let them skip mixed appends and deletes as well.
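To make the intended behavior concrete, here is a rough sketch of the kind of snapshot filtering this implies. The helper's name and shape are hypothetical and not the actual PR diff; `DataOperations` and `Snapshot` are the real Iceberg classes:

```java
import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;

// Hypothetical helper, for illustration only: decide whether a snapshot can
// be turned into a micro-batch, given the skip-delete setting.
static boolean shouldProcess(Snapshot snapshot, boolean skipDeleteSnapshots) {
  String op = snapshot.operation();
  if (DataOperations.APPEND.equals(op)) {
    return true;   // only adds data files; always safe to stream
  }
  if (DataOperations.REPLACE.equals(op)) {
    return false;  // file rewrites change no row-level data; nothing to emit
  }
  // DELETE removes rows; OVERWRITE both removes and adds rows. With this
  // change, both are skipped when the user has opted in.
  if (skipDeleteSnapshots) {
    return false;
  }
  throw new IllegalStateException(String.format(
      "Cannot process %s snapshot %d; set streaming-skip-delete-snapshots=true to skip it",
      op, snapshot.snapshotId()));
}
```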
We could also introduce another read option to skip `OVERWRITE` snapshots, giving users a more granular choice and reducing the risk of silently skipping important table updates; a sketch of what that might look like follows.
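For illustration only; the second option name below is invented for this sketch, and only `streaming-skip-delete-snapshots` exists at the time of this PR:

```java
// Hypothetical, more granular configuration (the overwrite option name is
// invented for illustration and does not exist yet):
Dataset<Row> df = spark.readStream()
    .format("iceberg")
    .option("streaming-skip-delete-snapshots", "true")     // skip pure DELETE snapshots
    .option("streaming-skip-overwrite-snapshots", "true")  // separately opt in to skipping OVERWRITEs
    .load("db.events");
```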
I'd personally rather that any time spent implementing something more complex than an additional flag go into refactoring the Spark MicroBatch stream to truly produce CDC data (something I have begun investigating an API for, while researching existing APIs in similar systems, given that CDC is not supported natively in Spark). But I'm very much open to discussion on this.

When we refactor the Spark streaming source to handle deletions, we will of course make sure to handle commits that both delete and add data at the same time, which is why I think this more explicit addition is good enough for now.
If we don't want to take this relatively simplistic approach, then at the very least a test should be added indicating the intended behavior when `streaming-skip-delete-snapshots` is true, since there is already a test showing that `OVERWRITE` snapshots will fail an Iceberg Spark streaming source when the option is unset or set to false (its default value).
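As a sketch of what that test could assert (the `table`, `FILE_A`/`FILE_B` fixtures and the `processAvailable`/`rowsOf` helpers are assumed here for illustration; this is not the PR's actual test code):

```java
@Test
public void testOverwriteSnapshotsSkippedWhenOptionIsTrue() throws Exception {
  table.newAppend().appendFile(FILE_A).commit();                    // APPEND snapshot
  table.newOverwrite().deleteFile(FILE_A).addFile(FILE_B).commit(); // OVERWRITE snapshot

  Dataset<Row> stream = spark.readStream()
      .format("iceberg")
      .option("streaming-skip-delete-snapshots", "true")
      .load(table.location());

  // processAvailable is an assumed helper that runs the stream until no
  // more data is available and collects the emitted rows.
  List<Row> rows = processAvailable(stream);

  // Only rows from the APPEND snapshot should be emitted; the OVERWRITE
  // snapshot is skipped entirely instead of failing the stream.
  Assert.assertEquals(rowsOf(FILE_A), rows);
}
```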