Skip processing snapshots of type Overwrite during readStream #3517
Merged: rdblue merged 10 commits into apache:master from SreeramGarlapati:skip.overwrite.snapshots on Jan 28, 2022
Commits:

- 6f81361 Skip processing snapshots of type Overwrite during readStream (SreeramGarlapati)
- 5573942 Skip processing snapshots of type Overwrite during readStream (SreeramGarlapati)
- cf3ceff Add streaming-skip-overwrite-snapshots SparkReadOptions (rajarshisarkar)
- cac4b49 Add Unit Test for streaming-skip-overwrite-snapshots SparkReadOptions (rajarshisarkar)
- ad5bb27 Merge branch 'master' into skip.overwrite.snapshots (rajarshisarkar)
- 6ac8b28 Merge branch 'master' into skip.overwrite.snapshots (rajarshisarkar)
- 040010f Resolved merge conflicts (rajarshisarkar)
- cacc07f Address review comments (rajarshisarkar)
- b9e76b3 Merge branch 'master' into HEAD (rajarshisarkar)
- bc2a386 Update SparkMicroBatchStream.java (rdblue)
Changes to SparkMicroBatchStream.java:

```diff
@@ -77,6 +77,7 @@ public class SparkMicroBatchStream implements MicroBatchStream {
   private final boolean localityPreferred;
   private final StreamingOffset initialOffset;
   private final boolean skipDelete;
+  private final boolean skipOverwrite;
   private final Long fromTimestamp;

   SparkMicroBatchStream(JavaSparkContext sparkContext, Table table, SparkReadConf readConf, boolean caseSensitive,
```

```diff
@@ -95,6 +96,7 @@ public class SparkMicroBatchStream implements MicroBatchStream {
     this.initialOffset = initialOffsetStore.initialOffset();

     this.skipDelete = readConf.streamingSkipDeleteSnapshots();
+    this.skipOverwrite = readConf.streamingSkipOverwriteSnapshots();
   }

   @Override
```

```diff
@@ -200,13 +202,26 @@ private List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffse
   private boolean shouldProcess(Snapshot snapshot) {
     String op = snapshot.operation();
-    Preconditions.checkState(!op.equals(DataOperations.DELETE) || skipDelete,
-        "Cannot process delete snapshot: %s, to ignore deletes, set %s=true.",
-        snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);
-    Preconditions.checkState(
-        op.equals(DataOperations.DELETE) || op.equals(DataOperations.APPEND) || op.equals(DataOperations.REPLACE),
-        "Cannot process %s snapshot: %s", op.toLowerCase(Locale.ROOT), snapshot.snapshotId());
-    return op.equals(DataOperations.APPEND);
+    switch (op) {
+      case DataOperations.APPEND:
+        return true;
+      case DataOperations.REPLACE:
+        return false;
+      case DataOperations.DELETE:
+        Preconditions.checkState(skipDelete,
+            "Cannot process delete snapshot: %s, to ignore deletes, set %s=true",
+            snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_DELETE_SNAPSHOTS);
+        return false;
+      case DataOperations.OVERWRITE:
+        Preconditions.checkState(skipOverwrite,
+            "Cannot process overwrite snapshot: %s, to ignore overwrites, set %s=true",
+            snapshot.snapshotId(), SparkReadOptions.STREAMING_SKIP_OVERWRITE_SNAPSHOTS);
+        return false;
+      default:
+        throw new IllegalStateException(String.format(
+            "Cannot process unknown snapshot operation: %s (snapshot id %s)",
+            op.toLowerCase(Locale.ROOT), snapshot.snapshotId()));
+    }
   }

   private static StreamingOffset determineStartingOffset(Table table, Long fromTimestamp) {
```

Member: Should add a test for this new conf option

Contributor: I have added the Unit Test.
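The new `shouldProcess` dispatch reduces to a small table of operation → decision. A self-contained sketch of the same logic follows; the class name, plain-string constants, and `long` snapshot id are simplified stand-ins for Iceberg's `DataOperations` and `Snapshot` types, not the real API:

```java
// Standalone sketch of the shouldProcess() dispatch in the diff above.
// The string values mirror Iceberg's DataOperations constants.
public class SnapshotFilter {
  static final String APPEND = "append";
  static final String REPLACE = "replace";
  static final String DELETE = "delete";
  static final String OVERWRITE = "overwrite";

  private final boolean skipDelete;
  private final boolean skipOverwrite;

  public SnapshotFilter(boolean skipDelete, boolean skipOverwrite) {
    this.skipDelete = skipDelete;
    this.skipOverwrite = skipOverwrite;
  }

  /** Returns true if the snapshot's files should be planned into the micro-batch. */
  public boolean shouldProcess(String op, long snapshotId) {
    switch (op) {
      case APPEND:
        return true;  // only appends contribute new rows to the stream
      case REPLACE:
        return false; // e.g. compaction rewrites: no logical data change, skipped silently
      case DELETE:
        if (!skipDelete) {
          throw new IllegalStateException("Cannot process delete snapshot: " + snapshotId);
        }
        return false; // skipped only with explicit opt-in
      case OVERWRITE:
        if (!skipOverwrite) {
          throw new IllegalStateException("Cannot process overwrite snapshot: " + snapshotId);
        }
        return false; // skipped only with explicit opt-in
      default:
        throw new IllegalStateException("Cannot process unknown snapshot operation: " + op);
    }
  }
}
```

The key behavioral difference from the old code: delete and overwrite snapshots fail the stream unless the corresponding skip flag is set, while replaces are always skipped without any opt-in.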
I don't think that this should use the same configuration to skip deletes and overwrites. Overwrites are different and I think that we should at a minimum have a different property. I would also prefer to have some additional clarity on how we plan to eventually handle this. We could skip overwrites, but what about use cases where they are probably upserts? What about when they're created by copy-on-write MERGE operations?
+1. I would prefer they be two separate configs, but also that we have a plan for the longer term to handle sending out these row deltas.

I'd be ok with getting a PR in to ignore OVERWRITE, but this isn't something we should ignore in the longer term (or even really the near-to-medium term) as others have mentioned.

Personally I would consider using a schema similar to the delta.io change capture feed, which has a dataframe with the before image / after image (the row before and after the update) and the type of operation for each row (insert, delete, update_before, update_after).
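The change-feed shape described above can be sketched as a row type carrying an explicit operation column, with an update emitted as a before-image / after-image pair. All names below (`ChangeFeedSketch`, `ChangeRow`, `upsert`) are illustrative assumptions, not an existing Iceberg or Delta API:

```java
import java.util.List;

// Illustrative sketch of a change-capture row: data columns plus an
// operation-type column, in the shape the comment above describes.
public class ChangeFeedSketch {
  enum ChangeType { INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER }

  static class ChangeRow {
    final String key;
    final String value;
    final ChangeType type;

    ChangeRow(String key, String value, ChangeType type) {
      this.key = key;
      this.value = value;
      this.type = type;
    }
  }

  /** Change-feed rows produced by upserting (key, newValue); oldValue may be null. */
  static List<ChangeRow> upsert(String key, String oldValue, String newValue) {
    if (oldValue == null) {
      // No prior row with this key: a plain insert, one feed row.
      return List.of(new ChangeRow(key, newValue, ChangeType.INSERT));
    }
    // A prior row existed: emit its before image and the new after image.
    return List.of(
        new ChangeRow(key, oldValue, ChangeType.UPDATE_BEFORE),
        new ChangeRow(key, newValue, ChangeType.UPDATE_AFTER));
  }
}
```

A downstream consumer could then reconstruct the table state, or react only to `UPDATE_AFTER` rows, without replaying untouched data.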
I connected with @SreeramGarlapati to contribute on this PR.
I have added a separate config to skip overwrites. I will discuss with @SreeramGarlapati and will update on the plan to eventually handle upserts.
all in all, there are 2 options for reading upserts:

- copy on write: a new data file is created which has a combination of both old rows and these new updated rows. So, in this case, we can take a spark option from the user to take consent that they are okay with data replay.
- merge on read: we will expose an option to read the change data feed, where we will include a metadata column which indicates whether a record is an INSERT vs DELETE.

did this make sense - @rdblue & @kbendick
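The data-replay concern in the copy-on-write option can be made concrete with toy data (all row values below are hypothetical):

```java
import java.util.List;

// Toy illustration of why streaming a copy-on-write overwrite replays data:
// updating one row rewrites the whole data file, so the overwrite snapshot's
// new file carries the untouched rows again alongside the updated one.
public class CopyOnWriteReplay {
  /** Rows of the rewritten file that are unchanged copies of original rows. */
  static long replayedCount(List<String> original, List<String> rewritten) {
    return rewritten.stream().filter(original::contains).count();
  }

  public static void main(String[] args) {
    List<String> originalFile = List.of("row1=a", "row2=b", "row3=c");
    // Upsert of row2 under copy-on-write rewrites the data file in full:
    List<String> rewrittenFile = List.of("row1=a", "row2=B", "row3=c");
    // A streaming reader processing this overwrite snapshot would re-emit
    // row1 and row3 even though only row2 logically changed.
    System.out.println("unchanged rows re-emitted: "
        + replayedCount(originalFile, rewrittenFile)); // prints 2
  }
}
```

This is why the copy-on-write path needs an explicit user opt-in, while the merge-on-read path avoids the replay by describing each change instead of re-shipping whole files.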
copy on write-- a new data file is created which has a combination of both old rows and these new updated rows. So, in this case - we can take a spark option from the user to take consent - that they are okay with data replay.merge on read- we will expose an option to read change data feed - where we will include a metadata column - which indicates whether a record is an INSERT vs DELETE.did this make sense - @rdblue & @kbendick