-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-26425][SS] Add more constraint checks to avoid checkpoint corruption #25965
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm also feeling that commitLog can be also guarded for same, but want to be sure before updating the patch. spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala Lines 566 to 572 in 8167714
|
|
Test build #111570 has finished for PR 25965 at commit
|
|
@tdas @zsxwing @jose-torres @gaborgsomogyi Kindly reminder. |
|
retest this, please |
|
Test build #116587 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #121235 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #122134 has finished for PR 25965 at commit
|
|
Test build #122934 has finished for PR 25965 at commit
|
… to avoid checkpoint corruption
|
Rebased - the part of this PR was resolved via SPARK-30915. I've updated to address commit log instead. |
|
Test build #123147 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #123157 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #123211 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #123221 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #123992 has finished for PR 25965 at commit
|
Known flaky test. |
|
retest this, please |
|
Test build #124003 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #125731 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #127538 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #127553 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #127561 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #128679 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #128704 has finished for PR 25965 at commit
|
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. This change is simple and straightforward.
| watermarkTracker.updateWatermark(lastExecution.executedPlan) | ||
| commitLog.add(currentBatchId, CommitMetadata(watermarkTracker.currentWatermark)) | ||
| assert(commitLog.add(currentBatchId, CommitMetadata(watermarkTracker.currentWatermark)), | ||
| s"Concurrent update to the log. Multiple streaming jobs detected for $currentBatchId") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update to the log -> update to the commit log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Addressed.
|
Test build #128745 has finished for PR 25965 at commit
|
|
retest this, please |
|
Test build #128753 has finished for PR 25965 at commit
|
|
Again, this is simple and straightforward change. Tests are passed. If no objection, I think we can move forward to merge this. |
|
Thanks for reviewing. This PR didn't have any valid review comments in 1 year despite of mentioning, so I don't expect more reviews. I'll go ahead merging. |
|
Merged into master branch. |
What changes were proposed in this pull request?
Credits to @tdas who reported and described the fix to SPARK-26425. I just followed the description of the issue.
This patch adds more checks on commit log as well as file streaming source so that multiple concurrent runs of streaming query don't mess up the status of query/checkpoint. This patch addresses two different spots which are having a bit different issues:
In structured streaming, we don't allow multiple streaming queries to run with same checkpoint (including concurrent runs of same query), so query should fail if it fails to write the metadata of specific batch ID due to same batch ID being written by others.
As described in JIRA issue, assertion is already applied to the
offsetLogfor the same reason.spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala
Lines 394 to 402 in 8167714
This patch applied the same for commit log.
Why are the changes needed?
This prevents the inconsistent behavior on streaming query and lets query fail instead.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
N/A, as the change is simple and obvious, and it's really hard to artificially reproduce the issue.