[release-15.0] [Backport]: VReplication: Prevent Orphaned VDiff2 Jobs (#11768) #11943
Merged
mattlord merged 1 commit into vitessio:release-15.0 on Dec 13, 2022
* Prevent orphaned VDiffs in two ways...
  1. When opening the engine, restart any vdiffs that are in the started state as this indicates it did not complete and was unable to save the final state and must be restarted.
  2. When a vdiff run fails, retry saving the error state with an exponential backoff until the engine shuts down.
  This way the normal retry mechanism will kick in OR #1 will kick in when the engine is next opened on the primary tablet.
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Handle failures before vdiff_table records are created
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Add more ephemeral client errors
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Show vdiff state of error even if no vdiff_table records
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Minor cleanup
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Add vdiff2 unit tests
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Add unit test for retry
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Small cleanup
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Addressing review comments and other improvements
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Use warning log for ... warnings :-)
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* Minor touch ups
  Signed-off-by: Matt Lord <mattalord@gmail.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>
Member
To expand on the reasoning for the backport: we've been running VDiff2 on a fairly large dataset (10+ TB) and have seen tremendous reductions in diff execution time compared to VDiff1. Unfortunately, reliability was well below our expectations, partly due to the bugs fixed by these changes. Backporting this will make it easier for other users with data sizes similar to ours to actually test VDiff2.
rohit-nayak-ps
approved these changes
Dec 12, 2022
arthurschreiber
approved these changes
Dec 13, 2022
Description
This addresses two ways a VDiff could become "orphaned" and require a manual Stop and Resume step (the commit message above describes both fixes).

We also begin to build out the unit testing framework for VDiff2 here and use that to test the related behavior.
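As a rough illustration of the restart-on-open fix described in the commit message (a vdiff still in the started state when the engine opens is taken to mean it never saved a final state), here is a minimal sketch. The `vdiffRecord` type, the `orphanedOnOpen` function, and the "completed" state name are hypothetical and are not taken from the Vitess source; only the "started" and "error" states are mentioned in this PR.

```go
package main

import "fmt"

// vdiffRecord is a hypothetical stand-in for a persisted vdiff row.
type vdiffRecord struct {
	ID    int
	State string
}

// orphanedOnOpen (hypothetical) returns the IDs of vdiffs that should be
// restarted when the engine opens: those still marked "started", which
// indicates the previous run ended without recording a final state.
func orphanedOnOpen(records []vdiffRecord) []int {
	var ids []int
	for _, r := range records {
		if r.State == "started" {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

func main() {
	records := []vdiffRecord{
		{ID: 1, State: "started"},
		{ID: 2, State: "completed"}, // assumed state name
		{ID: 3, State: "error"},
		{ID: 4, State: "started"},
	}
	fmt.Println(orphanedOnOpen(records))
}
```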
Manual Tests
Failure before the vdiff fully initializes
Test:
Results:
Failure after vdiff initialization
Test:
Results:
Related Issue(s)
Checklist