VDiff: wait for workflow lock with exponential backoff#18998
Conversation
Signed-off-by: Matt Lord <mattalord@gmail.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #18998 +/- ##
==========================================
- Coverage 69.89% 69.89% -0.01%
==========================================
Files 1612 1612
Lines 215826 215847 +21
==========================================
+ Hits 150857 150862 +5
- Misses 64969 64985 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Matt Lord <mattalord@gmail.com>
5503604 to
b1100bb
Compare
|
📝 Documentation updates detected! New suggestion: Add changelog entry for VDiff exponential backoff lock wait |
mhamza15
left a comment
There was a problem hiding this comment.
Nice! Would some jitter be helpful here as well?
nickvanw
left a comment
There was a problem hiding this comment.
I do think we should add some jitter, but this is better than the status quo even without it.
| if retryDelay < maxRetryDelay { | ||
| retryDelay = min(time.Duration(float64(retryDelay)*backoffFactor), maxRetryDelay) | ||
| } |
There was a problem hiding this comment.
I think it'd probably be a good idea to add a jitter factor here, something like:
jitter := time.Duration(rand.Int63n(int64(retryDelay / 4)))
retryDelay = min(time.Duration(float64(retryDelay)*backoffFactor)+jitter, maxRetryDelay)You could also apply it to the time.After and do something like
case <-time.After(retryDelay + time.Duration(rand.Int63n(int64(retryDelay/4)))):Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
|
📝 Documentation updates detected! Updated existing suggestion: Add changelog entry for VDiff exponential backoff lock wait |
|
📝 Documentation updates detected! New suggestion: Add changelog entry for VDiff exponential backoff lock wait |
Description
Workflow named locks are used to coordinate modifications to VReplication workflows (first added in: #16260).
When VDiff initializes a table diff, it performs the following steps on each target shard primary tablet:
- Gets a READ table level lock on the table we're diffing to prevent changes
- Initializes a consistent read view with
START TRANSACTION WITH CONSISTENT SNAPSHOT- Gets a the current GTID snapshot/position (
SELECT @@global.gtid_executed)- We release the table level lock
- Now we have a consistent read for the given table at a logical position that we can mirror on the target
stop _posset so that we replicate the state (GTIDs) from the source shard until we are at the same logical position that represents the same state we have in the source tablet in our consistent read view--filtered-replication-wait-time(default is 30s) for the workflow stream to catch up and stopSTART TRANSACTION WITH CONSISTENT SNAPSHOTcommandstop_posand the "for vdiff" message so that the shard's workflow stream is once again running normallyDuring this initialization phase, the table differ takes a lock on the workflow to prevent concurrent updates by other VDiffs and other VReplication commands. It holds this lock throughout these initialization steps as it is manipulating the workflow.
When a VDiff starts — whether from
create,resume, or automatic resume on error (which happens every 30 seconds) — every shard tries to start running and each primary tablet in the given shard tries to take the workflow lock in order to perform the initialization steps. There is a timeout of 45 seconds by default (--lock-timeout) when attempting to acquire this lock. When you have hundreds of shards and the replication catch up step (3 and 4 above) takes some seconds... you can easily hit the lock wait timeout on a high number of shards. This then causes the vdiff to be retried 30 seconds later (as we treat the lock error as any other error that is retryable), on all shards. So once again, you have hundreds of shards trying to initialize at the same time and competing for the lock. This can lead to much longer than necessary VDiff execution times in large busy keyspaces as we slowly start the VDiff on each of the shards. This was exacerbated by the fact that we were timing out the table diff query at the--vreplication-copy-phase-durationlimit (defaults to 1h) which meant that we had to perform the initialization step all over again on the shards (this was addressed in: #19044).In this PR we address this by NOT treating the workflow lock wait timeout as any other kind of retryable error, and instead retrying to acquire the lock indefinitely, with an exponential backoff, until we are able to acquire it. This prevents the repeated 30 second delay and the repeated thundering herd that was always coming to get the lock when the VDiff was retried/resumed.
Related Issue(s)
Checklist
AI Disclosure
My man Claude helped write the bulk of the unit tests.