kvserver: remove changed replicas in purgatory from replica set #114365
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 9f562f4 to a1fa0fb
Force-pushed from b0c9a35 to 2fbf672
Force-pushed from 2fbf672 to c8060ec
Force-pushed from 3e6646a to bb2bacd
nvb left a comment
Nice job tracking this down. Sometimes the hardest bugs to diagnose require the fewest lines of code to resolve.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)
pkg/kv/kvserver/queue.go line 1276 at r1 (raw file):
    if err != nil || item.replicaID != repl.ReplicaID() {
        bq.mu.Lock()
        bq.removeFromReplicaSetLocked(item.rangeID)
This is now mostly symmetrical with the handling of replica ID changes in baseQueue.pop, which provides confidence in this approach.
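For readers following along without the full diff, below is a minimal, self-contained sketch of the pattern under discussion. Only the guard condition and the removeFromReplicaSetLocked call mirror the quoted lines; the types, the unlock, and the surrounding loop are simplified assumptions rather than the actual CockroachDB code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Simplified stand-ins for the real kvserver types.
type rangeID int64

type replica struct {
	replicaID int32
}

type purgatoryItem struct {
	rangeID   rangeID
	replicaID int32
}

type baseQueue struct {
	mu struct {
		sync.Mutex
		replicas map[rangeID]struct{} // ranges the queue currently tracks
	}
	purgatory  []purgatoryItem
	getReplica func(rangeID) (*replica, error)
}

// removeFromReplicaSetLocked forgets the range so that a later maybeAdd can
// enqueue it again. bq.mu must be held.
func (bq *baseQueue) removeFromReplicaSetLocked(id rangeID) {
	delete(bq.mu.replicas, id)
}

// processPurgatory re-checks each purgatory item. The branch below mirrors the
// quoted change: a destroyed replica or a changed replica ID must also drop
// the replica-set entry, otherwise the range stays "processing" forever.
func (bq *baseQueue) processPurgatory() {
	for _, item := range bq.purgatory {
		repl, err := bq.getReplica(item.rangeID)
		if err != nil || item.replicaID != repl.replicaID {
			bq.mu.Lock()
			bq.removeFromReplicaSetLocked(item.rangeID)
			bq.mu.Unlock()
			continue
		}
		fmt.Printf("processing r%d\n", item.rangeID)
	}
	bq.purgatory = nil
}

func main() {
	bq := &baseQueue{}
	bq.mu.replicas = map[rangeID]struct{}{1: {}}
	bq.purgatory = []purgatoryItem{{rangeID: 1, replicaID: 2}}
	bq.getReplica = func(rangeID) (*replica, error) { return nil, errors.New("replica destroyed") }
	bq.processPurgatory()
	fmt.Println("still tracked after purgatory:", len(bq.mu.replicas)) // 0 with the fix in place
}
```

The analogous cleanup performed when a replica ID change is noticed at pop time is the symmetry referred to above.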
pkg/kv/kvserver/queue_test.go line 912 at r1 (raw file):
    const rmReplCount = 2
    repls[0].replicaID = 2
    if err := tc.store.RemoveReplica(context.Background(), repls[1], repls[1].Desc().NextReplicaID, RemoveOptions{
nit: ctx is in scope.
pkg/kv/kvserver/queue_test.go line 939 at r1 (raw file):
        return errors.Errorf("expected 0 purgatory replicas; got %d", v)
    }
    // Verify there are no replicas left in the replica set after finishing
Do we want to test that repls[0] (with the new replica ID) can now be added back into the queue? It may be worth pulling out a separate, smaller test case for that.
Force-pushed from bb2bacd to 7e3d35e
kvoli left a comment
Thanks for the review! It certainly was satisfying writing the (short) fix.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/queue_test.go line 912 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: ctx is in scope.
Updated to use the ctx in scope.
pkg/kv/kvserver/queue_test.go line 939 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we want to test that repls[0] (with the new replica ID) can now be added back into the queue? It may be worth pulling out a separate, smaller test case for that.
Good idea, I added this below.
nvb left a comment
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)
pkg/kv/kvserver/queue_test.go line 964 at r2 (raw file):
    beforeSuccessCount := bq.successes.Count()
    beforeFailureCount := bq.failures.Count()
    bq.maybeAdd(context.Background(), repls[0], hlc.ClockTimestamp{})
nit: ctx is in scope here as well.
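As a toy illustration of the before/after-counter shape of this re-add check (everything here, including toyQueue, is hypothetical and not the actual kvserver test code):

```go
package toyqueue

import "testing"

// toyQueue is a hypothetical stand-in for the queue's replica set and
// success counter; it is not the kvserver API.
type toyQueue struct {
	tracked   map[int64]bool
	successes int
}

func (q *toyQueue) maybeAdd(id int64) {
	if q.tracked[id] {
		return // already tracked; a stale "processing" entry would land here
	}
	q.tracked[id] = true
}

func (q *toyQueue) processAll() {
	for id := range q.tracked {
		delete(q.tracked, id)
		q.successes++
	}
}

// TestReAddAfterPurgatoryCleanup mirrors the shape of the suggested check:
// record the success count, re-add the range under its new replica ID,
// process, and expect exactly one additional success.
func TestReAddAfterPurgatoryCleanup(t *testing.T) {
	q := &toyQueue{tracked: map[int64]bool{}}
	// The fixed purgatory path has already dropped range 1 from the
	// replica set, so tracked starts empty.
	before := q.successes
	q.maybeAdd(1)
	q.processAll()
	if got := q.successes - before; got != 1 {
		t.Fatalf("expected 1 new success after re-add, got %d", got)
	}
}
```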
Force-pushed from 7e3d35e to a24ba7f
kvoli left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/queue_test.go line 964 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: ctx is in scope here as well.
Updated.
bors r=nvanbenschoten
Build failed:
Retrying bors r=nvanbenschoten
Build succeeded:
Closes cockroachdb#132546.

No patch release of v22.2 and earlier has cockroachdb#114365, so they have the potential to be flaky when evacuating nodes using zone configs. Don't use zone config movement for these versions.

Release note: None
It was possible for a replica to become stuck in a processing state in a
queue's replica set. This could occur when a replica had recently been
removed from purgatory for processing, but was then destroyed or had its
replica ID changed before being processed.

When this occurred, the replica could never be processed by the queue
again, potentially leading to decommission stalls, constraint violations,
or under- or over-replication.

Remove the replica from the queue's replica set upon encountering a
replica which was destroyed, or whose replica ID changed, while
processing purgatory. This prevents the replica from becoming stuck in a
processing state in the queue's replica set.

Fixes: #112761
Fixes: #110761

Release note (bug fix): Purgatory replicas whose replica ID has changed,
or which have been destroyed, are no longer left stuck in the store
queues; once re-added, they can be processed by the respective queue
again.
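To make the failure mode described above concrete, here is a small, self-contained sketch. Everything in it (queue, queueItem, maybeAdd, failPurgatoryItem) is a simplified stand-in rather than the CockroachDB implementation; the assumption it encodes is simply that a queue will not re-add a range it still tracks.

```go
package main

import "fmt"

// Simplified stand-ins; the real baseQueue tracks far more state.
type rangeID int64

type queueItem struct {
	processing bool // set once the replica is pulled out of purgatory for processing
}

type queue struct {
	replicas map[rangeID]*queueItem // the queue's replica set
}

// maybeAdd skips ranges the queue already tracks, including an entry left
// behind in the "processing" state. That skip is what turns a forgotten
// cleanup into a permanently stuck range.
func (q *queue) maybeAdd(id rangeID) bool {
	if _, ok := q.replicas[id]; ok {
		return false // already tracked: nothing to do
	}
	q.replicas[id] = &queueItem{}
	return true
}

// failPurgatoryItem models a purgatory item whose replica was destroyed or
// whose replica ID changed. Without the fix the replica-set entry lingers;
// with the fix it is deleted so the range can be re-added later.
func (q *queue) failPurgatoryItem(id rangeID, withFix bool) {
	if withFix {
		delete(q.replicas, id)
	}
}

func run(withFix bool) {
	q := &queue{replicas: map[rangeID]*queueItem{}}
	q.maybeAdd(1)
	q.replicas[1].processing = true // pulled out of purgatory for processing
	q.failPurgatoryItem(1, withFix)
	fmt.Printf("withFix=%v, re-add accepted=%v\n", withFix, q.maybeAdd(1))
}

func main() {
	run(false) // re-add accepted=false: the range is wedged
	run(true)  // re-add accepted=true: normal operation resumes
}
```

Run as-is, the first call reports the re-add being rejected (the stuck state), while the second shows the fixed cleanup letting the range be queued and processed again.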