Describe the problem
We have been seeing decommission stalls. It remains unclear whether the process was truly stuck or just slow. This issue summarizes my findings so far. Interestingly, the decommission nudger (`maybeEnqueueProblemRange`, `cockroach/pkg/kv/kvserver/replica.go`, line 2568 in `47deb11`: `func (r *Replica) maybeEnqueueProblemRange(`) comes up repeatedly in the findings below.
From reading the code, there are several things that could contribute to decommissioning slowness (though none fully explain a true stall):
- Previously, two things were done to help with decommissioning. server: react to decommissioning nodes by proactively enqueuing their replicas #80993 introduced a callback that proactively enqueues all replicas on a node into the replicate queue when the node is detected to be decommissioning. Because processing may fail partway through, kvserver: retry failures to rebalance decommissioning replicas #81005 added replicas that failed with decommissioning actions to the purgatory queue, which allows retrying at a faster interval.
- Two main issues here:
  - If `replicaCanBeProcessed` fails, the replica is not added to the purgatory queue, since the errors it returns do not pass `IsPurgatoryError` (see the purgatory sketch after this list). It then has to wait for the next replica-scanner pass (every 10 minutes; the scanner by default also checks `shouldQueue`, `cockroach/pkg/kv/kvserver/replicate_queue.go`, line 614 in `18dfffd`: `func (rq *replicateQueue) shouldQueue(`) or for the decommission nudger. If processing fails again (`cockroach/pkg/kv/kvserver/queue.go`, line 1321 in `18dfffd`: `bq.finishProcessingReplica(ctx, stopper, repl, err)`), the same story repeats. The failure can happen at either of these places:
    - `cockroach/pkg/kv/kvserver/queue.go`, line 911 in `18dfffd`: `if _, err := bq.replicaCanBeProcessed(ctx, repl, false /* acquireLeaseIfNeeded */); err != nil {`
    - `cockroach/pkg/kv/kvserver/queue.go`, line 963 in `18dfffd`: `conf, err := bq.replicaCanBeProcessed(ctx, repl, true /* acquireLeaseIfNeeded */)`
  - Since we rely on a replica being fully processed in one shot, this also works poorly when the computed action is a lease transfer (`transferOp, err := rp.maybeTransferLeaseAwayTarget(`); in that case we rely on the new leaseholder replica to remove the replicas on the decommissioning nodes, but the new leaseholder replicas don't get enqueued.
- The nudger should help with the two issues above: it enqueues the leaseholder replicas of ranges with decommissioning replicas into the priority queue. This generally helps, but if `replicaCanBeProcessed` fails again as above, we are forced to wait for the next nudger cycle or replica-scanner pass.
- One more complication: replicas are enqueued every 15 minutes, but processing replicas takes time. Ideally they should be placed at the front of the queue when there are not many add-voter, replace-learner, or replace-dead actions ahead of them (`case AllocatorReplaceDecommissioningVoter:`).
- A key difference with manual enqueue is that it uses `async=false`, which bypasses the priority queue entirely (see the priority-queue sketch after this list; `cockroach/pkg/kv/kvserver/stores_base.go`, line 64 in `18dfffd`: `trace, processErr, enqueueErr := store.Enqueue(ctx, queue, repl, skipShouldQueue, false /* async */)`). This means that if replicas are truly stuck and never processed, the issue may lie in the base queue, possibly similar to kvserver: remove changed replicas in purgatory from replica set #114365.
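
To make the purgatory point from the "two main issues" item concrete, here is a minimal sketch of the gating pattern as I understand it. All names and types here are illustrative stand-ins, not the actual `baseQueue` code:

```go
// Illustrative sketch only, not the actual queue code: a failed replica is
// placed in purgatory (which retries on a short interval) only when the error
// carries a purgatory marker; any other failure is dropped and waits for the
// next replica-scanner pass (~10 minutes) or the decommission nudger.
package purgatorysketch

import "errors"

// purgatoryError is a stand-in for the marker interface used to decide
// whether a failed replica should be retried from purgatory.
type purgatoryError interface {
	error
	purgatoryErrorMarker()
}

type baseQueue struct {
	purgatory map[int64]error // keyed by range ID, retried frequently
}

// finishProcessing mirrors the decision described in the list above: only
// marker errors reach purgatory; errors from a precondition check like
// replicaCanBeProcessed are not markers, so the replica is simply dropped.
func (bq *baseQueue) finishProcessing(rangeID int64, err error) {
	if err == nil {
		return
	}
	var pErr purgatoryError
	if errors.As(err, &pErr) {
		if bq.purgatory == nil {
			bq.purgatory = map[int64]error{}
		}
		bq.purgatory[rangeID] = err // fast retry path
		return
	}
	// Non-purgatory failure: nothing re-enqueues this replica until the
	// replica scanner (with its shouldQueue check), the nudger, or a manual
	// enqueue runs again.
}
```

If the two `replicaCanBeProcessed` call sites above return ordinary errors, we end up in the final branch, which matches the observed behaviour of progress only resuming at the next scan, the next nudger cycle, or a manual enqueue.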
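A second sketch for the priority-ordering point above: the async enqueue path competes in a priority queue with other allocator actions, while the manual-enqueue path (`async=false`) processes directly and never consults that ordering. Again, every name here is an illustrative stand-in rather than the real API:

```go
// Illustrative sketch only, not the real base queue. The async path adds items
// to a priority-ordered heap, so decommissioning work competes with add-voter
// and replace-dead work; the synchronous path used by manual enqueue
// (async=false) processes the replica directly and bypasses that ordering.
package prioritysketch

import "container/heap"

type item struct {
	rangeID  int64
	priority float64 // e.g. derived from the allocator action
}

// prioQueue implements heap.Interface; higher priority pops first.
type prioQueue []item

func (q prioQueue) Len() int           { return len(q) }
func (q prioQueue) Less(i, j int) bool { return q[i].priority > q[j].priority }
func (q prioQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *prioQueue) Push(x any)        { *q = append(*q, x.(item)) }

func (q *prioQueue) Pop() any {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

type baseQueue struct {
	pq      prioQueue
	process func(rangeID int64) error
}

// addAsync is the normal path: the item waits behind higher-priority work.
func (bq *baseQueue) addAsync(rangeID int64, priority float64) {
	heap.Push(&bq.pq, item{rangeID: rangeID, priority: priority})
}

// enqueueSync mimics the manual-enqueue path (async=false): it processes the
// replica immediately, skipping the priority ordering entirely.
func (bq *baseQueue) enqueueSync(rangeID int64) error {
	return bq.process(rangeID)
}
```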
Other notable things:
- Destroy status: in the ranges from escalations, a few replicas were waiting for GC, raising the question of whether some replicas were incorrectly labelled as destroyed (see the sketch at the end of this list).
  - `cockroach/pkg/kv/kvserver/store.go`, line 555 in `0d0a35d`: `destroyed := repl.mu.destroyStatus`
  - `cockroach/pkg/kv/kvserver/queue.go`, lines 1310 to 1311 in `18dfffd`: `repl, err := bq.getReplica(item.rangeID)` / `if err != nil || item.replicaID != repl.ReplicaID() {`
- `maybeSwitchLeaseType`: this is not called from manual enqueue but from `replicaCanBeProcessed`. Is it necessary to call it right after `redirectOnOrAcquireLease` (`cockroach/pkg/kv/kvserver/queue.go`, line 1070 in `18dfffd`: `if pErr != nil {`), which already calls `requestLeaseLocked`?
- Is it intentional to call `replicaCanBeProcessed` here (`cockroach/pkg/kv/kvserver/queue.go`, line 911 in `18dfffd`: `if _, err := bq.replicaCanBeProcessed(ctx, repl, false /* acquireLeaseIfNeeded */); err != nil {`)? `processReplica` calls into it again (`cockroach/pkg/kv/kvserver/queue.go`, line 922 in `18dfffd`: `err := bq.processReplica(ctx, repl)`). This also seems like a behaviour change from c9cf068, which looks unintended.
- The customer performs draining and node restarts on the cluster, so it's possible some stalls were caused by stale draining state. However, that wouldn't quite explain why manual enqueue unblocks the stall.
- I also found kvserver: nodes flapping on their liveness can stall cluster recovery operations #79266, which can add instability in the presence of node liveness flakiness.
- In both cases, a lease transfer happened to the range before decommissioning. However, the stalled state doesn’t require the leaseholder itself to be decommissioned.
- We have previously seen decommission stalls linked to a prolonged leaseholder/leader split in roachtests (roachtest: decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=2/multi-region failed #148884, roachtest: decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=3/multi-region failed #151190). I don't think this is what we saw in the escalations so far, since those stalls happened when follower replicas were decommissioned (no lease transfer needed), and it doesn't explain why manual enqueueing unblocks progress.
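
As a rough illustration of the destroy-status and stale-replica-ID checks quoted in the destroy-status item above (names are stand-ins, not the actual store/queue code), the gate being questioned looks roughly like this:

```go
// Illustrative sketch only: the kind of pre-processing gate referenced above.
// A queued item is skipped when the replica can't be found, when its replica
// ID no longer matches the one that was queued, or when it is marked
// destroyed (e.g. removed and waiting for GC).
package destroysketch

type queuedItem struct {
	rangeID   int64
	replicaID int32
}

type replica struct {
	replicaID int32
	destroyed bool
}

// shouldProcess returns false for items the queue would drop. If a replica is
// wrongly considered destroyed, or a queued/purgatory item holds a stale
// replica ID (possibly related to #114365), the decommissioning work for that
// range never runs.
func shouldProcess(item queuedItem, getReplica func(int64) (*replica, bool)) bool {
	repl, ok := getReplica(item.rangeID)
	if !ok || repl.replicaID != item.replicaID {
		return false // stale item: replica gone or replaced
	}
	if repl.destroyed {
		return false // destroyed (or mislabelled) replicas are skipped
	}
	return true
}
```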
Jira issue: CRDB-53466