Describe the problem
We have been seeing decommission stalls. It remains unclear whether the process was truly stuck or just slow. This issue summarizes my findings so far. Interestingly, the decommission nudger (`maybeEnqueueProblemRange`, `cockroach/pkg/kv/kvserver/replica.go`, line 2568 in `47deb11`: `func (r *Replica) maybeEnqueueProblemRange(`) comes up repeatedly in the findings below.
From reading the code, there are several things that could contribute to decommissioning slowness (though none fully explain a true stall):
- Previously, two things were done to help with decommissioning. server: react to decommissioning nodes by proactively enqueuing their replicas #80993 introduced a callback that proactively enqueues all replicas on a node into the replicate queue when the node is detected to be decommissioning. Because processing may fail partway through, kvserver: retry failures to rebalance decommissioning replicas #81005 added replicas that failed with decommissioning actions to the purgatory queue, which allows retrying at a faster interval.
- Two main issues here:
  - If `replicaCanBeProcessed` fails, the replica is not added to the purgatory queue, since the errors it returns do not pass `IsPurgatoryError` (see the purgatory sketch after this list). It then has to wait for the next replica-scanner pass (every 10 minutes; the scanner by default also checks `shouldQueue`, `cockroach/pkg/kv/kvserver/replicate_queue.go`, line 614 in `18dfffd`: `func (rq *replicateQueue) shouldQueue(`) or for the decommission nudger. If processing fails again (`cockroach/pkg/kv/kvserver/queue.go`, line 1321 in `18dfffd`: `bq.finishProcessingReplica(ctx, stopper, repl, err)`), the same story repeats. The failure can happen at either of these places:
    - `cockroach/pkg/kv/kvserver/queue.go`, line 911 in `18dfffd`: `if _, err := bq.replicaCanBeProcessed(ctx, repl, false /* acquireLeaseIfNeeded */); err != nil {`
    - `cockroach/pkg/kv/kvserver/queue.go`, line 963 in `18dfffd`: `conf, err := bq.replicaCanBeProcessed(ctx, repl, true /* acquireLeaseIfNeeded */)`
  - Since we rely on a replica being fully processed in one shot, this also works poorly when the computed action is a lease transfer (`transferOp, err := rp.maybeTransferLeaseAwayTarget(`); in that case we rely on the new leaseholder replica to remove the replicas on the decommissioning nodes, but the new leaseholder replicas don't get enqueued.
- The nudger should help with the two issues above: it enqueues the leaseholder replicas of ranges with decommissioning replicas into the priority queue. This generally helps, but if `replicaCanBeProcessed` fails again as above, we are forced to wait for the next nudger cycle or replica-scanner pass.
- One more complication: replicas are enqueued every 15 minutes, but processing replicas takes time. Ideally they should be placed at the front of the queue when there are not many add-voter, replace-learner, or replace-dead actions ahead of them (`case AllocatorReplaceDecommissioningVoter:`).
- A key difference with manual enqueue is that it uses `async=false`, which bypasses the priority queue entirely (see the priority-queue sketch after this list; `cockroach/pkg/kv/kvserver/stores_base.go`, line 64 in `18dfffd`: `trace, processErr, enqueueErr := store.Enqueue(ctx, queue, repl, skipShouldQueue, false /* async */)`). This means that if replicas are truly stuck and never processed, the issue may lie in the base queue, possibly similar to kvserver: remove changed replicas in purgatory from replica set #114365.
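
To make the purgatory point from the "two main issues" item concrete, here is a minimal sketch of the gating pattern as I understand it. All names and types here are illustrative stand-ins, not the actual `baseQueue` code:

```go
// Illustrative sketch only, not the actual queue code: a failed replica is
// placed in purgatory (which retries on a short interval) only when the error
// carries a purgatory marker; any other failure is dropped and waits for the
// next replica-scanner pass (~10 minutes) or the decommission nudger.
package purgatorysketch

import "errors"

// purgatoryError is a stand-in for the marker interface used to decide
// whether a failed replica should be retried from purgatory.
type purgatoryError interface {
	error
	purgatoryErrorMarker()
}

type baseQueue struct {
	purgatory map[int64]error // keyed by range ID, retried frequently
}

// finishProcessing mirrors the decision described in the list above: only
// marker errors reach purgatory; errors from a precondition check like
// replicaCanBeProcessed are not markers, so the replica is simply dropped.
func (bq *baseQueue) finishProcessing(rangeID int64, err error) {
	if err == nil {
		return
	}
	var pErr purgatoryError
	if errors.As(err, &pErr) {
		if bq.purgatory == nil {
			bq.purgatory = map[int64]error{}
		}
		bq.purgatory[rangeID] = err // fast retry path
		return
	}
	// Non-purgatory failure: nothing re-enqueues this replica until the
	// replica scanner (with its shouldQueue check), the nudger, or a manual
	// enqueue runs again.
}
```

If the two `replicaCanBeProcessed` call sites above return ordinary errors, we end up in the final branch, which matches the observed behaviour of progress only resuming at the next scan, the next nudger cycle, or a manual enqueue.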
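A second sketch for the priority-ordering point above: the async enqueue path competes in a priority queue with other allocator actions, while the manual-enqueue path (`async=false`) processes directly and never consults that ordering. Again, every name here is an illustrative stand-in rather than the real API:

```go
// Illustrative sketch only, not the real base queue. The async path adds items
// to a priority-ordered heap, so decommissioning work competes with add-voter
// and replace-dead work; the synchronous path used by manual enqueue
// (async=false) processes the replica directly and bypasses that ordering.
package prioritysketch

import "container/heap"

type item struct {
	rangeID  int64
	priority float64 // e.g. derived from the allocator action
}

// prioQueue implements heap.Interface; higher priority pops first.
type prioQueue []item

func (q prioQueue) Len() int           { return len(q) }
func (q prioQueue) Less(i, j int) bool { return q[i].priority > q[j].priority }
func (q prioQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *prioQueue) Push(x any)        { *q = append(*q, x.(item)) }

func (q *prioQueue) Pop() any {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

type baseQueue struct {
	pq      prioQueue
	process func(rangeID int64) error
}

// addAsync is the normal path: the item waits behind higher-priority work.
func (bq *baseQueue) addAsync(rangeID int64, priority float64) {
	heap.Push(&bq.pq, item{rangeID: rangeID, priority: priority})
}

// enqueueSync mimics the manual-enqueue path (async=false): it processes the
// replica immediately, skipping the priority ordering entirely.
func (bq *baseQueue) enqueueSync(rangeID int64) error {
	return bq.process(rangeID)
}
```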
Other notable things:
- Destroy status: in the ranges from escalations, a few replicas were waiting for GC, raising the question of whether some replicas were incorrectly labelled as destroyed (see the sketch at the end of this list).
  - `cockroach/pkg/kv/kvserver/store.go`, line 555 in `0d0a35d`: `destroyed := repl.mu.destroyStatus`
  - `cockroach/pkg/kv/kvserver/queue.go`, lines 1310 to 1311 in `18dfffd`: `repl, err := bq.getReplica(item.rangeID)` / `if err != nil || item.replicaID != repl.ReplicaID() {`
- `maybeSwitchLeaseType`: this is not called from manual enqueue but from `replicaCanBeProcessed`. Is it necessary to call it right after `redirectOnOrAcquireLease` (`cockroach/pkg/kv/kvserver/queue.go`, line 1070 in `18dfffd`: `if pErr != nil {`), which already calls `requestLeaseLocked`?
- Is it intentional to call `replicaCanBeProcessed` here (`cockroach/pkg/kv/kvserver/queue.go`, line 911 in `18dfffd`: `if _, err := bq.replicaCanBeProcessed(ctx, repl, false /* acquireLeaseIfNeeded */); err != nil {`)? `processReplica` calls into it again (`cockroach/pkg/kv/kvserver/queue.go`, line 922 in `18dfffd`: `err := bq.processReplica(ctx, repl)`). This also seems like a behaviour change from c9cf068, which looks unintended.
- The customer performs draining and node restarts on the cluster, so it's possible some stalls were caused by stale draining state. However, that wouldn't quite explain why manual enqueue unblocks the stall.
- I also found kvserver: nodes flapping on their liveness can stall cluster recovery operations #79266, which can add instability in the presence of node liveness flakiness.
- In both cases, a lease transfer happened to the range before decommissioning. However, the stalled state doesn’t require the leaseholder itself to be decommissioned.
- We have previously seen decommission stalls linked to a prolonged leaseholder/leader split in roachtests (roachtest: decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=2/multi-region failed #148884, roachtest: decommissionBench/nodes=6/warehouses=1000/drain-first/while-upreplicating/target=3/multi-region failed #151190). I don't think this is what we saw in the escalations so far, since those stalls happened when follower replicas were decommissioned (no lease transfer needed), and it doesn't explain why manual enqueueing unblocks progress.
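
As a rough illustration of the destroy-status and stale-replica-ID checks quoted in the destroy-status item above (names are stand-ins, not the actual store/queue code), the gate being questioned looks roughly like this:

```go
// Illustrative sketch only: the kind of pre-processing gate referenced above.
// A queued item is skipped when the replica can't be found, when its replica
// ID no longer matches the one that was queued, or when it is marked
// destroyed (e.g. removed and waiting for GC).
package destroysketch

type queuedItem struct {
	rangeID   int64
	replicaID int32
}

type replica struct {
	replicaID int32
	destroyed bool
}

// shouldProcess returns false for items the queue would drop. If a replica is
// wrongly considered destroyed, or a queued/purgatory item holds a stale
// replica ID (possibly related to #114365), the decommissioning work for that
// range never runs.
func shouldProcess(item queuedItem, getReplica func(int64) (*replica, bool)) bool {
	repl, ok := getReplica(item.rangeID)
	if !ok || repl.replicaID != item.replicaID {
		return false // stale item: replica gone or replaced
	}
	if repl.destroyed {
		return false // destroyed (or mislabelled) replicas are skipped
	}
	return true
}
```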
Jira issue: CRDB-53466