
kvserver: decommission slowness/stall #151775

@wenyihu6

Description


Describe the problem

We have been seeing decommission stalls. It remains unclear whether the process was truly stuck or just slow. This issue summarizes my findings so far. Interestingly, the decommission nudger (`(*Replica).maybeEnqueueProblemRange`) didn't help, though according to customers, manually enqueuing ranges through the replicate queue consistently unblocks the process.
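For context, the nudger pattern boils down to a periodic scan that re-enqueues the leaseholder replicas of still-decommissioning ranges into the replicate queue. A minimal sketch of that shape, using hypothetical types (`rangeDesc`, `replicateQueue`) rather than the real kvserver ones:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the real kvserver types; names are illustrative
// only and do not match the CockroachDB implementation.
type rangeDesc struct {
	rangeID         int64
	leaseholder     bool // true if this store holds the lease
	decommissioning bool // true if some replica sits on a decommissioning node
}

type replicateQueue struct{ ch chan int64 }

// maybeAdd enqueues the range for replication work; a real queue would
// deduplicate and order by priority.
func (q *replicateQueue) maybeAdd(rangeID int64) {
	select {
	case q.ch <- rangeID:
	default: // queue full; the next nudger cycle will retry
	}
}

// nudge scans once and enqueues leaseholder replicas of ranges that still
// have replicas on decommissioning nodes.
func nudge(ranges []rangeDesc, q *replicateQueue) {
	for _, r := range ranges {
		if r.decommissioning && r.leaseholder {
			q.maybeAdd(r.rangeID)
		}
	}
}

func main() {
	q := &replicateQueue{ch: make(chan int64, 16)}
	ranges := []rangeDesc{
		{rangeID: 1, leaseholder: true, decommissioning: true},
		{rangeID: 2, leaseholder: false, decommissioning: true}, // not ours to fix
	}

	// Periodic nudger: if processing a range fails, nothing retries it until
	// the next tick, which is one source of perceived slowness.
	tick := time.NewTicker(10 * time.Millisecond)
	defer tick.Stop()
	for i := 0; i < 2; i++ {
		<-tick.C
		nudge(ranges, q)
	}

	close(q.ch)
	for id := range q.ch {
		fmt.Println("enqueued range", id)
	}
}
```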

From reading the code, there are several things that could contribute to decommissioning slowness (though none fully explain a true stall):

  1. Previously, two things have been done to help with decommissioning. server: react to decommissioning nodes by proactively enqueuing their replicas #80993 introduced a callback that proactively enqueues all replicas on a node into the replicate queue when the node is detected to be decommissioning. Because processing may fail partway through, kvserver: retry failures to rebalance decommissioning replicas #81005 added replicas whose decommissioning actions failed to the purgatory queue, which retries them at a faster interval.
  2. Two main issues here:
     1. The nudger should help with the two issues above: it enqueues the leaseholder replica of each decommissioning range into the priority queue. This generally helps, but if `replicaCanBeProcessed` from above fails again, we are forced to wait for the next nudger cycle / replica scanner pass.
     2. One more complication: replicas are enqueued every 15 minutes, but processing them takes time. Ideally, decommissioning work should land near the front of the queue when there isn't much add-voter, replace-learner, or replace-dead work ahead of it (see `case AllocatorReplaceDecommissioningVoter` in the allocator priorities); a sketch of this ordering follows the list.
     3. A key difference with a manual enqueue is that it passes `async=false` (`trace, processErr, enqueueErr := store.Enqueue(ctx, queue, repl, skipShouldQueue, false /* async */)`), which bypasses the priority queue entirely. This means that if replicas are truly stuck and never processed, the issue may lie in the base queue itself - possibly similar to kvserver: remove changed replicas in purgatory from replica set #114365 (a sketch of the async/sync split also follows the list).
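To illustrate the ordering point above: the replicate queue's internal priority queue processes higher-priority work first, so decommissioning replacements can sit behind a backlog of replace-dead/add-voter work. The priority constants below are assumptions for illustration, not the allocator's real values:

```go
package main

import (
	"container/heap"
	"fmt"
)

// Illustrative priorities only; the real allocator derives these from
// AllocatorAction values, and the exact numbers here are an assumption.
const (
	prioReplaceDeadVoter            = 12000
	prioAddVoter                    = 10000
	prioReplaceDecommissioningVoter = 5000
)

type queuedRange struct {
	rangeID  int64
	priority float64
}

// prioQueue is a max-heap on priority, like the replicate queue's internal
// priority queue: higher-priority work is processed first.
type prioQueue []queuedRange

func (q prioQueue) Len() int            { return len(q) }
func (q prioQueue) Less(i, j int) bool  { return q[i].priority > q[j].priority }
func (q prioQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *prioQueue) Push(x interface{}) { *q = append(*q, x.(queuedRange)) }
func (q *prioQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &prioQueue{}
	heap.Init(q)
	// A decommissioning replacement queued behind other repair work.
	heap.Push(q, queuedRange{rangeID: 7, priority: prioReplaceDecommissioningVoter})
	heap.Push(q, queuedRange{rangeID: 3, priority: prioReplaceDeadVoter})
	heap.Push(q, queuedRange{rangeID: 5, priority: prioAddVoter})

	// The decommissioning range is only reached after the higher-priority
	// items, so a deep backlog translates directly into decommission latency.
	for q.Len() > 0 {
		fmt.Printf("processing r%d\n", heap.Pop(q).(queuedRange).rangeID)
	}
}
```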
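And to illustrate the `async=false` observation: a synchronous enqueue processes the replica on the spot, while an async add only parks it in the priority queue and relies on the queue's own processing loop to ever get to it. A hedged sketch with hypothetical types (`baseQueue`, `process`), not the real `store.Enqueue` implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical queue that mimics the relevant difference: async adds go
// through the priority queue and wait for the processing loop, while a
// synchronous enqueue (async=false, as in the manual path) processes the
// replica immediately, bypassing the queue.
type baseQueue struct {
	pending []int64 // stands in for the real priority queue
}

func (bq *baseQueue) process(rangeID int64) error {
	// Stand-in for replicate-queue processing (lease checks, allocator, etc.).
	if rangeID%2 == 0 {
		return errors.New("replica cannot be processed right now")
	}
	fmt.Printf("processed r%d\n", rangeID)
	return nil
}

// enqueue mirrors the async/sync split: with async=true the item simply sits
// in the priority queue until the processing loop picks it up; with
// async=false it is processed on the spot.
func (bq *baseQueue) enqueue(rangeID int64, async bool) error {
	if async {
		bq.pending = append(bq.pending, rangeID)
		return nil // success here only means "queued", not "processed"
	}
	return bq.process(rangeID)
}

func main() {
	bq := &baseQueue{}

	// The nudger-style path: queued, but whether it ever gets processed
	// depends entirely on the queue's own loop.
	_ = bq.enqueue(7, true /* async */)

	// The manual path: processed immediately, so it can unblock a range even
	// when the queue itself is the thing that is stuck.
	if err := bq.enqueue(7, false /* async */); err != nil {
		fmt.Println("sync enqueue failed:", err)
	}
	fmt.Println("still pending in the queue:", bq.pending)
}
```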

Other notable things:
- Destroy status: in the ranges from escalations, a few replicas were waiting for GC, raising the question of whether some replicas were incorrectly labelled as destroyed (`destroyed := repl.mu.destroyStatus`), or whether the related base-queue check (`repl, err := bq.getReplica(item.rangeID); if err != nil || item.replicaID != repl.ReplicaID()`) is buggy. I don't think this is the case; a sketch of that guard is below.
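For reference, that guard follows roughly this shape: before processing a queued item, the base queue re-fetches the replica and drops the item on any mismatch. The sketch below uses hypothetical types (`store`, `queueItem`) to show how a replica that is destroyed/waiting for GC, or whose replica ID changed, would be dropped rather than retried:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical shapes for the base-queue guard: these do not match the real
// kvserver types, only the pattern the issue points at.
type replica struct {
	rangeID   int64
	replicaID int32
	destroyed bool
}

type queueItem struct {
	rangeID   int64
	replicaID int32
}

type store struct{ replicas map[int64]*replica }

func (s *store) getReplica(rangeID int64) (*replica, error) {
	r, ok := s.replicas[rangeID]
	if !ok {
		return nil, errors.New("replica not found")
	}
	return r, nil
}

// shouldProcess mirrors the pattern from the base queue: any mismatch means
// the item is dropped rather than retried, which could look like a stall if
// the check is wrong (e.g. a replica incorrectly marked as destroyed).
func shouldProcess(s *store, item queueItem) bool {
	repl, err := s.getReplica(item.rangeID)
	if err != nil || item.replicaID != repl.replicaID || repl.destroyed {
		return false
	}
	return true
}

func main() {
	s := &store{replicas: map[int64]*replica{
		1: {rangeID: 1, replicaID: 3},
		2: {rangeID: 2, replicaID: 5, destroyed: true}, // e.g. waiting for GC
	}}
	items := []queueItem{
		{rangeID: 1, replicaID: 3}, // processed
		{rangeID: 1, replicaID: 2}, // stale replica ID: dropped
		{rangeID: 2, replicaID: 5}, // destroyed / waiting for GC: dropped
	}
	for _, it := range items {
		fmt.Printf("r%d/%d process=%v\n", it.rangeID, it.replicaID, shouldProcess(s, it))
	}
}
```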

Jira issue: CRDB-53466


Labels

- A-kv: Anything in KV that doesn't belong in a more specific category.
- A-kv-decom-rolling-restart: Decommission and Rolling Restarts
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
- P-3: Issues/test failures with no fix SLA
- T-kv: KV Team
- branch-master: Failures and bugs on the master branch.
