Description
The recently added decommission nudger (#130117) is responsible for periodically enqueuing ranges with decommissioning replicas to the replicate queue. However, recent escalations have shown that when many replicate queue actions are pending, such replicas can still block decommissioning; they stop blocking only after being manually enqueued to the replicate queue via the DB console.
We have an issue (#148090) to get better visibility into whether the nudger is doing its job, but assuming it is, the way it enqueues replicas still differs from manual enqueuing:
- The nudger enqueues the replica via AddAsync, with a mid-level priority corresponding to AllocatorReplaceDecommissioningVoter. Then the replica waits for its turn to be processed.
- The manual enqueuing (with or without skipShouldQueue) calls store.Enqueue with async=false, and the replica is processed directly.
We should consider either changing the priority associated with AllocatorReplaceDecommissioningVoter, or allowing the nudger to process replicas (more) directly.
Companion issue: #148090.
Jira issue: CRDB-52839