server: consider removing the decommission nudger #150667

@miraradeva

The recently added decommission nudger (#130117) periodically enqueues ranges with decommissioning replicas into the replicate queue. However, recent escalations have shown that when the replicate queue has many pending actions, such replicas still block decommissioning; they stop blocking only after being manually enqueued into the replicate queue via the DB Console.

We have an issue (#148090) to improve visibility into whether the nudger is doing its job. Assuming it is, the way it enqueues replicas still differs from manual enqueuing:

  • The nudger enqueues the replica via AddAsync, with a mid-level priority corresponding to AllocatorReplaceDecommissioningVoter. Then the replica waits for its turn to be processed.
  • Manual enqueuing (with or without skipShouldQueue) calls store.Enqueue with async=false, so the replica is processed directly rather than waiting in the queue.
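
To make the difference between the two paths concrete, here is a minimal, self-contained Go sketch. It is not CockroachDB code: `toyQueue`, `addAsync`, `processNow`, and `drain` are hypothetical names, and the priority values are made up for illustration. It only models the general idea that an item added to a priority queue waits behind higher-priority work, while a synchronous call bypasses the wait.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a hypothetical stand-in for a replica waiting to be processed.
type item struct {
	rangeID  int
	priority float64
}

// pq is a max-heap of items ordered by priority.
type pq []item

func (q pq) Len() int           { return len(q) }
func (q pq) Less(i, j int) bool { return q[i].priority > q[j].priority }
func (q pq) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *pq) Push(x any)        { *q = append(*q, x.(item)) }
func (q *pq) Pop() any {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

// toyQueue is an illustrative model, not the real replicate queue.
type toyQueue struct {
	pending   pq
	processed []int
}

// addAsync models the nudger path: the item only enters the heap
// and is processed later, when its priority comes up.
func (q *toyQueue) addAsync(rangeID int, priority float64) {
	heap.Push(&q.pending, item{rangeID: rangeID, priority: priority})
}

// processNow models the manual path: the item is handled immediately,
// regardless of what is pending.
func (q *toyQueue) processNow(rangeID int) {
	q.processed = append(q.processed, rangeID)
}

// drain processes up to n pending items in priority order.
func (q *toyQueue) drain(n int) {
	for i := 0; i < n && q.pending.Len() > 0; i++ {
		it := heap.Pop(&q.pending).(item)
		q.processed = append(q.processed, it.rangeID)
	}
}

func main() {
	q := &toyQueue{}
	// A backlog of higher-priority actions (assumed priorities).
	q.addAsync(1, 10)
	q.addAsync(2, 11)
	q.addAsync(3, 12)
	// The decommissioning replica arrives with a mid-level priority.
	q.addAsync(100, 5)

	q.drain(2)
	fmt.Println(q.processed) // [3 2]: range 100 is still waiting

	q.processNow(100)
	fmt.Println(q.processed) // [3 2 100]: processed without waiting
}
```

In this model, either raising the priority passed to `addAsync` or routing the nudger through something like `processNow` would get range 100 handled sooner, which mirrors the two options under consideration.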

We should consider either changing the priority associated with AllocatorReplaceDecommissioningVoter, or allowing the nudger to process replicas (more) directly.

Companion issue: #148090.

Jira issue: CRDB-52839

Metadata

Assignees

No one assigned

    Labels

      • A-kv-distribution: Relating to rebalancing and leasing.
      • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
      • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
      • P-3: Issues/test failures with no fix SLA
      • T-kv: KV Team
