
kvserver: proactively enqueue replicas for a decommissioning node #79453

@aayushshah15

Description

Came up in a conversation with, and proposed by, @lidorcarmel.

Today, the replicaScanner on each store in the cluster continuously paces through its replicas, such that it processes each replica roughly once every 10 minutes. As it visits each replica, the replicaScanner considers it for enqueuing into each of the store's queues.

When a node is decommissioning, this status gets broadcast to other nodes in the cluster via gossip. So when the replicateQueue encounters a replica that is the leaseholder for a range that has a replica on a decommissioning node, it decides to take action to move that decommissioning replica away.

The issue here is that the discovery of decommissioning replicas is limited by the replicaScanner's 10 minute scanning interval. This means that, in general, merely discovering all the replicas that belong to a decommissioning node takes ~10 minutes. Furthermore, if there are any errors processing one of these decommissioning replicas, it will not be re-processed for another 10 minutes.

This issue proposes that a node should proactively enqueue all replicas belonging to a decommissioning node into the replicateQueues of all of its stores, the moment it learns that the decommissioning node's liveness record has changed from LIVE to DECOMMISSIONING. Care will need to be taken to ensure that these replicas are enqueued exactly once per transition of a node's status to DECOMMISSIONING. Doing this should considerably cut down on how long it takes to decommission nodes in almost all scenarios and, anecdotally, it also seems like the behaviour that operators intuitively expect.
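A minimal sketch of the "exactly once per transition" requirement, under stated assumptions: every name here (`decommissionWatcher`, `onLivenessUpdate`, the `Status` constants, and the `replicasOn` and `enqueue` callbacks) is hypothetical and not part of the existing kvserver code. The sketch remembers the last status seen for each node and enqueues only when the status changes to decommissioning, so a re-delivered gossip update for the same status does nothing. It also assumes that a node seen for the first time was previously live.

```go
package main

import "sync"

// Hypothetical identifiers and statuses, for illustration only.
type NodeID int
type RangeID int
type Status int

const (
	Live Status = iota
	Decommissioning
)

// decommissionWatcher tracks the last status observed for each node and
// enqueues replicas only on a transition into Decommissioning.
type decommissionWatcher struct {
	mu   sync.Mutex
	last map[NodeID]Status
	// replicasOn returns the local ranges that have a replica on the node.
	replicasOn func(NodeID) []RangeID
	// enqueue stands in for adding a replica to a store's replicateQueue.
	enqueue func(RangeID)
}

func newDecommissionWatcher(
	replicasOn func(NodeID) []RangeID, enqueue func(RangeID),
) *decommissionWatcher {
	return &decommissionWatcher{
		last:       make(map[NodeID]Status),
		replicasOn: replicasOn,
		enqueue:    enqueue,
	}
}

// onLivenessUpdate is called for every liveness update, including repeats.
// It returns the number of replicas it enqueued.
func (w *decommissionWatcher) onLivenessUpdate(id NodeID, s Status) int {
	w.mu.Lock()
	prev := w.last[id] // Assumption: unseen nodes default to Live.
	w.last[id] = s
	w.mu.Unlock()

	if s != Decommissioning || prev == Decommissioning {
		return 0 // Not a transition into Decommissioning.
	}
	ranges := w.replicasOn(id)
	for _, r := range ranges {
		w.enqueue(r)
	}
	return len(ranges)
}
```

With this shape, a node that is recommissioned and later decommissioned again would trigger a fresh round of enqueuing, since that is a new transition. Whether that is the desired behaviour is a design choice this sketch does not settle.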

cc @cockroachdb/kv-notifications

Jira issue: CRDB-14873

Epic: CRDB-14621

Metadata

Labels

A-kv-distribution: Relating to rebalancing and leasing.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-postmortem: Originated from a Postmortem action item.
T-kv: KV Team
