
Conversation

Contributor

@wenyihu6 wenyihu6 commented Sep 5, 2025

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: potential critical fix and observability improvement for decommission stall

dodeca12 and others added 30 commits September 5, 2025 14:06
Previously, the decommissioning nudger had limited observability, making it
difficult to monitor its effectiveness and diagnose issues during node
decommissioning operations.

This was inadequate because operators couldn't track how many ranges were
being enqueued for decommissioning, nor could they see when the nudger
skipped ranges due to leaseholder status or invalid leases.

To address this, this patch adds comprehensive logging and metrics:
- Logs when the decommissioning nudger enqueues replicas with priority info
- Tracks successful enqueues via DecommissioningNudgerEnqueueEnqueued metric
- Tracks skipped enqueues via
DecommissioningNudgerNotLeaseholderOrInvalidLease metric
- Adds structured logging for debugging nudger behavior
- Includes comprehensive test coverage for the new metrics

TODO:
- Figure out a better way to track decommissioning enqueue failures.
	Currently it's hard to get as close as possible to the source of the
	enqueue failures for logging & metrics purposes; this would require
	rearchitecting the code pathways to ensure we log and track failures
	as close as possible to where they occur.

Fixes: CRDB-51396
Release note: None
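
For illustration, a minimal sketch of the gating and metric updates described above. The two counter names echo the commit message; everything else (the replica interface, the nudger struct, the enqueue hook) is hypothetical:

```go
package nudger

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/metric"
)

// problemReplica is a hypothetical stand-in for the real replica surface.
type problemReplica interface {
	RangeID() int64
	OwnsValidLease(ctx context.Context) bool
	DecommissionPriority() float64
}

type nudgerMetrics struct {
	Enqueued                     *metric.Counter
	NotLeaseholderOrInvalidLease *metric.Counter
}

type decommissioningNudger struct {
	metrics nudgerMetrics
	enqueue func(ctx context.Context, r problemReplica, priority float64)
}

func (n *decommissioningNudger) maybeEnqueueProblemRange(
	ctx context.Context, r problemReplica,
) {
	// Skip (and count) replicas we do not hold a valid lease for.
	if !r.OwnsValidLease(ctx) {
		n.metrics.NotLeaseholderOrInvalidLease.Inc(1)
		return
	}
	priority := r.DecommissionPriority()
	log.Infof(ctx, "decommissioning nudger enqueueing r%d at priority %.2f",
		r.RangeID(), priority)
	n.enqueue(ctx, r, priority)
	n.metrics.Enqueued.Inc(1)
}
```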
This commit refactors isDecommissionAction into allocatorimpl for consistency
with other similar helpers like allocatorActions.{Add,Replace,Remove}. This
change has no behavior changes; it only makes future commits easier.
This commit simplifies the logging in `maybeEnqueueProblemRange` to log two
booleans directly.
Previously, the comment on the queue incorrectly stated that it removes the
lowest-priority element when exceeding its maximum size. This was misleading
because a heap only guarantees that the root is the highest-priority element,
not that elements are globally ordered. This commit updates the comment to
clarify that the removed element might not be the lowest priority. Ideally, we
would drop the lowest-priority element when exceeding the queue size, but the
heap doesn't make this easy.
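
As a standalone illustration of why: Go's container/heap only fixes the root, so the last element of the backing slice is merely low-ish, not the global minimum.

```go
package main

import (
	"container/heap"
	"fmt"
)

// maxHeap orders by descending priority, so the root is the highest priority.
type maxHeap []float64

func (h maxHeap) Len() int           { return len(h) }
func (h maxHeap) Less(i, j int) bool { return h[i] > h[j] }
func (h maxHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *maxHeap) Push(x any)        { *h = append(*h, x.(float64)) }
func (h *maxHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

func main() {
	h := &maxHeap{}
	for _, p := range []float64{5, 1, 9, 3, 7, 2} {
		heap.Push(h, p)
	}
	fmt.Println("root:", (*h)[0]) // 9: the root is guaranteed maximal.
	// The backing slice ends up as [9 7 5 1 3 2]: the last element (2) is
	// not the minimum (1), so evicting "the last element" on overflow can
	// drop the wrong item.
	fmt.Println("last:", (*h)[len(*h)-1])
}
```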
This commit adds logging for ranges dropped from the base queue due to exceeding
max size, improving observability. The log is gated behind V(1) to avoid
verbosity on nodes with many ranges.
This commit plumbs the enqueue-time priority into baseQueue.processReplica,
enabling comparison between the priority at enqueue time and at processing time.
For now, we pass -1, which signals that priority verification should be skipped,
in all cases except when processing replicas directly from the base queue.
No logic change has been made yet to check for priority inversion; future
commits will extend processReplica to validate that the processing-time priority
has not diverged significantly from the enqueue-time priority.
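
Sketched with hypothetical stub types (the real ones live in kvserver), the sentinel convention looks roughly like this:

```go
package queue

import "context"

// Hypothetical stand-ins for the real kvserver types.
type replica struct{}
type baseQueue struct{}

// skipPriorityVerification is the -1 sentinel described above: callers with
// no meaningful enqueue-time priority pass it to opt out of verification.
const skipPriorityVerification = -1.0

func (bq *baseQueue) processReplica(
	ctx context.Context, repl *replica, enqueuePriority float64,
) error {
	if enqueuePriority != skipPriorityVerification {
		// Placeholder: later commits recompute the priority here and
		// requeue the replica if it has inverted significantly.
		_ = enqueuePriority
	}
	// ... existing processing logic ...
	return nil
}
```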
Previously, a replicaItem’s priority was cleared when marked as processing, to
indicate it was no longer in the priority queue. This behavior made sense when
the purgatory queue did not track priorities. However, we now need to preserve
priorities for items in purgatory as well since they will be calling into
baseQueue.processReplica. This commit removes the priority reset in
replicaItem.SetProcessing(), ensuring that the enqueue-time priority is
retained when replicas are popped from the heap and passed into the purgatory
queue. No behavior change should result from this.
Previously, replica items in the purgatory queue did not retain their enqueue
time priority. This commit ensures that the priority is preserved so it can be
passed to baseQueue.processReplica when processing items from purgatory.
This commit adds an assertion to Allocator.ComputeAction to ensure that priority
is never -1 in cases where it shouldn’t be. Normally, ComputeAction returns
action.Priority(), but we sometimes adjust the priority for specific actions
like AllocatorAddVoter, AllocatorRemoveDeadVoter, and AllocatorRemoveVoter. A
priority of -1 is a special case reserved for processing logic to run even if
there’s a priority inversion. If the priority is not -1, the range may be
re-queued to be processed with the correct priority.
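
A hedged sketch of the invariant (the helper name is invented; the real assertion lives inside Allocator.ComputeAction):

```go
package allocatorimpl

import "github.com/cockroachdb/errors"

// assertComputedPriority sketches the invariant: -1 is reserved for callers
// that want processing to run even under a priority inversion, so it must
// never escape ComputeAction as a real priority.
func assertComputedPriority(priority float64) error {
	if priority == -1 {
		return errors.AssertionFailedf(
			"ComputeAction produced the reserved sentinel priority -1")
	}
	return nil
}
```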
This commit adds additional invariants to verify the correctness of priority
plumbing for range items in base queue tests.
This commit fixes an incorrect log statement in computeAction for priority
assertions. The log was mistakenly emitted even when the priority was not -1.

Related: cockroachdb#152512
Release note: none
Previously, we called bq.replicaCanBeProcessed with acquireLeaseIfNeeded = false
before invoking bq.processReplica, which itself calls replicaCanBeProcessed with
acquireLeaseIfNeeded = true. This looks incorrect and did not exist prior to
cockroachdb@c9cf068. It’s unclear how often lease renewal is actually going to be helpful
here, but I removed these two calls since they were newly introduced and seem
unintentional.

Informs: cockroachdb#151292
Release note: none
Previously, we had limited observability into when queues drop replicas due to
reaching their maximum size. This commit adds a metric to track and observe such
events.
Previously, the maximum base queue size was hardcoded to defaultQueueMaxSize
(10000). Since replica item structs are small, there’s little reason to enforce
a fixed limit. This commit makes the replicate queue size configurable via a
cluster setting, ReplicateQueueMaxSize, allowing incremental and
backport-friendly adjustments. Note that reducing the setting does not drop
replicas appropriately; future commits will address this behavior.
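
Registration would look something like the following; the setting key and description are assumptions, while the 10000 default comes from defaultQueueMaxSize:

```go
package kvserver

import "github.com/cockroachdb/cockroach/pkg/settings"

// ReplicateQueueMaxSize mirrors the setting described above. The key string
// here is an assumption for illustration, not the name used in the PR.
var ReplicateQueueMaxSize = settings.RegisterIntSetting(
	settings.SystemOnly,
	"kv.replicate_queue.max_size",
	"maximum number of replicas tracked by the replicate queue",
	10000, // previously the hardcoded defaultQueueMaxSize
)
```

Consumers would then read it via ReplicateQueueMaxSize.Get(&st.SV) wherever the queue currently references the constant.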
This commit adds tests to (1) verify metric updates when replicas are dropped
from the queue, and (2) ensure the cluster setting for ReplicateQueueMaxSize
works correctly.
Previously, the ReplicateQueueMaxSize cluster setting allowed dynamic adjustment
of the replicate queue’s maximum size. However, decreasing this setting did not
properly drop excess replicas. This commit fixes that by removing replicas when
the queue’s max size is lowered.
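
A sketch of how such trimming might hang off the setting's change hook. SetOnChange and Get are real settings APIs; the queue fields and eviction helper are invented, and ReplicateQueueMaxSize is the setting sketched earlier:

```go
package kvserver

import (
	"context"
	"sync"

	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

// baseQueue is a hypothetical shape for the real queue.
type baseQueue struct {
	mu struct {
		sync.Mutex
		priorityQ interface{ Len() int }
	}
	dropLockedForMaxSize func(context.Context)
}

// installMaxSizeHook trims the queue whenever the setting is lowered.
func installMaxSizeHook(st *cluster.Settings, bq *baseQueue) {
	ReplicateQueueMaxSize.SetOnChange(&st.SV, func(ctx context.Context) {
		newMax := int(ReplicateQueueMaxSize.Get(&st.SV))
		bq.mu.Lock()
		defer bq.mu.Unlock()
		for bq.mu.priorityQ.Len() > newMax {
			// Evict an entry; as an earlier commit notes, the heap makes it
			// awkward to guarantee this is the lowest-priority one.
			bq.dropLockedForMaxSize(ctx)
		}
	})
}
```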
This commit improves the clarity around the naming and description of the
metrics.
This commit adds a new cluster setting PriorityInversionRequeue that controls
whether the replicate queue should requeue replicas when their priority at
enqueue time differs significantly from their priority at processing time
(e.g. dropping from top 3 to the lowest priority).
Previously, a replica could enter the queue with high priority but, by the time
it was processed, the action planned for it might have a low priority, causing
us to perform low-priority work. Specifically, we are mostly worried about
cases where the priority changes from any of the repair actions to consider
rebalance. Rebalancing could take a long time and block other ranges that were
enqueued because they actually need repair. This commit ensures that such
replicas are requeued instead, avoiding priority inversions.
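
Sketched registration and guard; the key name and default are assumptions, and the predicate is deliberately simplified (the real check targets repair-to-rebalance transitions specifically):

```go
package kvserver

import (
	"github.com/cockroachdb/cockroach/pkg/settings"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

var PriorityInversionRequeue = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.replicate_queue.priority_inversion_requeue.enabled", // assumed key
	"requeue replicas whose priority dropped sharply between enqueue and processing",
	true, // assumed default
)

// shouldRequeueForInversion is a simplified, hypothetical predicate. The
// real logic specifically flags repair actions that degraded into a
// consider-rebalance action by processing time.
func shouldRequeueForInversion(
	st *cluster.Settings, enqueuePriority, processPriority float64,
) bool {
	if !PriorityInversionRequeue.Get(&st.SV) {
		return false
	}
	return enqueuePriority >= 0 && processPriority < enqueuePriority
}
```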
Previously, replicateQueue used V(2) to log info on priority-inverted replicas
because I wanted visibility into every case without missing any replicas. On
reflection, the individual cases aren’t that interesting; it’s the overall
volume that matters, which we can track through metrics. This commit rate
limits priority inversion logging to once every 3 seconds.
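
Assuming the rate limiter is util/log's Every helper (the 3s interval comes from the commit message), the change amounts to something like:

```go
package kvserver

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// Log priority inversions at most once every 3 seconds; individual cases
// are tracked by metrics, so sampled logs are enough.
var priorityInversionLogEvery = log.Every(3 * time.Second)

func maybeLogPriorityInversion(ctx context.Context, enq, proc float64) {
	if priorityInversionLogEvery.ShouldLog() {
		log.Infof(ctx,
			"priority inversion: enqueued at %.2f, computed %.2f at processing time",
			enq, proc)
	}
}
```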
This commit improves the comments for PriorityInversionRequeue and clarifies the
contracts around action.Priority().
This commit refactors CheckPriorityInversion.
This commit adds the TestAllocatorPriorityInvariance test, which acts as a
regression safeguard when new actions are added to AllocatorAction, ensuring the
contract is upheld. See action.Priority() and ComputeAction() for more details
on the contract.
…equeue

Previously, we introduced the PriorityInversionRequeue cluster setting, intended
for backport, to handle cases where a range was enqueued with a high-priority
repair action but, at processing time, a low-priority rebalance action was
computed. In such cases, the caller re-adds the range to the queue under its
updated priority. Although the cluster setting guards this requeue behavior, the
inversion check always ran unconditionally, reducing backport safety. This
commit updates the logic so that the cluster setting guards both the inversion
check and the requeue behavior.
Previously, we checked for priority inversion before planning errors, which
meant we could return requeue = true even when a planning error occurred. This
commit changes it so that planning errors take precedence over a priority
inversion. rq.processOneChange now returns early if there is a planning error
and only checks for priority inversion right before applying a change.
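
In outline, the reordered flow looks like this. The planner/applier names and struct shape are hypothetical, and the sketch reuses the shouldRequeueForInversion predicate from earlier:

```go
package kvserver

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

// Hypothetical stand-ins for the real types.
type replica struct{}
type plannedChange struct{}

type replicateQueue struct {
	st          *cluster.Settings
	planChange  func(context.Context, *replica) (plannedChange, float64, error)
	applyChange func(context.Context, plannedChange) error
}

func (rq *replicateQueue) processOneChange(
	ctx context.Context, repl *replica, enqueuePriority float64,
) (requeue bool, _ error) {
	change, processPriority, err := rq.planChange(ctx, repl)
	if err != nil {
		// Planning errors take precedence: never report requeue = true here.
		return false, err
	}
	// Consult the inversion check only right before applying a change.
	if shouldRequeueForInversion(rq.st, enqueuePriority, processPriority) {
		return true, nil
	}
	return false, rq.applyChange(ctx, change)
}
```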
Previously, we checked for requeue right before returning for both nil and
non-nil errors, making the code harder to follow. This commit refactors
replicateQueue.process to requeue replicas before checking for errors.
maybeBackpressureBatch registers a callback with the split queue for replicas
that are too large relative to their split size. This backpressures the range to
stop it from growing and prevent new writes from blocking a pending split. The
callback is invoked when the split queue finishes processing the replica.

Previously, the error channel used in the callback had a size of 1 and performed
blocking sends. This was safe because the base queue only sent a single error,
and by the time maybeBackpressureBatch returned, the callback was guaranteed to
have been consumed, and no additional sends would occur.

Future commits will allow the callback to be invoked multiple times (although it
should fire at most twice). To be safe and to avoid potential deadlocks from
multiple sends after maybeBackpressureBatch has already returned, this commit
makes the error send non-blocking. If the channel is already full, the error is
dropped, which is acceptable since we only care about observing the completion
of the replica processing at least once.
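
The non-blocking send pattern in isolation (a self-contained illustration; none of these names come from the PR):

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	errCh := make(chan error, 1)
	send := func(err error) {
		select {
		case errCh <- err:
		default:
			// Channel already full: drop instead of blocking. We only need
			// to observe completion at least once.
		}
	}
	send(errors.New("first"))  // fills the buffer
	send(errors.New("second")) // dropped without blocking
	fmt.Println(<-errCh)       // prints "first"
}
```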
baseQueue.Async may return immediately as a noop if the semaphore does not have
available capacity and the wait parameter is false. Previously, this case
returned no error, leaving the caller unaware that the request was dropped. This
commit changes the behavior to return a baseQueueAsyncRateLimited error,
allowing callers to detect and handle the condition.
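
A self-contained sketch of the new contract using a channel-based semaphore; the error name approximates the one in the commit, and the real implementation differs:

```go
package main

import (
	"errors"
	"fmt"
)

var errBaseQueueAsyncRateLimited = errors.New("base queue async rate limited")

// tryAsync sketches the contract: with wait=false and no semaphore capacity,
// return an error instead of silently dropping the work.
func tryAsync(sem chan struct{}, wait bool, task func()) error {
	if wait {
		sem <- struct{}{} // block until capacity frees up
	} else {
		select {
		case sem <- struct{}{}:
		default:
			return errBaseQueueAsyncRateLimited
		}
	}
	go func() {
		defer func() { <-sem }()
		task()
	}()
	return nil
}

func main() {
	sem := make(chan struct{}, 1)
	_ = tryAsync(sem, false, func() { select {} }) // occupies the only slot
	if err := tryAsync(sem, false, func() {}); err != nil {
		fmt.Println(err) // base queue async rate limited
	}
}
```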
The base queue already supports registering callbacks that are invoked with the
processing result of replicas once they are processed. However, replicas may
fail before reaching that stage (e.g., failing to enqueue or being dropped
early). This commit extends the mechanism to also report enqueue results,
allowing callers to detect failures earlier. Currently, only
decommissioningNudger.maybeEnqueueProblemRange uses this.

Note that one behavior change is introduced: previously, a registered callback
would fire only once with the processing result and not again if the replica was
later processed by the purgatory queue. With this change, the callback may now
be invoked twice.
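
A sketch of the extended registry (all names hypothetical); per the commit's caveat, a single registration can now fire once with an enqueue result and again with a processing result:

```go
package queue

import "sync"

// resultCallbacks sketches the extended mechanism: callbacks now receive
// enqueue results as well as processing results, so one registration may
// fire more than once (e.g. once from the main queue, once from purgatory).
type resultCallbacks struct {
	mu  sync.Mutex
	fns []func(enqueued bool, err error)
}

func (c *resultCallbacks) register(fn func(enqueued bool, err error)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.fns = append(c.fns, fn)
}

// notify reports either an enqueue result or a processing result to every
// registered callback.
func (c *resultCallbacks) notify(enqueued bool, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, fn := range c.fns {
		fn(enqueued, err)
	}
}
```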
This commit adds TestBaseQueueCallbackOnEnqueueResult and
TestBaseQueueCallbackOnProcessResult to verify that callbacks are correctly
invoked with both enqueue and process results.
Previously, we updated bq.enqueueAdd inside the defer statement of addInternal.
This was incorrect because we could return queued = true for a replica that was
already processing and marked for requeue. That replica would later be requeued
in finishProcessingReplica, incrementing the metric again and leading to double
counting.
…ueDecommissionScannerDisabled

This commit extends TestBaseQueueCallback* and
TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.
Previously, replicas could be enqueued at a high priority but end up processing
a lower-priority action, causing priority inversion and unfairness to other
replicas behind them that need a repair action. This commit adds metrics to
track such cases. In addition, it adds metrics to track when replicas are
requeued in the replicate queue due to a priority inversion from a repair
action to a rebalance action.
Previously, we added a priority inversion requeuing mechanism.

This commit adds a unit test that forces the race condition we suspected to be
happening in escalations involving priority inversion, and asserts that the
inversion occurs and that the replica is correctly requeued. Test setup:

1. range’s leaseholder replica is rebalanced from one store to another.
2. new leaseholder enqueues the replica for repair with high priority (e.g. to
finalize the atomic replication change or remove a learner replica)
3. before processing, the old leaseholder completes the change (exits the joint
   config or removes the learner).
4. when the new leaseholder processes the replica, it computes a
   ConsiderRebalance action, resulting in a priority inversion and potentially
   blocking other high-priority work.
This commit removes per-action priority inversion metrics due to their high
cardinality. We already have logging in place, which should provide sufficient
observability. For now, what we care about most is priority inversion that goes
from a repair action to a consider-rebalance action and triggers a requeue.

blathers-crl bot commented Sep 5, 2025

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes. Includes test-only changes, build system changes, etc.
  • Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl bot added backport Label PR's that are backports to older release branches T-kv KV Team labels Sep 5, 2025

blathers-crl bot commented Sep 5, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable


blathers-crl bot commented Sep 5, 2025

❌ PR #153062 does not comply with backport policy

Confidence: high
Explanation: The pull request modifies production code files in the 'pkg/kv/kvserver/' directory among others, which indicates changes that impact the production behavior of CockroachDB. The PR includes extensive modifications across a variety of files, which involve changes in the queue systems, metrics enhancements, and general behavior adjustments of the replicate queue among others. Despite a 'Release justification: critical fix (potential) and observability improvement for decommission stall' provided, the details given in the justification do not conclusively establish the presence of a critical bug as defined in the backport policies. Observability improvements alone, while valuable, do not meet the criteria required for backporting under critical bug fixes unless they directly contribute to fixing or diagnosing a critical bug. Additionally, there is no mention of feature flag implementation in the description or the changed files, which is necessary for non-critical changes.
Recommendation: Reconsider the backport given the nature of changes, or provide additional justification for why the changes address a critical bug. If non-critical, ensure changes are gated behind a feature flag.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 marked this pull request as ready for review September 5, 2025 18:35
@wenyihu6 wenyihu6 requested a review from a team as a code owner September 5, 2025 18:35
@wenyihu6 wenyihu6 requested review from arulajmani and tbg September 5, 2025 18:35
Contributor Author

wenyihu6 commented Sep 5, 2025

Putting this here as a reminder to myself: we will need to backport #153008 as well.

Contributor Author

wenyihu6 commented Sep 5, 2025

TFTR!

@wenyihu6 wenyihu6 merged commit b8345b7 into cockroachdb:release-24.3.20-rc Sep 5, 2025
15 of 16 checks passed
@wenyihu6 wenyihu6 deleted the backportrelease-24.3.20-rc-151898-152508-152512-152675-152507-152699-152596-152792-152885-152787-152697 branch September 24, 2025 01:28
