
Conversation

Contributor

@wenyihu6 wenyihu6 commented Sep 5, 2025

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: potential critical fix and observability improvement for decommission stall

dodeca12 and others added 30 commits September 5, 2025 14:06
Previously, the decommissioning nudger had limited observability, making it
difficult to monitor its effectiveness and diagnose issues during node
decommissioning operations.

This was inadequate because operators couldn't track how many ranges were
being enqueued for decommissioning, nor could they see when the nudger
skipped ranges due to leaseholder status or invalid leases.

To address this, this patch adds comprehensive logging and metrics:
- Logs when the decommissioning nudger enqueues replicas with priority info
- Tracks successful enqueues via DecommissioningNudgerEnqueueEnqueued metric
- Tracks skipped enqueues via
DecommissioningNudgerNotLeaseholderOrInvalidLease metric
- Adds structured logging for debugging nudger behavior
- Includes comprehensive test coverage for the new metrics

TODO:
- Figure out a better way to track decommissioning enqueue failures.
	Currently it's hard to get as close as possible to the source of the
	enqueue failures for logging & metrics purposes; this would require
	rearchitecting the code pathways to ensure we log and track failures
	as close as possible to where they occur.

Fixes: CRDB-51396
Release note: None
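
For illustration, a minimal sketch of the gating and metric updates described above. The two counter names echo the commit message; everything else (the replica interface, the nudger struct, the enqueue hook) is hypothetical:

```go
package nudger

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/metric"
)

// problemReplica is a hypothetical stand-in for the real replica surface.
type problemReplica interface {
	RangeID() int64
	OwnsValidLease(ctx context.Context) bool
	DecommissionPriority() float64
}

type nudgerMetrics struct {
	Enqueued                     *metric.Counter
	NotLeaseholderOrInvalidLease *metric.Counter
}

type decommissioningNudger struct {
	metrics nudgerMetrics
	enqueue func(ctx context.Context, r problemReplica, priority float64)
}

func (n *decommissioningNudger) maybeEnqueueProblemRange(
	ctx context.Context, r problemReplica,
) {
	// Skip (and count) replicas we do not hold a valid lease for.
	if !r.OwnsValidLease(ctx) {
		n.metrics.NotLeaseholderOrInvalidLease.Inc(1)
		return
	}
	priority := r.DecommissionPriority()
	log.Infof(ctx, "decommissioning nudger enqueueing r%d at priority %.2f",
		r.RangeID(), priority)
	n.enqueue(ctx, r, priority)
	n.metrics.Enqueued.Inc(1)
}
```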
This commit refactors isDecommissionAction into allocatorimpl for consistency
with other similar helpers like allocatorActions.{Add,Replace,Remove}. This
change has no behavior changes; it only makes future commits easier.
This commit simplifies the logging in `maybeEnqueueProblemRange` to log two
booleans directly.
Previously, the comment on the queue incorrectly stated that it removes the
lowest-priority element when exceeding its maximum size. This was misleading
because a heap only guarantees that the root is the highest-priority element,
not that elements are globally ordered. This commit updates the comment to
clarify that the removed element might not be the lowest priority. Ideally, we
would drop the lowest-priority element when exceeding the queue size, but the
heap doesn't make this easy.
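
As a standalone illustration of why: Go's container/heap only fixes the root, so the last element of the backing slice is merely low-ish, not the global minimum.

```go
package main

import (
	"container/heap"
	"fmt"
)

// maxHeap orders by descending priority, so the root is the highest priority.
type maxHeap []float64

func (h maxHeap) Len() int           { return len(h) }
func (h maxHeap) Less(i, j int) bool { return h[i] > h[j] }
func (h maxHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *maxHeap) Push(x any)        { *h = append(*h, x.(float64)) }
func (h *maxHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

func main() {
	h := &maxHeap{}
	for _, p := range []float64{5, 1, 9, 3, 7, 2} {
		heap.Push(h, p)
	}
	fmt.Println("root:", (*h)[0]) // 9: the root is guaranteed maximal.
	// The backing slice ends up as [9 7 5 1 3 2]: the last element (2) is
	// not the minimum (1), so evicting "the last element" on overflow can
	// drop the wrong item.
	fmt.Println("last:", (*h)[len(*h)-1])
}
```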
This commit adds logging for ranges dropped from the base queue due to exceeding
max size, improving observability. The log is gated behind V(1) to avoid
verbosity on nodes with many ranges.
This commit plumbs the enqueue-time priority into baseQueue.processReplica,
enabling comparison between the priority at enqueue time and at processing time.
For now, we pass -1, which signals that priority verification should be skipped,
in all cases except when processing replicas directly from the base queue.
No logic change has been made yet to check for priority inversion; future
commits will extend processReplica to validate that the processing-time priority
has not diverged significantly from the enqueue-time priority.
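
Sketched with hypothetical stub types (the real ones live in kvserver), the sentinel convention looks roughly like this:

```go
package queue

import "context"

// Hypothetical stand-ins for the real kvserver types.
type replica struct{}
type baseQueue struct{}

// skipPriorityVerification is the -1 sentinel described above: callers with
// no meaningful enqueue-time priority pass it to opt out of verification.
const skipPriorityVerification = -1.0

func (bq *baseQueue) processReplica(
	ctx context.Context, repl *replica, enqueuePriority float64,
) error {
	if enqueuePriority != skipPriorityVerification {
		// Placeholder: later commits recompute the priority here and
		// requeue the replica if it has inverted significantly.
		_ = enqueuePriority
	}
	// ... existing processing logic ...
	return nil
}
```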
Previously, a replicaItem’s priority was cleared when marked as processing, to
indicate it was no longer in the priority queue. This behavior made sense when
the purgatory queue did not track priorities. However, we now need to preserve
priorities for items in purgatory as well since they will be calling into
baseQueue.processReplica. This commit removes the priority reset in
replicaItem.SetProcessing(), ensuring that the enqueue-time priority is
retained when replicas are popped from the heap and passed into the purgatory
queue. No behavior change should result from this.
Previously, replica items in the purgatory queue did not retain their enqueue
time priority. This commit ensures that the priority is preserved so it can be
passed to baseQueue.processReplica when processing items from purgatory.
This commit adds an assertion to Allocator.ComputeAction to ensure that priority
is never -1 in cases where it shouldn’t be. Normally, ComputeAction returns
action.Priority(), but we sometimes adjust the priority for specific actions
like AllocatorAddVoter, AllocatorRemoveDeadVoter, and AllocatorRemoveVoter. A
priority of -1 is a special case reserved for processing logic to run even if
there’s a priority inversion. If the priority is not -1, the range may be
re-queued to be processed with the correct priority.
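
A hedged sketch of the invariant (the helper name is invented; the real assertion lives inside Allocator.ComputeAction):

```go
package allocatorimpl

import "github.com/cockroachdb/errors"

// assertComputedPriority sketches the invariant: -1 is reserved for callers
// that want processing to run even under a priority inversion, so it must
// never escape ComputeAction as a real priority.
func assertComputedPriority(priority float64) error {
	if priority == -1 {
		return errors.AssertionFailedf(
			"ComputeAction produced the reserved sentinel priority -1")
	}
	return nil
}
```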
This commit adds additional invariants to verify the correctness of priority
plumbing for range items in base queue tests.
This commit fixes an incorrect log statement in computeAction for priority
assertions. The log was mistakenly emitted even when the priority was not -1.

Related: cockroachdb#152512
Release note: none
Previously, we called bq.replicaCanBeProcessed with acquireLeaseIfNeeded = false
before invoking bq.processReplica, which itself calls replicaCanBeProcessed with
acquireLeaseIfNeeded = true. This looks incorrect and did not exist prior to
cockroachdb@c9cf068. It’s unclear how often lease renewal is actually going to be helpful
here, but I removed these two calls since they were newly introduced and seem
unintentional.

Informs: cockroachdb#151292
Release note: none
Previously, we had limited observability into when queues drop replicas due to
reaching their maximum size. This commit adds a metric to track and observe such
events.
Previously, the maximum base queue size was hardcoded to defaultQueueMaxSize
(10000). Since replica item structs are small, there’s little reason to enforce
a fixed limit. This commit makes the replicate queue size configurable via a
cluster setting, ReplicateQueueMaxSize, allowing incremental and
backport-friendly adjustments. Note that reducing the setting does not drop
replicas appropriately; future commits will address this behavior.
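
Registration would look something like the following; the setting key and description are assumptions, while the 10000 default comes from defaultQueueMaxSize:

```go
package kvserver

import "github.com/cockroachdb/cockroach/pkg/settings"

// ReplicateQueueMaxSize mirrors the setting described above. The key string
// here is an assumption for illustration, not the name used in the PR.
var ReplicateQueueMaxSize = settings.RegisterIntSetting(
	settings.SystemOnly,
	"kv.replicate_queue.max_size",
	"maximum number of replicas tracked by the replicate queue",
	10000, // previously the hardcoded defaultQueueMaxSize
)
```

Consumers would then read it via ReplicateQueueMaxSize.Get(&st.SV) wherever the queue currently references the constant.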
This commit adds tests to (1) verify metric updates when replicas are dropped
from the queue, and (2) ensure the cluster setting for ReplicateQueueMaxSize
works correctly.
Previously, the ReplicateQueueMaxSize cluster setting allowed dynamic adjustment
of the replicate queue’s maximum size. However, decreasing this setting did not
properly drop excess replicas. This commit fixes that by removing replicas when
the queue’s max size is lowered.
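
A sketch of how such trimming might hang off the setting's change hook. SetOnChange and Get are real settings APIs; the queue fields and eviction helper are invented, and ReplicateQueueMaxSize is the setting sketched earlier:

```go
package kvserver

import (
	"context"
	"sync"

	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

// baseQueue is a hypothetical shape for the real queue.
type baseQueue struct {
	mu struct {
		sync.Mutex
		priorityQ interface{ Len() int }
	}
	dropLockedForMaxSize func(context.Context)
}

// installMaxSizeHook trims the queue whenever the setting is lowered.
func installMaxSizeHook(st *cluster.Settings, bq *baseQueue) {
	ReplicateQueueMaxSize.SetOnChange(&st.SV, func(ctx context.Context) {
		newMax := int(ReplicateQueueMaxSize.Get(&st.SV))
		bq.mu.Lock()
		defer bq.mu.Unlock()
		for bq.mu.priorityQ.Len() > newMax {
			// Evict an entry; as an earlier commit notes, the heap makes it
			// awkward to guarantee this is the lowest-priority one.
			bq.dropLockedForMaxSize(ctx)
		}
	})
}
```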
This commit improves the clarity around the naming and description of the
metrics.
This commit adds a new cluster setting PriorityInversionRequeue that controls
whether the replicate queue should requeue replicas when their priority at
enqueue time differs significantly from their priority at processing time
(e.g. dropping from top 3 to the lowest priority).
Previously, a replica could enter the queue with high priority but, by the time
it was processed, the action planned for it might have a low priority, causing
us to perform low-priority work. Specifically, we are mostly worried about
cases where the priority changes from any of the repair actions to consider
rebalance. Rebalancing could take a long time and block other ranges that were
enqueued because they actually need repair. This commit ensures that such
replicas are requeued instead, avoiding priority inversions.
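
Sketched registration and guard; the key name and default are assumptions, and the predicate is deliberately simplified (the real check targets repair-to-rebalance transitions specifically):

```go
package kvserver

import (
	"github.com/cockroachdb/cockroach/pkg/settings"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

var PriorityInversionRequeue = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.replicate_queue.priority_inversion_requeue.enabled", // assumed key
	"requeue replicas whose priority dropped sharply between enqueue and processing",
	true, // assumed default
)

// shouldRequeueForInversion is a simplified, hypothetical predicate. The
// real logic specifically flags repair actions that degraded into a
// consider-rebalance action by processing time.
func shouldRequeueForInversion(
	st *cluster.Settings, enqueuePriority, processPriority float64,
) bool {
	if !PriorityInversionRequeue.Get(&st.SV) {
		return false
	}
	return enqueuePriority >= 0 && processPriority < enqueuePriority
}
```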
Previously, replicateQueue used V(2) to log info on priority-inverted replicas
because I wanted visibility into every case without missing any replicas. On
reflection, the individual cases aren’t that interesting; it’s the overall
volume that matters, which we can track through metrics. This commit rate
limits priority inversion logging to once every 3 seconds.
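
Assuming the rate limiter is util/log's Every helper (the 3s interval comes from the commit message), the change amounts to something like:

```go
package kvserver

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// Log priority inversions at most once every 3 seconds; individual cases
// are tracked by metrics, so sampled logs are enough.
var priorityInversionLogEvery = log.Every(3 * time.Second)

func maybeLogPriorityInversion(ctx context.Context, enq, proc float64) {
	if priorityInversionLogEvery.ShouldLog() {
		log.Infof(ctx,
			"priority inversion: enqueued at %.2f, computed %.2f at processing time",
			enq, proc)
	}
}
```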
This commit improves the comments for PriorityInversionRequeue and clarifies the
contracts around action.Priority().
This commit refactors CheckPriorityInversion.
This commit adds the TestAllocatorPriorityInvariance test, which acts as a
regression safeguard when new actions are added to AllocatorAction, ensuring the
contract is upheld. See action.Priority() and ComputeAction() for more details
on the contract.
…equeue

Previously, we introduced the PriorityInversionRequeue cluster setting, intended
for backport, to handle cases where a range was enqueued with a high-priority
repair action but, at processing time, a low-priority rebalance action was
computed. In such cases, the caller re-adds the range to the queue under its
updated priority. Although the cluster setting guards this requeue behavior, the
inversion check always ran unconditionally, reducing backport safety. This
commit updates the logic so that the cluster setting guards both the inversion
check and the requeue behavior.
Previously, we checked for priority inversion before planning errors, which
meant we could return requeue = true even when a planning error occurred. This
commit changes it so that planning errors take precedence over a priority
inversion. rq.processOneChange now returns early if there is a planning error
and only checks for priority inversion right before applying a change.
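
In outline, the reordered flow looks like this. The planner/applier names and struct shape are hypothetical, and the sketch reuses the shouldRequeueForInversion predicate from earlier:

```go
package kvserver

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

// Hypothetical stand-ins for the real types.
type replica struct{}
type plannedChange struct{}

type replicateQueue struct {
	st          *cluster.Settings
	planChange  func(context.Context, *replica) (plannedChange, float64, error)
	applyChange func(context.Context, plannedChange) error
}

func (rq *replicateQueue) processOneChange(
	ctx context.Context, repl *replica, enqueuePriority float64,
) (requeue bool, _ error) {
	change, processPriority, err := rq.planChange(ctx, repl)
	if err != nil {
		// Planning errors take precedence: never report requeue = true here.
		return false, err
	}
	// Consult the inversion check only right before applying a change.
	if shouldRequeueForInversion(rq.st, enqueuePriority, processPriority) {
		return true, nil
	}
	return false, rq.applyChange(ctx, change)
}
```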
Previously, we checked for requeue right before returning for both nil and
non-nil errors, making the code harder to follow. This commit refactors
replicateQueue.process to requeue replicas before checking for errors.
maybeBackpressureBatch registers a callback with the split queue for replicas
that are too large relative to their split size. This backpressures the range to
stop it from growing and prevent new writes from blocking a pending split. The
callback is invoked when the split queue finishes processing the replica.

Previously, the error channel used in the callback had a size of 1 and performed
blocking sends. This was safe because the base queue only sent a single error,
and by the time maybeBackpressureBatch returned, the callback was guaranteed to
have been consumed, and no additional sends would occur.

Future commits will allow the callback to be invoked multiple times (although it
should fire at most twice). To be safe and to avoid potential deadlocks from
multiple sends after maybeBackpressureBatch has already returned, this commit
makes the error send non-blocking. If the channel is already full, the error is
dropped, which is acceptable since we only care about observing the completion
of the replica processing at least once.
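
The non-blocking send pattern in isolation (a self-contained illustration; none of these names come from the PR):

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	errCh := make(chan error, 1)
	send := func(err error) {
		select {
		case errCh <- err:
		default:
			// Channel already full: drop instead of blocking. We only need
			// to observe completion at least once.
		}
	}
	send(errors.New("first"))  // fills the buffer
	send(errors.New("second")) // dropped without blocking
	fmt.Println(<-errCh)       // prints "first"
}
```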
baseQueue.Async may return immediately as a noop if the semaphore does not have
available capacity and the wait parameter is false. Previously, this case
returned no error, leaving the caller unaware that the request was dropped. This
commit changes the behavior to return a baseQueueAsyncRateLimited error,
allowing callers to detect and handle the condition.
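
A self-contained sketch of the new contract using a channel-based semaphore; the error name approximates the one in the commit, and the real implementation differs:

```go
package main

import (
	"errors"
	"fmt"
)

var errBaseQueueAsyncRateLimited = errors.New("base queue async rate limited")

// tryAsync sketches the contract: with wait=false and no semaphore capacity,
// return an error instead of silently dropping the work.
func tryAsync(sem chan struct{}, wait bool, task func()) error {
	if wait {
		sem <- struct{}{} // block until capacity frees up
	} else {
		select {
		case sem <- struct{}{}:
		default:
			return errBaseQueueAsyncRateLimited
		}
	}
	go func() {
		defer func() { <-sem }()
		task()
	}()
	return nil
}

func main() {
	sem := make(chan struct{}, 1)
	_ = tryAsync(sem, false, func() { select {} }) // occupies the only slot
	if err := tryAsync(sem, false, func() {}); err != nil {
		fmt.Println(err) // base queue async rate limited
	}
}
```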
The base queue already supports registering callbacks that are invoked with the
processing result of replicas once they are processed. However, replicas may
fail before reaching that stage (e.g., failing to enqueue or being dropped
early). This commit extends the mechanism to also report enqueue results,
allowing callers to detect failures earlier. Currently, only
decommissioningNudger.maybeEnqueueProblemRange uses this.

Note that one behavior change is introduced: previously, a registered callback
would fire only once with the processing result and not again if the replica was
later processed by the purgatory queue. With this change, the callback may now
be invoked twice.
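
A sketch of the extended registry (all names hypothetical); per the commit's caveat, a single registration can now fire once with an enqueue result and again with a processing result:

```go
package queue

import "sync"

// resultCallbacks sketches the extended mechanism: callbacks now receive
// enqueue results as well as processing results, so one registration may
// fire more than once (e.g. once from the main queue, once from purgatory).
type resultCallbacks struct {
	mu  sync.Mutex
	fns []func(enqueued bool, err error)
}

func (c *resultCallbacks) register(fn func(enqueued bool, err error)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.fns = append(c.fns, fn)
}

// notify reports either an enqueue result or a processing result to every
// registered callback.
func (c *resultCallbacks) notify(enqueued bool, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, fn := range c.fns {
		fn(enqueued, err)
	}
}
```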
This commit adds TestBaseQueueCallbackOnEnqueueResult and
TestBaseQueueCallbackOnProcessResult to verify that callbacks are correctly
invoked with both enqueue and process results.
Previously, we updated bq.enqueueAdd inside the defer statement of addInternal.
This was incorrect because we could return queued = true for a replica that was
already processing and marked for requeue. That replica would later be requeued
in finishProcessingReplica, incrementing the metric again and leading to double
counting.
…ueDecommissionScannerDisabled

This commit extends TestBaseQueueCallback* and
TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.
Previously, replicas could be enqueued at a high priority but end up processing
a lower-priority action, causing priority inversion and unfairness to other
replicas behind them that need a repair action. This commit adds metrics to
track such cases. In addition, it adds metrics to track when replicas are
requeued in the replicate queue due to a priority inversion from a repair
action to a rebalance action.
Previously, we added a priority inversion requeuing mechanism.

This commit adds a unit test that forces the race condition we suspected to be
happening in escalations involving priority inversion, and asserts that the
inversion occurs and that the replica is correctly requeued. Test setup:

1. range’s leaseholder replica is rebalanced from one store to another.
2. new leaseholder enqueues the replica for repair with high priority (e.g. to
finalize the atomic replication change or remove a learner replica)
3. before processing, the old leaseholder completes the change (exits the joint
   config or removes the learner).
4. when the new leaseholder processes the replica, it computes a
   ConsiderRebalance action, resulting in a priority inversion and potentially
   blocking other high-priority work.
This commit removes per-action priority inversion metrics due to their high
cardinality. We already have logging in place, which should provide sufficient
observability. For now, what we care about most is priority inversion that goes
from a repair action to a consider-rebalance action and triggers a requeue.

blathers-crl bot commented Sep 5, 2025

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes. Includes test-only changes, build system changes, etc.
  • Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl bot added backport Label PR's that are backports to older release branches T-kv KV Team labels Sep 5, 2025

blathers-crl bot commented Sep 5, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable


blathers-crl bot commented Sep 5, 2025

❌ PR #153062 does not comply with backport policy

Confidence: high
Explanation: The pull request modifies production code files in the 'pkg/kv/kvserver/' directory among others, which indicates changes that impact the production behavior of CockroachDB. The PR includes extensive modifications across a variety of files, which involve changes in the queue systems, metrics enhancements, and general behavior adjustments of the replicate queue among others. Despite a 'Release justification: critical fix (potential) and observability improvement for decommission stall' provided, the details given in the justification do not conclusively establish the presence of a critical bug as defined in the backport policies. Observability improvements alone, while valuable, do not meet the criteria required for backporting under critical bug fixes unless they directly contribute to fixing or diagnosing a critical bug. Additionally, there is no mention of feature flag implementation in the description or the changed files, which is necessary for non-critical changes.
Recommendation: Reconsider the backport given the nature of changes, or provide additional justification for why the changes address a critical bug. If non-critical, ensure changes are gated behind a feature flag.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 marked this pull request as ready for review September 5, 2025 18:35
@wenyihu6 wenyihu6 requested a review from a team as a code owner September 5, 2025 18:35
@wenyihu6 wenyihu6 requested review from arulajmani and tbg September 5, 2025 18:35
Contributor Author

wenyihu6 commented Sep 5, 2025

Putting this here as a reminder to myself: we will need to backport #153008 as well.

Contributor Author

wenyihu6 commented Sep 5, 2025

TFTR!

@wenyihu6 wenyihu6 merged commit b8345b7 into cockroachdb:release-24.3.20-rc Sep 5, 2025
15 of 16 checks passed
@wenyihu6 wenyihu6 deleted the backportrelease-24.3.20-rc-151898-152508-152512-152675-152507-152699-152596-152792-152885-152787-152697 branch September 24, 2025 01:28
