release-24.3.20-rc: kvserver: requeue on priority inversion for replicate queue #153062
Previously, the decommissioning nudger had limited observability, making it difficult to monitor its effectiveness and diagnose issues during node decommissioning. Operators could not track how many ranges were being enqueued for decommissioning, nor see when the nudger skipped ranges due to leaseholder status or invalid leases. To address this, this patch adds logging and metrics:

- Logs when the decommissioning nudger enqueues replicas, including priority info
- Tracks successful enqueues via the DecommissioningNudgerEnqueueEnqueued metric
- Tracks skipped enqueues via the DecommissioningNudgerNotLeaseholderOrInvalidLease metric
- Adds structured logging for debugging nudger behavior
- Includes test coverage for the new metrics

TODO:
- Figure out a better way to track decommissioning enqueue failures. Currently it is hard to get close to the source of the enqueue failures for logging and metrics purposes; doing so would require restructuring the code paths so that failures are logged and tracked as close as possible to where they occur.

Fixes: CRDB-51396

Release note: None
This commit refactors isDecommissionAction into allocatorimpl for consistency with other similar helpers like allocatorActions.{Add,Replace,Remove}. There are no behavior changes; this only makes future commits easier.
This commit simplifies the logging in `maybeEnqueueProblemRange` to log two booleans directly.
Previously, the comment on the queue incorrectly stated that it removes the lowest-priority element when exceeding its maximum size. This was misleading because a heap only guarantees that the root is the highest priority, not that elements are globally ordered. This commit updates the comment to clarify that the removed element might not be the lowest priority. Ideally, we would drop the lowest-priority element when exceeding the queue size, but the heap interface doesn't make this easy.
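To make the heap property concrete, here is a minimal, self-contained Go sketch (illustrative only, not the actual queue code) showing that with container/heap only the root is ordered; the element at the tail of the backing slice, which a naive size cap would evict, is not necessarily the lowest priority:

```go
package main

import (
	"container/heap"
	"fmt"
)

// priorityHeap is a max-heap of priorities, loosely mirroring how a queue
// might order replica items. This is an illustrative sketch, not queue code.
type priorityHeap []float64

func (h priorityHeap) Len() int           { return len(h) }
func (h priorityHeap) Less(i, j int) bool { return h[i] > h[j] } // max-heap
func (h priorityHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *priorityHeap) Push(x any)        { *h = append(*h, x.(float64)) }
func (h *priorityHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

func main() {
	h := &priorityHeap{}
	for _, p := range []float64{5, 1, 9, 3, 7, 2, 8} {
		heap.Push(h, p)
	}
	// Only the root (index 0) is guaranteed to be the highest priority.
	// The last slice element, which a naive "drop the tail" eviction would
	// remove, is not guaranteed to be the lowest priority (here it is 5,
	// while the lowest is 1).
	fmt.Println("root (highest):", (*h)[0])
	fmt.Println("tail (evicted by a naive cap):", (*h)[h.Len()-1])
}
```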
This commit adds logging for ranges dropped from the base queue due to exceeding max size, improving observability. The log is gated behind V(1) to avoid verbosity on nodes with many ranges.
This commit plumbs the enqueue-time priority into baseQueue.processReplica, enabling comparison between the priority at enqueue time and at processing time. For now, we pass -1 in all cases except when processing replicas directly from the base queue; a value of -1 signals that priority verification should be skipped. No logic change has been made yet to check for priority inversion; future commits will extend processReplica to validate that the processing priority does not differ significantly from the enqueue-time priority.
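A rough, self-contained sketch of the plumbing described above, using hypothetical names rather than the real baseQueue signatures; the only point it illustrates is the -1 sentinel that disables priority verification:

```go
package main

import (
	"context"
	"fmt"
)

// All names below are illustrative; they are not the actual baseQueue types.

// skipPriorityCheck (-1) signals that priority verification should be skipped.
const skipPriorityCheck = -1.0

type replica struct{ rangeID int }

type queue struct{}

// processReplica receives the priority recorded at enqueue time so that a
// later change can compare it against the priority computed at processing
// time. For now the value is only threaded through; no inversion check runs.
func (q *queue) processReplica(ctx context.Context, r replica, enqueuePriority float64) error {
	if enqueuePriority == skipPriorityCheck {
		fmt.Printf("r%d: priority verification skipped\n", r.rangeID)
	} else {
		fmt.Printf("r%d: enqueued at priority %.2f (no check yet)\n", r.rangeID, enqueuePriority)
	}
	return nil
}

func main() {
	q := &queue{}
	_ = q.processReplica(context.Background(), replica{rangeID: 7}, 1000.0)
	_ = q.processReplica(context.Background(), replica{rangeID: 8}, skipPriorityCheck)
}
```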
Previously, a replicaItem’s priority was cleared when it was marked as processing, to indicate it was no longer in the priority queue. This behavior made sense when the purgatory queue did not track priorities. However, we now need to preserve priorities for items in purgatory as well, since they will be calling into baseQueue.processReplica. This commit removes the priority reset in replicaItem.SetProcessing(), ensuring that the enqueue-time priority is properly retained when replicas are popped from the heap and passed into the purgatory queue. No behavior change should result from this.
Previously, replica items in the purgatory queue did not retain their enqueue time priority. This commit ensures that the priority is preserved so it can be passed to baseQueue.processReplica when processing items from purgatory.
This commit adds an assertion to Allocator.ComputeAction to ensure that priority is never -1 in cases where it shouldn’t be. Normally, ComputeAction returns action.Priority(), but we sometimes adjust the priority for specific actions like AllocatorAddVoter, AllocatorRemoveDeadVoter, and AllocatorRemoveVoter. A priority of -1 is a special case reserved for processing logic to run even if there’s a priority inversion. If the priority is not -1, the range may be re-queued to be processed with the correct priority.
This commit adds additional invariants to verify the correctness of priority plumbing for range items in base queue tests.
This commit fixes an incorrect log statement in computeAction for priority assertions. The log was mistakenly emitted even when the priority was not -1. Related: cockroachdb#152512 Release note: none
Previously, we called bq.replicaCanBeProcessed with acquireLeaseIfNeeded = false before invoking bq.processReplica, which itself calls replicaCanBeProcessed with acquireLeaseIfNeeded = true. This looks incorrect and did not exist prior to cockroachdb@c9cf068. It’s unclear how often lease renewal is actually going to be helpful here, but I removed these two calls since they were newly introduced and seem unintentional. Informs: cockroachdb#151292 Release note: none
Previously, we had limited observability into when queues drop replicas due to reaching their maximum size. This commit adds a metric to track and observe such events.
Previously, the maximum base queue size was hardcoded to defaultQueueMaxSize (10000). Since replica item structs are small, there’s little reason to enforce a fixed limit. This commit makes the replicate queue size configurable via a cluster setting, ReplicateQueueMaxSize, allowing incremental and backport-friendly adjustments. Note that reducing the setting does not yet drop excess replicas appropriately; future commits will address this behavior.
This commit adds tests to (1) verify metric updates when replicas are dropped from the queue, and (2) ensure the cluster setting for ReplicateQueueMaxSize works correctly.
Previously, the ReplicateQueueMaxSize cluster setting allowed dynamic adjustment of the replicate queue’s maximum size. However, decreasing this setting did not properly drop excess replicas. This commit fixes that by removing replicas when the queue’s max size is lowered.
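Below is a self-contained sketch (assumed names, not the actual replicate queue code) of how a size-capped heap might shed excess items when its configurable maximum is lowered, while bumping a drop counter for observability. As noted earlier, evicting from the tail of the backing slice keeps the heap valid but does not guarantee the dropped item is the lowest priority:

```go
package main

import (
	"container/heap"
	"fmt"
)

// items is a max-heap of priorities; a stand-in for queued replica items.
type items []float64

func (h items) Len() int           { return len(h) }
func (h items) Less(i, j int) bool { return h[i] > h[j] } // max-heap on priority
func (h items) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *items) Push(x any)        { *h = append(*h, x.(float64)) }
func (h *items) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// trimToMaxSize drops items until the queue fits the (possibly lowered) cap.
// It removes from the tail of the backing slice via heap.Remove, which keeps
// the heap valid but is not guaranteed to drop the lowest-priority item.
func trimToMaxSize(h *items, maxSize int, onDrop func(float64)) {
	for h.Len() > maxSize {
		dropped := heap.Remove(h, h.Len()-1).(float64)
		onDrop(dropped)
	}
}

func main() {
	h := &items{}
	for _, p := range []float64{5, 1, 9, 3, 7} {
		heap.Push(h, p)
	}
	dropCount := 0 // stands in for the "dropped due to full queue" metric
	// Simulate the cluster setting being lowered from 5 to 3.
	trimToMaxSize(h, 3, func(float64) { dropCount++ })
	fmt.Println("remaining:", h.Len(), "dropped:", dropCount)
}
```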
This commit improves the clarity around the naming and description of the metrics.
This commit adds a new cluster setting PriorityInversionRequeue that controls whether the replicate queue should requeue replicas when their priority at enqueue time differs significantly from their priority at processing time (e.g. dropping from top 3 to the lowest priority).
Previously, a replica could enter the queue with high priority but, by the time it was processed, the action planned for it might have a low priority, causing us to perform low-priority work. Specifically, we are mostly worried about cases where the priority changes from any of the repair actions to ConsiderRebalance. Rebalancing can take a long time and block other ranges that were enqueued because they need an actual repair action. This commit ensures that such replicas are requeued instead, avoiding priority inversions.
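A minimal sketch of the requeue decision described above; the priority constants, threshold, and function name are assumptions for illustration, not the actual allocator values:

```go
package main

import "fmt"

// Illustrative sketch only: the constants and the notion of "repair vs.
// rebalance" priorities here are assumptions, not the real allocator code.
const (
	repairPriority    = 1000.0 // e.g. finalize atomic change, remove learner
	rebalancePriority = 0.0    // ConsiderRebalance-style low-priority work
)

// shouldRequeueOnInversion reports whether a replica that was enqueued at a
// high (repair) priority but now plans only low-priority (rebalance) work
// should be put back in the queue at its new, lower priority instead of
// occupying a processing slot ahead of genuinely urgent replicas.
func shouldRequeueOnInversion(enqueuePriority, processPriority float64, settingEnabled bool) bool {
	if !settingEnabled {
		return false
	}
	return enqueuePriority >= repairPriority && processPriority <= rebalancePriority
}

func main() {
	// Enqueued to finalize a replication change, but by processing time the
	// change already completed elsewhere and only a rebalance remains.
	fmt.Println(shouldRequeueOnInversion(1000, 0, true))  // true: requeue
	fmt.Println(shouldRequeueOnInversion(1000, 0, false)) // false: setting off
}
```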
Previously, replicateQueue used V(2) to log info on priority-inverted replicas because I wanted visibility into every case without missing any replicas. On reflection, the individual cases aren’t that interesting; it’s the overall volume that matters, which we can track through metrics. This commit changes it so that we rate-limit priority inversion logging to once every 3 seconds.
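For illustration, a minimal stand-in for the 3-second rate limiting (CockroachDB has its own logging helpers for this; the type below is purely a sketch):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// everyN is a minimal stand-in for a rate-limited logger; it only illustrates
// the "at most once per interval" behavior described above.
type everyN struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
}

// shouldLog returns true at most once per interval.
func (e *everyN) shouldLog() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	now := time.Now()
	if now.Sub(e.last) < e.interval {
		return false
	}
	e.last = now
	return true
}

func main() {
	limiter := &everyN{interval: 3 * time.Second}
	for i := 0; i < 3; i++ {
		if limiter.shouldLog() {
			fmt.Println("priority inversion observed (rate-limited log)", i)
		}
	}
	// Only the first iteration logs; the rest are suppressed for 3 seconds,
	// while a metric would still count every inversion.
}
```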
This commit improves the comments for PriorityInversionRequeue and clarifies the contracts around action.Priority().
This commit refactors CheckPriorityInversion.
This commit adds the TestAllocatorPriorityInvariance test, which acts as a regression safeguard when new actions are added to AllocatorAction, ensuring the contract is upheld. See action.Priority() and ComputeAction() for more details on the contract.
Previously, we introduced the PriorityInversionRequeue cluster setting, intended for backport, to handle cases where a range was enqueued with a high-priority repair action but, at processing time, a low-priority rebalance action was computed. In such cases, the caller re-adds the range to the queue under its updated priority. Although the cluster setting guards this requeue behavior, the inversion check always ran unconditionally, reducing backport safety. This commit updates the logic so that the cluster setting guards both the inversion check and the requeue behavior.
Previously, we checked for priority inversion before planning errors, which meant we could return requeue = true even when a planning error occurred. This commit changes it so that planning errors take precedence over a priority inversion. rq.processOneChange now returns early if there is a planning error and only checks for priority inversion right before applying a change.
Previously, we checked for requeue right before returning for both nil and non-nil errors, making the code harder to follow. This commit refactors replicateQueue.process to requeue replicas before checking for errors.
maybeBackpressureBatch registers a callback with the split queue for replicas that are too large relative to their split size. This backpressures the range to stop it from growing and prevent new writes from blocking a pending split. The callback is invoked when the split queue finishes processing the replica. Previously, the error channel used in the callback had a size of 1 and performed blocking sends. This was safe because the base queue only sent a single error, and by the time maybeBackpressureBatch returned, the callback was guaranteed to have been consumed, so no additional sends would occur. Future commits will allow the callback to be invoked multiple times (although it should be invoked at most twice). To be safe and avoid potential deadlocks from multiple sends after maybeBackpressureBatch has already returned, this commit makes the error send non-blocking. If the channel is already full, the error is dropped, which is acceptable since we only care about observing the completion of the replica processing at least once.
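The non-blocking send boils down to a select with a default case. A small, self-contained sketch (illustrative names, not the actual callback code):

```go
package main

import (
	"errors"
	"fmt"
)

// sendNonBlocking attempts to deliver a result on a channel of capacity 1.
// If a second result arrives after the first was never consumed, it is
// dropped rather than blocking (and potentially deadlocking) the sender.
func sendNonBlocking(ch chan error, err error) (delivered bool) {
	select {
	case ch <- err:
		return true
	default:
		// Channel already holds a result; dropping is fine because the caller
		// only needs to observe completion at least once.
		return false
	}
}

func main() {
	ch := make(chan error, 1)
	fmt.Println(sendNonBlocking(ch, nil))                         // true: buffered
	fmt.Println(sendNonBlocking(ch, errors.New("second result"))) // false: dropped
}
```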
baseQueue.Async may return immediately as a no-op if the semaphore does not have available capacity and the wait parameter is false. Previously, this case returned no error, leaving the caller unaware that the request was dropped. This commit changes the behavior to return a baseQueueAsyncRateLimited error, allowing callers to detect and handle the condition.
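A self-contained sketch of the behavior change, assuming a channel-based semaphore and an illustrative error name rather than the real baseQueue internals:

```go
package main

import (
	"errors"
	"fmt"
)

// errAsyncRateLimited is an illustrative stand-in for the new error.
var errAsyncRateLimited = errors.New("async task rate limited: semaphore full")

type asyncRunner struct {
	sem chan struct{} // buffered channel used as a semaphore
}

// run tries to acquire the semaphore. With wait=false and no free slot it now
// returns an error instead of silently dropping the task.
func (r *asyncRunner) run(wait bool, task func()) error {
	if wait {
		r.sem <- struct{}{} // block until a slot frees up
	} else {
		select {
		case r.sem <- struct{}{}:
		default:
			return errAsyncRateLimited // previously: silent no-op
		}
	}
	go func() {
		defer func() { <-r.sem }()
		task()
	}()
	return nil
}

func main() {
	r := &asyncRunner{sem: make(chan struct{}, 1)}
	_ = r.run(false, func() { select {} }) // occupies the only slot
	fmt.Println(r.run(false, func() {}))   // caller now sees the rate-limit error
}
```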
The base queue already supports registering callbacks that are invoked with the processing result of replicas once they are processed. However, replicas may fail before reaching that stage (e.g., failing to enqueue or being dropped early). This commit extends the mechanism to also report enqueue results, allowing callers to detect failures earlier. Currently, only decommissioningNudger.maybeEnqueueProblemRange uses this. Note that one behavior change is introduced: previously, a registered callback would fire only once with the processing result and not again if the replica was later processed by the purgatory queue. With this change, the callback may now be invoked twice.
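A minimal sketch of a callback registry that can report both stages; all names are illustrative, not the real base queue types:

```go
package main

import "fmt"

// processCallback receives a stage label ("enqueue" or "process") and the
// result for that stage. Names here are stand-ins for illustration only.
type processCallback func(stage string, err error)

type miniQueue struct {
	callbacks map[int][]processCallback // keyed by a fake range ID
}

func (q *miniQueue) register(rangeID int, cb processCallback) {
	q.callbacks[rangeID] = append(q.callbacks[rangeID], cb)
}

func (q *miniQueue) notify(rangeID int, stage string, err error) {
	for _, cb := range q.callbacks[rangeID] {
		cb(stage, err)
	}
}

func main() {
	q := &miniQueue{callbacks: map[int][]processCallback{}}
	q.register(7, func(stage string, err error) {
		fmt.Printf("r7 %s result: %v\n", stage, err)
	})
	// With this change a single registered callback can fire more than once:
	// once with the enqueue result and again with the processing result.
	q.notify(7, "enqueue", nil)
	q.notify(7, "process", nil)
}
```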
This commit adds TestBaseQueueCallbackOnEnqueueResult and TestBaseQueueCallbackOnProcessResult to verify that callbacks are correctly invoked with both enqueue and process results.
Previously, we updated bq.enqueueAdd inside the defer statement of addInternal. This was incorrect because we may return queued = true for a replica that is already processing and was merely marked for requeue. That replica would later be requeued in finishProcessingReplica, incrementing the metric again and leading to double counting.
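To illustrate the double counting and the fix, a small self-contained sketch with stand-in names (not the actual baseQueue metrics or state):

```go
package main

import "fmt"

// counter stands in for the enqueue-add metric.
type counter int

type item struct {
	processing       bool
	requeueRequested bool
}

// addInternal previously bumped the metric in a defer based only on
// queued==true. For a replica that is still processing, we return queued=true
// and merely mark it for requeue; the later requeue would bump the counter
// again, double counting. The fix is to count only when the item is actually
// inserted into the priority queue.
func addInternal(it *item, enqueueAdds *counter) (queued bool) {
	if it.processing {
		it.requeueRequested = true
		return true // will be requeued later; nothing was inserted now
	}
	*enqueueAdds++ // count only real insertions
	return true
}

func main() {
	var adds counter
	busy := &item{processing: true}
	_ = addInternal(busy, &adds) // marked for requeue, not counted
	busy.processing = false
	_ = addInternal(busy, &adds) // the actual requeue is counted once
	fmt.Println("enqueueAdds:", adds) // 1
}
```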
This commit extends TestBaseQueueCallback* and TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.
Previously, replicas could be enqueued at a high priority but end up processing a lower-priority action, causing priority inversion and unfairness to other replicas behind them that need a repair action. This commit adds metrics to track such cases. It also adds metrics to track when replicas are requeued in the replicate queue due to a priority inversion from a repair action to a rebalance action.
Previously, we added a priority inversion requeuing mechanism. This commit adds a unit test that forces the race condition we suspected was happening in escalations involving priority inversion, and asserts that the inversion occurs and that the replica is correctly requeued. Test setup:
1. The range’s leaseholder replica is rebalanced from one store to another.
2. The new leaseholder enqueues the replica for repair with high priority (e.g. to finalize the atomic replication change or remove a learner replica).
3. Before processing, the old leaseholder completes the change (exits the joint config or removes the learner).
4. When the new leaseholder processes the replica, it computes a ConsiderRebalance action, resulting in a priority inversion and potentially blocking other high-priority work.
This commit removes per-action priority inversion metrics due to their high cardinality. We already have logging in place, which should provide sufficient observability. For now, what we care about most is priority inversion that leads to a ConsiderRebalance action and requeuing.
Thanks for opening a backport. Before merging, please confirm that it falls into one of the following categories (select one):
Add a brief release justification to the PR description explaining your selection. Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy. All backports must be reviewed by the TL and EM for the owning area.

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

❌ PR #153062 does not comply with backport policy Confidence: high 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Putting here as a reminder to myself: we will need to backport #153008 as well.

TFTR!
Merged commit b8345b7 into cockroachdb:release-24.3.20-rc
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: critical potential fix and observability improvement for decommission stall