kvserver: requeue on priority inversion for replicate queue #152596

wenyihu6 · 2025-08-27T16:09:29Z

Part of: #152022
Release note: none

kvserver: add PriorityInversionRequeue

This commit adds a new cluster setting PriorityInversionRequeue that controls
whether the replicate queue should requeue replicas when their priority at
enqueue time differs significantly from their priority at processing time
(e.g. dropping from top 3 to the lowest priority).

kvserver: requeue on priority inversion for replicate queue

Previously, a replica could enter the queue with high priority but, by the time
it was processed, the action planned for this replica may have a low priority,
causing us to perform low priority work. Specifically, we are mostly worried
about cases where the priority changes from any of the repair actions to
consider rebalance. Rebalancing could take a long time and block other ranges
enqueued with actual repair action needed. This commit ensures that such
replicas are requeued instead, avoiding priority inversions.

kvserver: use priorityInversionLogEveryN

Previously, replicateQueue used V(2) to log info on priority inverted replicas
because I wanted visibility into every case without missing any replicas. On
reflection, the individual cases aren’t that interesting - it’s the overall
volume that matters, which we can track through metrics. This commit changes it
so that we just rate limit priority inversions every 3 seconds.

kvserver: improve comments for PriorityInversionRequeue

This commit improves the comments for PriorityInversionRequeue and clarifies the
contracts around action.Priority().

allocator: small refactor for CheckPriorityInversion

This commit refactors CheckPriorityInversion.

allocator: add TestAllocatorPriorityInvariance

This commit adds the TestAllocatorPriorityInvariance test, which acts as a
regression safeguard when new actions are added to AllocatorAction, ensuring the
contract is upheld. See action.Priority() and ComputeAction() for more details
on the contract.

kvserver: guard inversion check and requeue behind PriorityInversionRequeue

Previously, we introduced the PriorityInversionRequeue cluster setting, intended
for backport, to handle cases where a range was enqueued with a high-priority
repair action but, at processing time, a low-priority rebalance action was
computed. In such cases, the caller re-adds the range to the queue under its
updated priority. Although the cluster setting guards this requeue behavior, the
inversion check always ran unconditionally, reducing backport safety. This
commit updates the logic so that the cluster setting guard both the inversion
check and the requeue behavior.

kvserver: move priority inversion check before applyChange

Previously, we checked for priority inversion before planning errors, which
meant we could return requeue = true even when a planning error occurred. This
commit changes it so that planning errors should take higher precedence over a
priority inversion error. rq.processOneChange now returns early if there is a
planning error and only check for priority inversion right before applying a
change.

kvserver: check for requeue before error checking in rq.process

Previously, we checked for requeue right before returning for both nil and
non-nil errors, making the code harder to follow. This commit refactors
replicateQueue.process to requeue replicas before checking for errors.

cockroach-teamcity · 2025-08-27T16:09:43Z

This change is

wenyihu6 · 2025-08-27T18:53:07Z

@tbg @sumeerbhola Tagged you as reviewer since we discussed this and wanted to keep you in the loop. No pressure to actually review it.

sumeerbhola

@sumeerbhola reviewed 1 of 1 files at r1, 1 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @tbg)

pkg/kv/kvserver/replicate_queue.go line 107 at r1 (raw file):

// PriorityInversionRequeue is a setting that controls whether to requeue
// replicas when their priority at enqueue time and processing time is inverted
// too much (e.g. dropping from a repair cation to AllocatorConsiderRebalance).

nit: action

pkg/kv/kvserver/replicate_queue.go line 902 at r2 (raw file):

		log.KvDistribution.Infof(ctx,
			"priority inversion during process: shouldRequeue = %t action=%s, priority=%v, enqueuePriority=%v",
			shouldRequeue, change.Action, change.Action.Priority(), priorityAtEnqueue)

Can this be spammy. Should it use log.EveryN?

pkg/kv/kvserver/replicate_queue.go line 904 at r2 (raw file):

			shouldRequeue, change.Action, change.Action.Priority(), priorityAtEnqueue)
	}
	if PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {

This could be nested inside if inversion if the question about the invariant below is true.
And could structure this as:

if inversion {
  ...
  if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {
     ...
  }
}

to avoid reading the cluster setting in the common case.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3262 at r2 (raw file):

// returns whether there was a priority inversion and whether the caller should
// skip the processing of the range since the inversion is considered unfair.
// Currently, we only consider the inversion as unfair if it has went from a

nit: s/went/gone/

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3264 at r2 (raw file):

// Currently, we only consider the inversion as unfair if it has went from a
// repair action to lowest priority (AllocatorConsiderRebalance). We let
// AllocatorRangeUnavailable, AllocatorNoop pass through since they are noop.

Is there an invariant: shouldRequeue => isInversion?
If yes, please document it.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3272 at r2 (raw file):

	// a very high priority (1e5). In those cases, we do not want to requeue
	// these actions or count it as an inversion.
	if priorityAtEnqueue == -1 || !withinPriorityRange(priorityAtEnqueue) {

what is the -1 case?

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @sumeerbhola, and @tbg)

pkg/kv/kvserver/replicate_queue.go line 107 at r1 (raw file):

Previously, sumeerbhola wrote…

nit: action

Fixed.

pkg/kv/kvserver/replicate_queue.go line 902 at r2 (raw file):

Previously, sumeerbhola wrote…

Can this be spammy. Should it use log.EveryN?

Done for every 3 seconds.

pkg/kv/kvserver/replicate_queue.go line 904 at r2 (raw file):

Previously, sumeerbhola wrote…

This could be nested inside if inversion if the question about the invariant below is true.
And could structure this as:
if inversion {
  ...
  if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {
     ...
  }
}
to avoid reading the cluster setting in the common case.

Good call, done.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3262 at r2 (raw file):

Previously, sumeerbhola wrote…

nit: s/went/gone/

Done.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3264 at r2 (raw file):

Previously, sumeerbhola wrote…

Is there an invariant: shouldRequeue => isInversion?
If yes, please document it.

Added a note.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3272 at r2 (raw file):

Previously, sumeerbhola wrote…

what is the -1 case?

Added a comment to explain this:

	// priorityAtEnqueue of -1 is a special case reserved for processing logic to
	// run even if there’s a priority inversion. If the priority is not -1, the
	// range may be re-queued to be processed with the correct priority. It is
	// used for things that call into baseQueue.process without going through the
	// replicate queue. For example, s.ReplicateQueueDryRun or
	// r.scatterRangeAndRandomizeLeases.

See more details in #152512 which plumbs these -1. This PR has been merged, but let me know if you have any comments and I can open another PR to address them.

sumeerbhola

Someone who knows this code better should approve

@sumeerbhola reviewed 1 of 1 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @tbg, and @wenyihu6)

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3272 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Added a comment to explain this:
	// priorityAtEnqueue of -1 is a special case reserved for processing logic to
	// run even if there’s a priority inversion. If the priority is not -1, the
	// range may be re-queued to be processed with the correct priority. It is
	// used for things that call into baseQueue.process without going through the
	// replicate queue. For example, s.ReplicateQueueDryRun or
	// r.scatterRangeAndRandomizeLeases.
See more details in #152512 which plumbs these -1. This PR has been merged, but let me know if you have any comments and I can open another PR to address them.

Ack

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3266 at r6 (raw file):

// AllocatorRangeUnavailable, AllocatorNoop pass through since they are noop.
//
// NB: If shouldRequeue is true, isInversion must be true.

nit: I think it is easier for the reader to say something less verbose like
// INVARIANT: shouldRequeue => isInversion

tbg

Flushing some comments. Generally looks good! The priority rounding is a bit of a rangle, but I don't see how to easily avoid that (except if we allow more inversions).

tbg · 2025-08-28T13:16:06Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+// WithinPriorityRange checks if a priority is within the range of possible
+// priorities for the allocator actions.
+func withinPriorityRange(priority float64) bool {
+	return AllocatorNoop.Priority() <= priority && priority <= AllocatorFinalizeAtomicReplicationChange.Priority()


it would be cleaner to add a sentinel value to the enum and to check on it as the upper bound here.

Also, consider inlining this function. It doesn't make much sense when looked at without context, and it's only used in one place. It also isn't being tested directly, nor would that make enough sense.

Added AllocatorMaxPriority for the upper bound and added a NB on AllocatorNoop should have the lowest priority. Also put this function under CheckPriorityInversion.

tbg · 2025-08-28T13:22:07Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+	priorityAtEnqueue float64, actionAtProcessing AllocatorAction,
+) (isInversion bool, shouldRequeue bool) {
+	// priorityAtEnqueue of -1 is a special case reserved for processing logic to
+	// run even if there’s a priority inversion. If the priority is not -1, the


This middle sentence ("If the priority is not -1, ...") is a bit of a non sequitur. The surrounding paragraphs explain the special case of -1. I'd just say something along the lines of

// NB: priorityAtEnqueue is -1 for callers such as scatter, dry runs, and manual queue runs.
// Priority inversion does not apply to these calls.

tbg · 2025-08-28T13:23:32Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+// CheckPriorityInversion checks if the priority at enqueue time is higher than
+// the priority corresponding to the action computed at processing time. It
+// returns whether there was a priority inversion and whether the caller should
+// skip the processing of the range since the inversion is considered unfair.


isn't it "It returns whether there was a priority inversion (and the range should not be processed at this time, since doing so could starve higher-priority items), and whether the caller should re-add the range to the queue (presumably under its new priority)."

Clarified the comment.

tbg · 2025-08-28T13:26:46Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+	// AllocatorRemoveLearner, and AllocatorReplaceDeadVoter will be rounded to
+	// the same priority. They are so close to each other, so we don't really
+	// count it as an inversion among them.
+	normPriorityAtEnqueue := roundToNearestPriorityCategory(priorityAtEnqueue)


Can you update the priority enum with a hard-to-miss message that priorities should be "separated" by multiples of 100 (explaining why) and also update the places where we adjust the priority to explain why each adjustment is ok?

Added a NB comment and a regression unit test. I noticed that we are making some assumptions with this statement, but I think the assumptions should be true. (Update: After some thinking, I don't know if I can say the second assumption should be true. We can revisit them.)

Update: After some thinking, I don't know if I can say the assumptions (especially the second one) should be true. I should revisit them if strong guarantees are required. It should be fine for now since priority inversion re-queuing only care about repair action v.s. consider rebalance. I added a comment to warn new uses.

You're saying that with sufficiently high replication factors, some of the priorities may be adjusted by more than 50, right? I can live with this, add a comment about it and let's move on.

tbg · 2025-08-28T13:27:34Z

pkg/kv/kvserver/replicate_queue.go

+	"kv.priority_inversion_requeue_replicate_queue.enabled",
+	"whether the requeue replicas should requeue when enqueued for "+
+		"repair action but ended up consider rebalancing during processing",
+	false,


then we land a separate PR on master that flips this to true (so that we get some exposure), and backport only this PR (keep default value false) if we backport?

Good call, added a TODO to remind myself after this PR lands.

Put up #152749 which stacks on top of this PR.

tbg

@tbg dismissed @sumeerbhola from 3 discussions.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @sumeerbhola, and @wenyihu6)

pkg/kv/kvserver/replicate_queue.go line 910 at r6 (raw file):

		}

		if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {

It would be good to backport something that is a no-op unless opted into. But this code is not a no-op, the user can only control whether to requeue or skip. Should we make the cluster setting about enabling the inversion detection mechanism as a whole, and not have a way to toggle just the requeue? That way, we can backport the thing as off, our customer can turn it on, and we also turn it on on master so that it will ship with the next release for everyone else. If the requeuing becomes a problem, we'll find out and users can toggle off the entire inversion mechanism as a temporary work-around.

pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

		if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {
			// Return true here to requeue the range. We can't return an error here

We can't return an error but we return an error? Something is off here. I agree (reading the code) that requeuing has no effect if an error is returned.

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani, @sumeerbhola, and @tbg)

pkg/kv/kvserver/replicate_queue.go line 910 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It would be good to backport something that is a no-op unless opted into. But this code is not a no-op, the user can only control whether to requeue or skip. Should we make the cluster setting about enabling the inversion detection mechanism as a whole, and not have a way to toggle just the requeue? That way, we can backport the thing as off, our customer can turn it on, and we also turn it on on master so that it will ship with the next release for everyone else. If the requeuing becomes a problem, we'll find out and users can toggle off the entire inversion mechanism as a temporary work-around.

Added a commit to make PriorityInversionRequeue guard both inversion check and requeuing.

pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We can't return an error but we return an error? Something is off here. I agree (reading the code) that requeuing has no effect if an error is returned.

Stale comment, updated for clarity. Originally, I considered returning requeue = true, err = nil to allow requeuing on error, but instead chose this approach to ensure correct logging and error metric tracking in processReplica and finishProcessingReplica.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3266 at r6 (raw file):

Previously, sumeerbhola wrote…

nit: I think it is easier for the reader to say something less verbose like
// INVARIANT: shouldRequeue => isInversion

Done.

wenyihu6 · 2025-08-29T19:23:13Z

pkg/kv/kvserver/replicate_queue.go

+	"kv.priority_inversion_requeue_replicate_queue.enabled",
+	"whether the requeue replicas should requeue when enqueued for "+
+		"repair action but ended up consider rebalancing during processing",
+	false,


Good call, added a TODO to remind myself after this PR lands.

wenyihu6 · 2025-08-29T19:23:13Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+// WithinPriorityRange checks if a priority is within the range of possible
+// priorities for the allocator actions.
+func withinPriorityRange(priority float64) bool {
+	return AllocatorNoop.Priority() <= priority && priority <= AllocatorFinalizeAtomicReplicationChange.Priority()


Added AllocatorMaxPriority for the upper bound and added a NB on AllocatorNoop should have the lowest priority. Also put this function under CheckPriorityInversion.

wenyihu6 · 2025-08-29T19:23:13Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+// CheckPriorityInversion checks if the priority at enqueue time is higher than
+// the priority corresponding to the action computed at processing time. It
+// returns whether there was a priority inversion and whether the caller should
+// skip the processing of the range since the inversion is considered unfair.


Clarified the comment.

wenyihu6 · 2025-08-29T19:23:14Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+	priorityAtEnqueue float64, actionAtProcessing AllocatorAction,
+) (isInversion bool, shouldRequeue bool) {
+	// priorityAtEnqueue of -1 is a special case reserved for processing logic to
+	// run even if there’s a priority inversion. If the priority is not -1, the


wenyihu6 · 2025-08-29T19:23:14Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+	// AllocatorRemoveLearner, and AllocatorReplaceDeadVoter will be rounded to
+	// the same priority. They are so close to each other, so we don't really
+	// count it as an inversion among them.
+	normPriorityAtEnqueue := roundToNearestPriorityCategory(priorityAtEnqueue)


Added a NB comment and a regression unit test. I noticed that we are making some assumptions with this statement, but I think the assumptions should be true. (Update: After some thinking, I don't know if I can say the second assumption should be true. We can revisit them.)

wenyihu6 · 2025-08-29T19:29:36Z

pkg/kv/kvserver/replicate_queue.go

+	"kv.priority_inversion_requeue_replicate_queue.enabled",
+	"whether the requeue replicas should requeue when enqueued for "+
+		"repair action but ended up consider rebalancing during processing",
+	false,


Put up #152749 which stacks on top of this PR.

tbg

Almost there!

@tbg reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 1 of 1 files at r11, 1 of 1 files at r12, 1 of 1 files at r13, all commit messages.
@tbg dismissed @sumeerbhola from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @wenyihu6)

pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Stale comment, updated for clarity. Originally, I considered returning requeue = true, err = nil to allow requeuing on error, but instead chose this approach to ensure correct logging and error metric tracking in processReplica and finishProcessingReplica.

But won't requeue=true be ignored if there is an error? This gets returned up here

cockroach/pkg/kv/kvserver/replicate_queue.go

Lines 699 to 701 in 0a31197

    
           if err != nil { 
        
           	return false, err 
        
           }

and requeue is never acted upon. So what's the point of giving the impression that requeuing will happen when it isn't actually happening? IMO we shouldn't enter this branch on error at all, an error when computing a plan is not an inversion.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 263 at r7 (raw file):

// AllocatorRemoveDeadVoter, and AllocatorRemoveVoter, the priority may be
// adjusted (see ComputeAction for details), but the adjustment is expected to
// be small (<100).

<49, right?

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

// NB: The returned priority may include a small adjustment and therefore might
// not exactly match action.Priority(). See AllocatorAddVoter,
// AllocatorRemoveDeadVoter, AllocatorRemoveVoter below. The adjustment should be <100.

<49

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

// NB: The returned priority may include a small adjustment and therefore might
// not exactly match action.Priority(). See AllocatorAddVoter,
// AllocatorRemoveDeadVoter, AllocatorRemoveVoter below. The adjustment should be <100.

// The adjustment should be <49 (see AllocatorAction.Priority()).

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 997 at r7 (raw file):

// AllocatorRemoveDeadVoter, AllocatorRemoveVoter below. The adjustment should be <100.
//
// The claim that the adjustment is < 100 has two assumptions:

49 and the other numbers below also need touching up.

tbg · 2025-09-01T08:52:37Z

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go

+	// AllocatorRemoveLearner, and AllocatorReplaceDeadVoter will be rounded to
+	// the same priority. They are so close to each other, so we don't really
+	// count it as an inversion among them.
+	normPriorityAtEnqueue := roundToNearestPriorityCategory(priorityAtEnqueue)


You're saying that with sufficiently high replication factors, some of the priorities may be adjusted by more than 50, right? I can live with this, add a comment about it and let's move on.

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @tbg)

pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

won't requeue=true be ignored if there is an error?

I changed it so that we may requeue on errors as well in this commit:

cockroach/pkg/kv/kvserver/replicate_queue.go

Lines 698 to 704 in d75f145

    
           if err != nil { 
        
           	if requeue { 
        
           		log.KvDistribution.VEventf(ctx, 1, "re-queuing on errors: %v", err) 
        
           		rq.maybeAdd(ctx, repl, rq.store.Clock().NowAsClockTimestamp()) 
        
           	} 
        
           	return false, err 
        
           }

. Without this change, requeuing wouldn’t occur unless we return requeue = true, err = nil instead. My intention was to treat priority inversion as an error (for logging and metrics in finishProcessingReplica) while still making it requeueable. Does this seem like a right decision to you?

Good call on we shouldn't enter this branch on error at all. I moved the priority inversion check down below so that we return early if there is a planning error to begin with.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 263 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

<49, right?

Good catch, done.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

<49

Fixed.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// The adjustment should be <49 (see AllocatorAction.Priority()).

Fixed.

pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 997 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

49 and the other numbers below also need touching up.

Fixed.

tbg

@tbg reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @wenyihu6)

pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

won't requeue=true be ignored if there is an error?

I changed it so that we may requeue on errors as well in this commit:

cockroach/pkg/kv/kvserver/replicate_queue.go

Lines 698 to 704 in d75f145

if err != nil {

if requeue {

log.KvDistribution.VEventf(ctx, 1, "re-queuing on errors: %v", err)

rq.maybeAdd(ctx, repl, rq.store.Clock().NowAsClockTimestamp())

}

return false, err

}

. Without this change, requeuing wouldn’t occur unless we return requeue = true, err = nil instead. My intention was to treat priority inversion as an error (for logging and metrics in finishProcessingReplica) while still making it requeueable. Does this seem like a right decision to you?

Good call on we shouldn't enter this branch on error at all. I moved the priority inversion check down below so that we return early if there is a planning error to begin with.

Oh, I missed that - thank you. The code change looks innocuous enough, so I'm okay with this semantic change as long as we are sure it's only getting hit for this new error return path. I'll resolve this but left a comment on the requeue logic.

pkg/kv/kvserver/replicate_queue.go line 700 at r15 (raw file):

		if err != nil {
			if requeue {

Could this code be restructured as follows:

if requeue { ... }
if err != nil { return false, err }
if testingAggressive.... {}

// deleted the requeue branch that was here
return true, nil

You could also add a comment on why we're requeuing in both cases.

I assume you are confident we weren't requeuing on any other errors before, right? With an eye towards backporting that would be good to know for sure.

tbg

@tbg reviewed 1 of 1 files at r14, 1 of 1 files at r15.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @wenyihu6)

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @tbg)

pkg/kv/kvserver/replicate_queue.go line 700 at r15 (raw file):
Would you be okay if I addressed this in a follow-up PR? This one is getting long, and I assume we’d prefer not to backport the refactoring. I can add the comments now if you prefer.

I assume you are confident we weren't requeuing on any other errors before, right?
Yes, this is the only place before where we return requeue = true

cockroach/pkg/kv/kvserver/replicate_queue.go

Lines 879 to 882 in ce69299

// Otherwise, requeue to see if there is more work to do. As the

// operation succeeded and was planned for a repair action i.e. not

// rebalancing.

requeue = true

and it's with a nil error

cockroach/pkg/kv/kvserver/replicate_queue.go

Line 972 in ce69299

return ShouldRequeue(ctx, change, conf), nil

.

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @tbg)

pkg/kv/kvserver/replicate_queue.go line 700 at r15 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Would you be okay if I addressed this in a follow-up PR? This one is getting long, and I assume we’d prefer not to backport the refactoring. I can add the comments now if you prefer.

I assume you are confident we weren't requeuing on any other errors before, right?
Yes, this is the only place before where we return requeue = true

cockroach/pkg/kv/kvserver/replicate_queue.go

Lines 879 to 882 in ce69299

// Otherwise, requeue to see if there is more work to do. As the

// operation succeeded and was planned for a repair action i.e. not

// rebalancing.

requeue = true

and it's with a nil error

cockroach/pkg/kv/kvserver/replicate_queue.go

Line 972 in ce69299

return ShouldRequeue(ctx, change, conf), nil

.

I ended up just doing the refactoring in this PR. Doesn't look too complicated to backport.

tbg

Nice work!

@tbg reviewed 1 of 1 files at r16, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)

pkg/kv/kvserver/replicate_queue.go line 698 at r16 (raw file):
Just a style nit: rather than "claiming" that only a particular case hits this (which may just not be true forever and which might confuse the reader who may have no context on where or why that happens), I'd say something like

// Most requeuing operations return a nil error, but we intentionally also requeue on error, if this is requested. At the time of writing, priority inversions requeue while also returning an error.

This commit adds a new cluster setting PriorityInversionRequeue that controls whether the replicate queue should requeue replicas when their priority at enqueue time differs significantly from their priority at processing time (e.g. dropping from top 3 to the lowest priority).

Previously, a replica could enter the queue with high priority but, by the time it was processed, the action planned for this replica may have a low priority, causing us to perform low priority work. Specifically, we are mostly worried about cases where the priority changes from any of the repair actions to consider rebalance. Rebalancing could take a long time and block other ranges enqueued with actual repair action needed. This commit ensures that such replicas are requeued instead, avoiding priority inversions.

Previously, replicateQueue used V(2) to log info on priority inverted replicas because I wanted visibility into every case without missing any replicas. On reflection, the individual cases aren’t that interesting - it’s the overall volume that matters, which we can track through metrics. This commit changes it so that we just rate limit priority inversions every 3 seconds.

This commit improves the comments for PriorityInversionRequeue and clarifies the contracts around action.Priority().

This commit refactors CheckPriorityInversion.

This commit adds the TestAllocatorPriorityInvariance test, which acts as a regression safeguard when new actions are added to AllocatorAction, ensuring the contract is upheld. See action.Priority() and ComputeAction() for more details on the contract.

…equeue Previously, we introduced the PriorityInversionRequeue cluster setting, intended for backport, to handle cases where a range was enqueued with a high-priority repair action but, at processing time, a low-priority rebalance action was computed. In such cases, the caller re-adds the range to the queue under its updated priority. Although the cluster setting guards this requeue behavior, the inversion check always ran unconditionally, reducing backport safety. This commit updates the logic so that the cluster setting guard both the inversion check and the requeue behavior.

Previously, we checked for priority inversion before planning errors, which meant we could return requeue = true even when a planning error occurred. This commit changes it so that planning errors should take higher precedence over a priority inversion error. rq.processOneChange now returns early if there is a planning error and only check for priority inversion right before applying a change.

Previously, we checked for requeue right before returning for both nil and non-nil errors, making the code harder to follow. This commit refactors replicateQueue.process to requeue replicas before checking for errors.

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @arulajmani and @tbg)

pkg/kv/kvserver/replicate_queue.go line 698 at r16 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just a style nit: rather than "claiming" that only a particular case hits this (which may just not be true forever and which might confuse the reader who may have no context on where or why that happens), I'd say something like

// Most requeuing operations return a nil error, but we intentionally also requeue on error, if this is requested. At the time of writing, priority inversions requeue while also returning an error.

Good point, done.

wenyihu6 · 2025-09-03T14:32:22Z

TFTR!

bors r=tbg

craig · 2025-09-03T14:32:30Z

👎 Rejected by code reviews

Got Sumeer's approval to do so since Sumeer already gave his stamp but left the final approval to others.

wenyihu6 · 2025-09-03T14:53:38Z

TFTR!

bors r=tbg

craig · 2025-09-03T15:46:17Z

Build succeeded:

wenyihu6 force-pushed the newpriorityinversion branch 3 times, most recently from 60f2038 to d75f145 Compare August 27, 2025 18:49

wenyihu6 changed the title ~~kvserver: add cluster setting~~ kvserver: requeue on priority inversion for replicate queue Aug 27, 2025

wenyihu6 marked this pull request as ready for review August 27, 2025 18:50

wenyihu6 requested review from a team as code owners August 27, 2025 18:50

wenyihu6 requested review from arulajmani, sumeerbhola and tbg August 27, 2025 18:52

wenyihu6 mentioned this pull request Aug 27, 2025

kvserver: track priority inversion in replicate queue metrics #152624

Closed

sumeerbhola previously requested changes Aug 28, 2025

View reviewed changes

wenyihu6 commented Aug 28, 2025

View reviewed changes

wenyihu6 requested a review from sumeerbhola August 28, 2025 02:26

sumeerbhola reviewed Aug 28, 2025

View reviewed changes

tbg reviewed Aug 28, 2025

View reviewed changes

wenyihu6 commented Aug 29, 2025

View reviewed changes

wenyihu6 requested a review from tbg August 29, 2025 19:23

wenyihu6 mentioned this pull request Aug 29, 2025

kvserver: default PriorityInversionRequeue to true #152749

Merged

wenyihu6 commented Aug 29, 2025

View reviewed changes

wenyihu6 mentioned this pull request Aug 30, 2025

kvserver: track priority inversion in replicate queue metrics #152697

Merged

tbg requested changes Sep 1, 2025

View reviewed changes

wenyihu6 commented Sep 1, 2025

View reviewed changes

wenyihu6 requested a review from tbg September 2, 2025 01:36

tbg reviewed Sep 2, 2025

View reviewed changes

wenyihu6 commented Sep 2, 2025

View reviewed changes

wenyihu6 requested a review from tbg September 3, 2025 00:42

tbg approved these changes Sep 3, 2025

View reviewed changes

wenyihu6 added 9 commits September 3, 2025 08:59

kvserver: improve comments for PriorityInversionRequeue

763ecad

This commit improves the comments for PriorityInversionRequeue and clarifies the contracts around action.Priority().

allocator: small refactor for CheckPriorityInversion

ad4d147

This commit refactors CheckPriorityInversion.

kvserver: check for requeue before error checking in rq.process

b632b95

Previously, we checked for requeue right before returning for both nil and non-nil errors, making the code harder to follow. This commit refactors replicateQueue.process to requeue replicas before checking for errors.

wenyihu6 force-pushed the newpriorityinversion branch from 1fcb9b8 to b632b95 Compare September 3, 2025 12:59

wenyihu6 commented Sep 3, 2025

View reviewed changes

craig bot merged commit 9e0ba6b into cockroachdb:master Sep 3, 2025
24 checks passed

celeste-cockroachdb bot added the target-release-25.4.0 label Sep 3, 2025

This was referenced Sep 5, 2025

release-25.2.6-rc: kvserver: requeue on priority inversion for replicate queue #153052

Merged

release-24.3.20-rc: kvserver: requeue on priority inversion for replicate queue #153062

Merged

celeste-cockroachdb bot added v25.4.0-prerelease and removed target-release-25.4.0 labels Sep 22, 2025

	if err != nil {
	if requeue {
	log.KvDistribution.VEventf(ctx, 1, "re-queuing on errors: %v", err)
	rq.maybeAdd(ctx, repl, rq.store.Clock().NowAsClockTimestamp())
	}
	return false, err
	}

	// Otherwise, requeue to see if there is more work to do. As the
	// operation succeeded and was planned for a repair action i.e. not
	// rebalancing.
	requeue = true

kvserver: requeue on priority inversion for replicate queue #152596

kvserver: requeue on priority inversion for replicate queue #152596

Uh oh!

Conversation

wenyihu6 commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Aug 27, 2025

Uh oh!

wenyihu6 commented Aug 27, 2025

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

wenyihu6 left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wenyihu6 Aug 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wenyihu6 Aug 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

wenyihu6 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wenyihu6 Aug 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wenyihu6 commented Aug 27, 2025 •

edited

Loading

wenyihu6 Aug 29, 2025 •

edited

Loading

wenyihu6 Aug 31, 2025 •

edited

Loading

wenyihu6 Aug 29, 2025 •

edited

Loading