Skip to content

Conversation

@wenyihu6
Copy link
Contributor

@wenyihu6 wenyihu6 commented Aug 27, 2025

Part of: #152022
Release note: none


kvserver: add PriorityInversionRequeue

This commit adds a new cluster setting PriorityInversionRequeue that controls
whether the replicate queue should requeue replicas when their priority at
enqueue time differs significantly from their priority at processing time
(e.g. dropping from top 3 to the lowest priority).


kvserver: requeue on priority inversion for replicate queue

Previously, a replica could enter the queue with high priority but, by the time
it was processed, the action planned for this replica may have a low priority,
causing us to perform low priority work. Specifically, we are mostly worried
about cases where the priority changes from any of the repair actions to
consider rebalance. Rebalancing could take a long time and block other ranges
enqueued with actual repair action needed. This commit ensures that such
replicas are requeued instead, avoiding priority inversions.


kvserver: use priorityInversionLogEveryN

Previously, replicateQueue used V(2) to log info on priority inverted replicas
because I wanted visibility into every case without missing any replicas. On
reflection, the individual cases aren’t that interesting - it’s the overall
volume that matters, which we can track through metrics. This commit changes it
so that we just rate limit priority inversions every 3 seconds.


kvserver: improve comments for PriorityInversionRequeue

This commit improves the comments for PriorityInversionRequeue and clarifies the
contracts around action.Priority().


allocator: small refactor for CheckPriorityInversion

This commit refactors CheckPriorityInversion.


allocator: add TestAllocatorPriorityInvariance

This commit adds the TestAllocatorPriorityInvariance test, which acts as a
regression safeguard when new actions are added to AllocatorAction, ensuring the
contract is upheld. See action.Priority() and ComputeAction() for more details
on the contract.


kvserver: guard inversion check and requeue behind PriorityInversionRequeue

Previously, we introduced the PriorityInversionRequeue cluster setting, intended
for backport, to handle cases where a range was enqueued with a high-priority
repair action but, at processing time, a low-priority rebalance action was
computed. In such cases, the caller re-adds the range to the queue under its
updated priority. Although the cluster setting guards this requeue behavior, the
inversion check always ran unconditionally, reducing backport safety. This
commit updates the logic so that the cluster setting guard both the inversion
check and the requeue behavior.


kvserver: move priority inversion check before applyChange

Previously, we checked for priority inversion before planning errors, which
meant we could return requeue = true even when a planning error occurred. This
commit changes it so that planning errors should take higher precedence over a
priority inversion error. rq.processOneChange now returns early if there is a
planning error and only check for priority inversion right before applying a
change.


kvserver: check for requeue before error checking in rq.process

Previously, we checked for requeue right before returning for both nil and
non-nil errors, making the code harder to follow. This commit refactors
replicateQueue.process to requeue replicas before checking for errors.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the newpriorityinversion branch 3 times, most recently from 60f2038 to d75f145 Compare August 27, 2025 18:49
@wenyihu6 wenyihu6 changed the title kvserver: add cluster setting kvserver: requeue on priority inversion for replicate queue Aug 27, 2025
@wenyihu6 wenyihu6 marked this pull request as ready for review August 27, 2025 18:50
@wenyihu6 wenyihu6 requested review from a team as code owners August 27, 2025 18:50
@wenyihu6
Copy link
Contributor Author

@tbg @sumeerbhola Tagged you as reviewer since we discussed this and wanted to keep you in the loop. No pressure to actually review it.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumeerbhola reviewed 1 of 1 files at r1, 1 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @tbg)


pkg/kv/kvserver/replicate_queue.go line 107 at r1 (raw file):

// PriorityInversionRequeue is a setting that controls whether to requeue
// replicas when their priority at enqueue time and processing time is inverted
// too much (e.g. dropping from a repair cation to AllocatorConsiderRebalance).

nit: action


pkg/kv/kvserver/replicate_queue.go line 902 at r2 (raw file):

		log.KvDistribution.Infof(ctx,
			"priority inversion during process: shouldRequeue = %t action=%s, priority=%v, enqueuePriority=%v",
			shouldRequeue, change.Action, change.Action.Priority(), priorityAtEnqueue)

Can this be spammy. Should it use log.EveryN?


pkg/kv/kvserver/replicate_queue.go line 904 at r2 (raw file):

			shouldRequeue, change.Action, change.Action.Priority(), priorityAtEnqueue)
	}
	if PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {

This could be nested inside if inversion if the question about the invariant below is true.
And could structure this as:

if inversion {
  ...
  if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {
     ...
  }
}

to avoid reading the cluster setting in the common case.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3262 at r2 (raw file):

// returns whether there was a priority inversion and whether the caller should
// skip the processing of the range since the inversion is considered unfair.
// Currently, we only consider the inversion as unfair if it has went from a

nit: s/went/gone/


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3264 at r2 (raw file):

// Currently, we only consider the inversion as unfair if it has went from a
// repair action to lowest priority (AllocatorConsiderRebalance). We let
// AllocatorRangeUnavailable, AllocatorNoop pass through since they are noop.

Is there an invariant: shouldRequeue => isInversion?
If yes, please document it.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3272 at r2 (raw file):

	// a very high priority (1e5). In those cases, we do not want to requeue
	// these actions or count it as an inversion.
	if priorityAtEnqueue == -1 || !withinPriorityRange(priorityAtEnqueue) {

what is the -1 case?

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replicate_queue.go line 107 at r1 (raw file):

Previously, sumeerbhola wrote…

nit: action

Fixed.


pkg/kv/kvserver/replicate_queue.go line 902 at r2 (raw file):

Previously, sumeerbhola wrote…

Can this be spammy. Should it use log.EveryN?

Done for every 3 seconds.


pkg/kv/kvserver/replicate_queue.go line 904 at r2 (raw file):

Previously, sumeerbhola wrote…

This could be nested inside if inversion if the question about the invariant below is true.
And could structure this as:

if inversion {
  ...
  if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {
     ...
  }
}

to avoid reading the cluster setting in the common case.

Good call, done.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3262 at r2 (raw file):

Previously, sumeerbhola wrote…

nit: s/went/gone/

Done.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3264 at r2 (raw file):

Previously, sumeerbhola wrote…

Is there an invariant: shouldRequeue => isInversion?
If yes, please document it.

Added a note.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3272 at r2 (raw file):

Previously, sumeerbhola wrote…

what is the -1 case?

Added a comment to explain this:

	// priorityAtEnqueue of -1 is a special case reserved for processing logic to
	// run even if there’s a priority inversion. If the priority is not -1, the
	// range may be re-queued to be processed with the correct priority. It is
	// used for things that call into baseQueue.process without going through the
	// replicate queue. For example, s.ReplicateQueueDryRun or
	// r.scatterRangeAndRandomizeLeases.

See more details in #152512 which plumbs these -1. This PR has been merged, but let me know if you have any comments and I can open another PR to address them.

@wenyihu6 wenyihu6 requested a review from sumeerbhola August 28, 2025 02:26
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Someone who knows this code better should approve

@sumeerbhola reviewed 1 of 1 files at r5, 1 of 1 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @tbg, and @wenyihu6)


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3272 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Added a comment to explain this:

	// priorityAtEnqueue of -1 is a special case reserved for processing logic to
	// run even if there’s a priority inversion. If the priority is not -1, the
	// range may be re-queued to be processed with the correct priority. It is
	// used for things that call into baseQueue.process without going through the
	// replicate queue. For example, s.ReplicateQueueDryRun or
	// r.scatterRangeAndRandomizeLeases.

See more details in #152512 which plumbs these -1. This PR has been merged, but let me know if you have any comments and I can open another PR to address them.

Ack


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3266 at r6 (raw file):

// AllocatorRangeUnavailable, AllocatorNoop pass through since they are noop.
//
// NB: If shouldRequeue is true, isInversion must be true.

nit: I think it is easier for the reader to say something less verbose like
// INVARIANT: shouldRequeue => isInversion

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing some comments. Generally looks good! The priority rounding is a bit of a rangle, but I don't see how to easily avoid that (except if we allow more inversions).

// WithinPriorityRange checks if a priority is within the range of possible
// priorities for the allocator actions.
func withinPriorityRange(priority float64) bool {
return AllocatorNoop.Priority() <= priority && priority <= AllocatorFinalizeAtomicReplicationChange.Priority()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be cleaner to add a sentinel value to the enum and to check on it as the upper bound here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, consider inlining this function. It doesn't make much sense when looked at without context, and it's only used in one place. It also isn't being tested directly, nor would that make enough sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added AllocatorMaxPriority for the upper bound and added a NB on AllocatorNoop should have the lowest priority. Also put this function under CheckPriorityInversion.

priorityAtEnqueue float64, actionAtProcessing AllocatorAction,
) (isInversion bool, shouldRequeue bool) {
// priorityAtEnqueue of -1 is a special case reserved for processing logic to
// run even if there’s a priority inversion. If the priority is not -1, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This middle sentence ("If the priority is not -1, ...") is a bit of a non sequitur. The surrounding paragraphs explain the special case of -1. I'd just say something along the lines of

// NB: priorityAtEnqueue is -1 for callers such as scatter, dry runs, and manual queue runs.
// Priority inversion does not apply to these calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// CheckPriorityInversion checks if the priority at enqueue time is higher than
// the priority corresponding to the action computed at processing time. It
// returns whether there was a priority inversion and whether the caller should
// skip the processing of the range since the inversion is considered unfair.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it "It returns whether there was a priority inversion (and the range should not be processed at this time, since doing so could starve higher-priority items), and whether the caller should re-add the range to the queue (presumably under its new priority)."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the comment.

// AllocatorRemoveLearner, and AllocatorReplaceDeadVoter will be rounded to
// the same priority. They are so close to each other, so we don't really
// count it as an inversion among them.
normPriorityAtEnqueue := roundToNearestPriorityCategory(priorityAtEnqueue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the priority enum with a hard-to-miss message that priorities should be "separated" by multiples of 100 (explaining why) and also update the places where we adjust the priority to explain why each adjustment is ok?

Copy link
Contributor Author

@wenyihu6 wenyihu6 Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a NB comment and a regression unit test. I noticed that we are making some assumptions with this statement, but I think the assumptions should be true. (Update: After some thinking, I don't know if I can say the second assumption should be true. We can revisit them.)

Copy link
Contributor Author

@wenyihu6 wenyihu6 Aug 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: After some thinking, I don't know if I can say the assumptions (especially the second one) should be true. I should revisit them if strong guarantees are required. It should be fine for now since priority inversion re-queuing only care about repair action v.s. consider rebalance. I added a comment to warn new uses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're saying that with sufficiently high replication factors, some of the priorities may be adjusted by more than 50, right? I can live with this, add a comment about it and let's move on.

"kv.priority_inversion_requeue_replicate_queue.enabled",
"whether the requeue replicas should requeue when enqueued for "+
"repair action but ended up consider rebalancing during processing",
false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we land a separate PR on master that flips this to true (so that we get some exposure), and backport only this PR (keep default value false) if we backport?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, added a TODO to remind myself after this PR lands.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put up #152749 which stacks on top of this PR.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbg dismissed @sumeerbhola from 3 discussions.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @sumeerbhola, and @wenyihu6)


pkg/kv/kvserver/replicate_queue.go line 910 at r6 (raw file):

		}

		if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {

It would be good to backport something that is a no-op unless opted into. But this code is not a no-op, the user can only control whether to requeue or skip. Should we make the cluster setting about enabling the inversion detection mechanism as a whole, and not have a way to toggle just the requeue? That way, we can backport the thing as off, our customer can turn it on, and we also turn it on on master so that it will ship with the next release for everyone else. If the requeuing becomes a problem, we'll find out and users can toggle off the entire inversion mechanism as a temporary work-around.


pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

		if shouldRequeue && PriorityInversionRequeue.Get(&rq.store.cfg.Settings.SV) {
			// Return true here to requeue the range. We can't return an error here

We can't return an error but we return an error? Something is off here. I agree (reading the code) that requeuing has no effect if an error is returned.

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replicate_queue.go line 910 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It would be good to backport something that is a no-op unless opted into. But this code is not a no-op, the user can only control whether to requeue or skip. Should we make the cluster setting about enabling the inversion detection mechanism as a whole, and not have a way to toggle just the requeue? That way, we can backport the thing as off, our customer can turn it on, and we also turn it on on master so that it will ship with the next release for everyone else. If the requeuing becomes a problem, we'll find out and users can toggle off the entire inversion mechanism as a temporary work-around.

Added a commit to make PriorityInversionRequeue guard both inversion check and requeuing.


pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We can't return an error but we return an error? Something is off here. I agree (reading the code) that requeuing has no effect if an error is returned.

Stale comment, updated for clarity. Originally, I considered returning requeue = true, err = nil to allow requeuing on error, but instead chose this approach to ensure correct logging and error metric tracking in processReplica and finishProcessingReplica.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 3266 at r6 (raw file):

Previously, sumeerbhola wrote…

nit: I think it is easier for the reader to say something less verbose like
// INVARIANT: shouldRequeue => isInversion

Done.

"kv.priority_inversion_requeue_replicate_queue.enabled",
"whether the requeue replicas should requeue when enqueued for "+
"repair action but ended up consider rebalancing during processing",
false,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, added a TODO to remind myself after this PR lands.

// WithinPriorityRange checks if a priority is within the range of possible
// priorities for the allocator actions.
func withinPriorityRange(priority float64) bool {
return AllocatorNoop.Priority() <= priority && priority <= AllocatorFinalizeAtomicReplicationChange.Priority()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added AllocatorMaxPriority for the upper bound and added a NB on AllocatorNoop should have the lowest priority. Also put this function under CheckPriorityInversion.

// CheckPriorityInversion checks if the priority at enqueue time is higher than
// the priority corresponding to the action computed at processing time. It
// returns whether there was a priority inversion and whether the caller should
// skip the processing of the range since the inversion is considered unfair.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the comment.

priorityAtEnqueue float64, actionAtProcessing AllocatorAction,
) (isInversion bool, shouldRequeue bool) {
// priorityAtEnqueue of -1 is a special case reserved for processing logic to
// run even if there’s a priority inversion. If the priority is not -1, the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// AllocatorRemoveLearner, and AllocatorReplaceDeadVoter will be rounded to
// the same priority. They are so close to each other, so we don't really
// count it as an inversion among them.
normPriorityAtEnqueue := roundToNearestPriorityCategory(priorityAtEnqueue)
Copy link
Contributor Author

@wenyihu6 wenyihu6 Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a NB comment and a regression unit test. I noticed that we are making some assumptions with this statement, but I think the assumptions should be true. (Update: After some thinking, I don't know if I can say the second assumption should be true. We can revisit them.)

"kv.priority_inversion_requeue_replicate_queue.enabled",
"whether the requeue replicas should requeue when enqueued for "+
"repair action but ended up consider rebalancing during processing",
false,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put up #152749 which stacks on top of this PR.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there!

@tbg reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 1 of 1 files at r11, 1 of 1 files at r12, 1 of 1 files at r13, all commit messages.
@tbg dismissed @sumeerbhola from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @wenyihu6)


pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Stale comment, updated for clarity. Originally, I considered returning requeue = true, err = nil to allow requeuing on error, but instead chose this approach to ensure correct logging and error metric tracking in processReplica and finishProcessingReplica.

But won't requeue=true be ignored if there is an error? This gets returned up here

if err != nil {
return false, err
}

and requeue is never acted upon. So what's the point of giving the impression that requeuing will happen when it isn't actually happening? IMO we shouldn't enter this branch on error at all, an error when computing a plan is not an inversion.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 263 at r7 (raw file):

// AllocatorRemoveDeadVoter, and AllocatorRemoveVoter, the priority may be
// adjusted (see ComputeAction for details), but the adjustment is expected to
// be small (<100).

<49, right?


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

// NB: The returned priority may include a small adjustment and therefore might
// not exactly match action.Priority(). See AllocatorAddVoter,
// AllocatorRemoveDeadVoter, AllocatorRemoveVoter below. The adjustment should be <100.

<49


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

// NB: The returned priority may include a small adjustment and therefore might
// not exactly match action.Priority(). See AllocatorAddVoter,
// AllocatorRemoveDeadVoter, AllocatorRemoveVoter below. The adjustment should be <100.

// The adjustment should be <49 (see AllocatorAction.Priority()).


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 997 at r7 (raw file):

// AllocatorRemoveDeadVoter, AllocatorRemoveVoter below. The adjustment should be <100.
//
// The claim that the adjustment is < 100 has two assumptions:

49 and the other numbers below also need touching up.

// AllocatorRemoveLearner, and AllocatorReplaceDeadVoter will be rounded to
// the same priority. They are so close to each other, so we don't really
// count it as an inversion among them.
normPriorityAtEnqueue := roundToNearestPriorityCategory(priorityAtEnqueue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're saying that with sufficiently high replication factors, some of the priorities may be adjusted by more than 50, right? I can live with this, add a comment about it and let's move on.

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @tbg)


pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

won't requeue=true be ignored if there is an error?

I changed it so that we may requeue on errors as well in this commit:

if err != nil {
if requeue {
log.KvDistribution.VEventf(ctx, 1, "re-queuing on errors: %v", err)
rq.maybeAdd(ctx, repl, rq.store.Clock().NowAsClockTimestamp())
}
return false, err
}
. Without this change, requeuing wouldn’t occur unless we return requeue = true, err = nil instead. My intention was to treat priority inversion as an error (for logging and metrics in finishProcessingReplica) while still making it requeueable. Does this seem like a right decision to you?

Good call on we shouldn't enter this branch on error at all. I moved the priority inversion check down below so that we return early if there is a planning error to begin with.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 263 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

<49, right?

Good catch, done.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

<49

Fixed.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 995 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// The adjustment should be <49 (see AllocatorAction.Priority()).

Fixed.


pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 997 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

49 and the other numbers below also need touching up.

Fixed.

@wenyihu6 wenyihu6 requested a review from tbg September 2, 2025 01:36
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbg reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @wenyihu6)


pkg/kv/kvserver/replicate_queue.go line 911 at r6 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

won't requeue=true be ignored if there is an error?

I changed it so that we may requeue on errors as well in this commit:

if err != nil {
if requeue {
log.KvDistribution.VEventf(ctx, 1, "re-queuing on errors: %v", err)
rq.maybeAdd(ctx, repl, rq.store.Clock().NowAsClockTimestamp())
}
return false, err
}
. Without this change, requeuing wouldn’t occur unless we return requeue = true, err = nil instead. My intention was to treat priority inversion as an error (for logging and metrics in finishProcessingReplica) while still making it requeueable. Does this seem like a right decision to you?

Good call on we shouldn't enter this branch on error at all. I moved the priority inversion check down below so that we return early if there is a planning error to begin with.

Oh, I missed that - thank you. The code change looks innocuous enough, so I'm okay with this semantic change as long as we are sure it's only getting hit for this new error return path. I'll resolve this but left a comment on the requeue logic.


pkg/kv/kvserver/replicate_queue.go line 700 at r15 (raw file):

		if err != nil {
			if requeue {

Could this code be restructured as follows:

if requeue { ... }
if err != nil { return false, err }
if testingAggressive.... {}

// deleted the requeue branch that was here
return true, nil

You could also add a comment on why we're requeuing in both cases.

I assume you are confident we weren't requeuing on any other errors before, right? With an eye towards backporting that would be good to know for sure.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbg reviewed 1 of 1 files at r14, 1 of 1 files at r15.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @wenyihu6)

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @tbg)


pkg/kv/kvserver/replicate_queue.go line 700 at r15 (raw file):
Would you be okay if I addressed this in a follow-up PR? This one is getting long, and I assume we’d prefer not to backport the refactoring. I can add the comments now if you prefer.

I assume you are confident we weren't requeuing on any other errors before, right?
Yes, this is the only place before where we return requeue = true

// Otherwise, requeue to see if there is more work to do. As the
// operation succeeded and was planned for a repair action i.e. not
// rebalancing.
requeue = true
and it's with a nil error
return ShouldRequeue(ctx, change, conf), nil
.

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @tbg)


pkg/kv/kvserver/replicate_queue.go line 700 at r15 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Would you be okay if I addressed this in a follow-up PR? This one is getting long, and I assume we’d prefer not to backport the refactoring. I can add the comments now if you prefer.

I assume you are confident we weren't requeuing on any other errors before, right?
Yes, this is the only place before where we return requeue = true

// Otherwise, requeue to see if there is more work to do. As the
// operation succeeded and was planned for a repair action i.e. not
// rebalancing.
requeue = true
and it's with a nil error
return ShouldRequeue(ctx, change, conf), nil
.

I ended up just doing the refactoring in this PR. Doesn't look too complicated to backport.

@wenyihu6 wenyihu6 requested a review from tbg September 3, 2025 00:42
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! :lgtm:

@tbg reviewed 1 of 1 files at r16, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/kv/kvserver/replicate_queue.go line 698 at r16 (raw file):
Just a style nit: rather than "claiming" that only a particular case hits this (which may just not be true forever and which might confuse the reader who may have no context on where or why that happens), I'd say something like

// Most requeuing operations return a nil error, but we intentionally also requeue on error, if this is requested. At the time of writing, priority inversions requeue while also returning an error.

This commit adds a new cluster setting PriorityInversionRequeue that controls
whether the replicate queue should requeue replicas when their priority at
enqueue time differs significantly from their priority at processing time
(e.g. dropping from top 3 to the lowest priority).
Previously, a replica could enter the queue with high priority but, by the time
it was processed, the action planned for this replica may have a low priority,
causing us to perform low priority work. Specifically, we are mostly worried
about cases where the priority changes from any of the repair actions to
consider rebalance. Rebalancing could take a long time and block other ranges
enqueued with actual repair action needed. This commit ensures that such
replicas are requeued instead, avoiding priority inversions.
Previously, replicateQueue used V(2) to log info on priority inverted replicas
because I wanted visibility into every case without missing any replicas. On
reflection, the individual cases aren’t that interesting - it’s the overall
volume that matters, which we can track through metrics. This commit changes it
so that we just rate limit priority inversions every 3 seconds.
This commit improves the comments for PriorityInversionRequeue and clarifies the
contracts around action.Priority().
This commit refactors CheckPriorityInversion.
This commit adds the TestAllocatorPriorityInvariance test, which acts as a
regression safeguard when new actions are added to AllocatorAction, ensuring the
contract is upheld. See action.Priority() and ComputeAction() for more details
on the contract.
…equeue

Previously, we introduced the PriorityInversionRequeue cluster setting, intended
for backport, to handle cases where a range was enqueued with a high-priority
repair action but, at processing time, a low-priority rebalance action was
computed. In such cases, the caller re-adds the range to the queue under its
updated priority. Although the cluster setting guards this requeue behavior, the
inversion check always ran unconditionally, reducing backport safety. This
commit updates the logic so that the cluster setting guard both the inversion
check and the requeue behavior.
Previously, we checked for priority inversion before planning errors, which
meant we could return requeue = true even when a planning error occurred. This
commit changes it so that planning errors should take higher precedence over a
priority inversion error. rq.processOneChange now returns early if there is a
planning error and only check for priority inversion right before applying a
change.
Previously, we checked for requeue right before returning for both nil and
non-nil errors, making the code harder to follow. This commit refactors
replicateQueue.process to requeue replicas before checking for errors.
@wenyihu6 wenyihu6 force-pushed the newpriorityinversion branch from 1fcb9b8 to b632b95 Compare September 3, 2025 12:59
Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @arulajmani and @tbg)


pkg/kv/kvserver/replicate_queue.go line 698 at r16 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just a style nit: rather than "claiming" that only a particular case hits this (which may just not be true forever and which might confuse the reader who may have no context on where or why that happens), I'd say something like

// Most requeuing operations return a nil error, but we intentionally also requeue on error, if this is requested. At the time of writing, priority inversions requeue while also returning an error.

Good point, done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 3, 2025

TFTR!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Sep 3, 2025

👎 Rejected by code reviews

@wenyihu6 wenyihu6 dismissed sumeerbhola’s stale review September 3, 2025 14:40

Got Sumeer's approval to do so since Sumeer already gave his stamp but left the final approval to others.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 3, 2025

TFTR!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Sep 3, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants