-
Notifications
You must be signed in to change notification settings - Fork 4k
kvserver: track priority inversion in replicate queue metrics #152624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
14f55e5 to
d80f84e
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumeerbhola reviewed 1 of 2 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @tbg)
pkg/kv/kvserver/replicate_queue.go line 322 at r3 (raw file):
Unit: metric.Unit_COUNT, } metaReplicateQueuePriorityInversionForAddVoterCount = metric.Metadata{
(drive-by comments)
- Do we need all the detailed metrics below. Having too many metrics can be a cost. If needed, we should consider using labeled metrics, so that they are easy to deal with in systems that allow labels https://docs.google.com/document/d/1fn5KX0uPUCw1uW4KkCvmP9-svga0F6XSvhnqdg24MpE/edit?tab=t.0#heading=h.9ndo02nsmu
- Do we already have an enqueue counter metric, that one could compare the rate with of the requeue metric? For instance a requeue rate of 1/s may not matter if the enqueue rate is 100/s.
This commit adds a new cluster setting PriorityInversionRequeue that controls whether the replicate queue should requeue replicas when their priority at enqueue time differs significantly from their priority at processing time (e.g. dropping from top 3 to the lowest priority).
Previously, a replica could enter the queue with high priority but, by the time it was processed, the action planned for this replica may have a low priority, causing us to perform low priority work. Specifically, we are mostly worried about cases where the priority changes from any of the repair actions to consider rebalance. Rebalancing could take a long time and block other ranges enqueued with actual repair action needed. This commit ensures that such replicas are requeued instead, avoiding priority inversions.
Previously, replicas could be enqueued at a high priority but end up processing a lower-priority actions, causing priority inversion and unfairness to other replicas behind them that needs a repair action. This commit adds metrics to track such cases. In addition, this commit also adds metrics to track when replicas are requeued in the replicate queue due to a priority inversion from a repair action to a rebalance action.
|
Closing in favour of a new PR. Will address your comments in the new PR. |
Stacked on top of #152596.
Previously, replicas could be enqueued at a high priority but end up processing
a lower-priority actions, causing priority inversion and unfairness to other
replicas behind them that needs a repair action. This commit adds metrics to
track such cases. In addition, this commit also adds metrics to track when
replicas are requeued in the replicate queue due to a priority inversion from a
repair action to a rebalance action.
Part of: #151847
Release note: none