kvserver: improve observability with decommission nudger #152787

wenyihu6 · 2025-08-30T17:36:05Z

Stacked on top of #152792
Resolves: #151847
Epic: none

kvserver: improve observability with decommission nudger

Previously, we added the decommissioning nudger, which nudges the leaseholder
replicas of decommissioning ranges to enqueue themselves into the replicate
queue for decommissioning. However, we are still observing extended decommission
stalls with the nudger enabled. Observability was limited, and we could not
easily tell whether replicas were successfully enqueued or processed.

This commit improves observability by adding four metrics to track the enqueue
and processing results of the decommissioning nudger:
ranges.decommissioning.nudger.{enqueue,process}.{success,failure}.

kvserver: add enqueue metrics to base queue

Previously, observability into base queue enqueuing was limited to pending queue
length and process results. This commit adds enqueue-specific metrics for the
replicate queue:

- queue.replicate.enqueue.add: counts replicas successfully added to the queue
- queue.replicate.enqueue.failedprecondition: counts replicas that failed the
  replicaCanBeProcessed precondition check
- queue.replicate.enqueue.noaction: counts replicas skipped because ShouldQueue
  determined no action was needed
- queue.replicate.enqueue.unexpectederror: counts replicas that were expected to
  be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue)
  but failed due to unexpected errors

kvserver: move bq.enqueueAdd update to be outside of defer

Previously, we updated bq.enqueueAdd inside the defer statement of addInternal.
This was incorrect because we may return queued = true for a replica that is
already processing and was marked for requeue. That replica would later be
requeued in finishProcessingReplica, incrementing the metric again and leading
to double counting.

kvserver: test metrics in TestBaseQueueCallback* and TestReplicateQueueDecommissionScannerDisabled

This commit extends TestBaseQueueCallback* and
TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.

blathers-crl · 2025-08-30T17:36:10Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

cockroach-teamcity · 2025-08-30T17:36:25Z

This change is

tbg

Would like to clarify the metric semantics better, but directionally 👍

@tbg reviewed 1 of 3 files at r2, 4 of 5 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

pkg/kv/kvserver/metrics.go line 2181 at r6 (raw file):

		Unit:        metric.Unit_COUNT,
	}
	metaReplicateQueueEnqueueFailures = metric.Metadata{

what does it mean to fail to be enqueued? In particular, where does the case where shouldQueue returns false land? That shouldn't really be called a "failure"
This avoids the distinction between failed and skipped which I haven't been able to understand even after reviewing the code.

In my understanding, we are interested in distinguishing three cases:

- the replica was added to the queue.
- it was not added to the queue because there was nothing to do for this replica (shouldQueue returned false).
- it was not added to the queue because one of the many preconditions (zone config available, holds lease, etc.) did not hold.

So maybe

- *.enqueue.accepted
- *.enqueue.no_action
- *.enqueue.failed_precondition

I'm not totally in love with these names either, but it might be clearer.

pkg/kv/kvserver/metrics.go line 2190 at r6 (raw file):

	metaReplicateQueueEnqueueSkipped = metric.Metadata{
		Name: "queue.replicate.enqueue.skipped",
		Help: "Number of replicas which didn't attempt to be enqueued but returned " +

what does it mean to "not attempt to be enqueued" but then to be skipped?

pkg/kv/kvserver/queue.go line 829 at r6 (raw file):

) (queued bool, err error) {
	defer func() {
		bq.updateMetricsOnEnqueueResult(queued)

Could you add a note that queued => err == nil, which seems to be true based on reading the method.

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)

pkg/kv/kvserver/metrics.go line 2181 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what does it mean to fail to be enqueued? In particular, where does the case where shouldQueue returns false land? That shouldn't really be called a "failure"
This avoids the distinction between failed and skipped which I haven't been able to understand even after reviewing the code.

In my understanding, we are interested in distinguishing three cases:

- the replica was added to the queue.
- it was not added to the queue because there was nothing to do for this replica (shouldQueue returned false).
- it was not added to the queue because one of the many preconditions (zone config available, holds lease, etc.) did not hold.

So maybe

- *.enqueue.accepted
- *.enqueue.no_action
- *.enqueue.failed_precondition

I'm not totally in love with these names either, but it might be clearer.

Good points, semantics were unclear. I’ve reverted the commit and added four metrics with definitions to clarify their semantics. Lmk if this aligns with what you had in mind (the only new case here is unexpected error)

- queue.replicate.enqueue.add: counts replicas successfully added to the queue  
- queue.replicate.enqueue.failedprecondition: counts replicas that failed the  
  replicaCanBeProcessed precondition check  
- queue.replicate.enqueue.noaction: counts replicas skipped because ShouldQueue  
  determined no action was needed  
- queue.replicate.enqueue.unexpectederror: counts replicas that were expected to  
  be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue)  
  but failed due to unexpected errors

pkg/kv/kvserver/metrics.go line 2190 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what does it mean to "not attempt to be enqueued" but then to be skipped?

Agree that the wording was confusing. Hope it's clearer now.

pkg/kv/kvserver/queue.go line 829 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you add a note that queued => err == nil, which seems to be true based on reading the method.

Added.

tbg

@tbg reviewed 4 of 4 files at r7, 3 of 3 files at r8, 4 of 4 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):

// 7. processing: the replica is already being processed and not enqueued again.
// 8. full queue: the queue is full and the replica is not enqueued again.
func TestBaseQueueCallbackOnEnqueueResult(t *testing.T) {

Can the new metrics be sanity-checked in this test as well?

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)

pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can the new metrics be sanity-checked in this test as well?

I found a double counting flaw in my metrics updates while adding the tests - we might return queued = true from addInternal for a processing replica that was marked as requeued. I pushed a fix for it as a follow up if you wanna have a look.

wenyihu6 · 2025-09-03T18:21:11Z

pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I found a double counting flaw in my metrics updates while adding the tests - we might return queued = true from addInternal for a processing replica that was marked as requeued. I pushed a fix for it as a follow up if you wanna have a look.

Perhaps we shouldn't return queued = true in this case.

tbg

@tbg reviewed 1 of 1 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

wenyihu6 · 2025-09-04T10:56:59Z

Rebased on master.

wenyihu6 · 2025-09-04T12:19:43Z

Last two pushes fixed a linter failure and removed a commit along with its revert commit to make backports cleaner.

wenyihu6 · 2025-09-04T15:59:18Z

Rebased on master to pick up #152967 which caused flakes on CI.

wenyihu6 · 2025-09-04T17:25:18Z

TFTR!

bors r=tbg

craig · 2025-09-04T18:36:52Z

Build succeeded:

wenyihu6 force-pushed the callbackwithmetrics branch from e4392c3 to 1ef144c Compare August 31, 2025 16:14

wenyihu6 changed the title ~~kvserver: add enqueue metrics to base queue~~ kvserver: improve observability with decommission nudger Aug 31, 2025

wenyihu6 force-pushed the callbackwithmetrics branch from 1ef144c to 344ecc2 Compare August 31, 2025 17:08

wenyihu6 marked this pull request as ready for review August 31, 2025 18:53

wenyihu6 requested a review from a team as a code owner August 31, 2025 18:53

wenyihu6 requested review from sumeerbhola and tbg and removed request for a team August 31, 2025 18:53

wenyihu6 mentioned this pull request Aug 31, 2025

kvserver: track priority inversion in replicate queue metrics #152697

Merged

tbg requested changes Sep 1, 2025

wenyihu6 commented Sep 3, 2025

wenyihu6 requested a review from tbg September 3, 2025 00:32

tbg approved these changes Sep 3, 2025

wenyihu6 commented Sep 3, 2025

wenyihu6 requested a review from tbg September 3, 2025 18:18

tbg approved these changes Sep 4, 2025

wenyihu6 force-pushed the callbackwithmetrics branch 2 times, most recently from 5efbfb3 to fd61f9d Compare September 4, 2025 10:56

wenyihu6 force-pushed the callbackwithmetrics branch 2 times, most recently from c66ef31 to 121195d Compare September 4, 2025 12:18

wenyihu6 added 4 commits September 4, 2025 11:58

kvserver: test metrics in TestBaseQueueCallback* and TestReplicateQue…

28e1dc1

…ueDecommissionScannerDisabled This commit extends TestBaseQueueCallback* and TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.

wenyihu6 force-pushed the callbackwithmetrics branch from 121195d to 28e1dc1 Compare September 4, 2025 15:58

craig bot merged commit f328e00 into cockroachdb:master Sep 4, 2025
32 of 33 checks passed

celeste-cockroachdb bot added the target-release-25.4.0 label Sep 4, 2025

This was referenced Sep 5, 2025

release-25.2.6-rc: kvserver: requeue on priority inversion for replicate queue #153052

Merged

release-24.3.20-rc: kvserver: requeue on priority inversion for replicate queue #153062

Merged

tbg mentioned this pull request Sep 8, 2025

roachtest: allocbench/nodes=7/cpu=8/kv/r=50/ops=skew failed [#152979] #153013

Closed

celeste-cockroachdb bot added v25.4.0-prerelease and removed target-release-25.4.0 labels Sep 22, 2025

This was referenced Dec 17, 2025

release-25.2: kvserver: requeue on priority inversion for replicate #159668

Merged

release-24.3: kvserver: requeue on priority inversion for replicate #159670

Merged

release-25.3: kvserver: requeue on priority inversion for replicate #159695

Merged
