Skip to content

Conversation

@wenyihu6
Copy link
Contributor

@wenyihu6 wenyihu6 commented Aug 30, 2025

Stacked on top of #152792
Resolves: #151847
Epic: none


kvserver: improve observability with decommission nudger

Previously, we added the decommissioning nudger which nudges the leaseholder
replica of decommissioning ranges to enqueue themselves into the replicate queue
for decommissioning. However, we are still observing extended decommission stall
with the nudger enabled. Observability was limited, and we could not easily tell
whether replicas were successfully enqueued or processed.

This commit improves observability by adding four metrics to track the enqueue
and processing results of the decommissioning nudger:
ranges.decommissioning.nudger.{enqueue,process}.{success,failure}.


kvserver: add enqueue metrics to base queue

Previously, observability into base queue enqueuing was limited to pending queue
length and process results. This commit adds enqueue-specific metrics for the
replicate queue:

  • queue.replicate.enqueue.add: counts replicas successfully added to the queue
  • queue.replicate.enqueue.failedprecondition: counts replicas that failed the
    replicaCanBeProcessed precondition check
  • queue.replicate.enqueue.noaction: counts replicas skipped because ShouldQueue
    determined no action was needed
  • queue.replicate.enqueue.unexpectederror: counts replicas that were expected to
    be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue)
    but failed due to unexpected errors

kvserver: move bq.enqueueAdd update to be outside of defer

Previously, we updated bq.enqueueAdd inside the defer statement of addInternal.
This was incorrect because we may return queued = true for a replica already
processing and was marked for requeue. That replica would later be requeued in
finishProcessingReplica, incrementing the metric again, lead to double counting.


kvserver: test metrics in TestBaseQueueCallback and TestReplicateQueueDecommissionScannerDisabled*

his commit extends TestBaseQueueCallback* and
TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.

@blathers-crl
Copy link

blathers-crl bot commented Aug 30, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the callbackwithmetrics branch from e4392c3 to 1ef144c Compare August 31, 2025 16:14
@wenyihu6 wenyihu6 changed the title kvserver: add enqueue metrics to base queue kvserver: improve observability with decommission nudger Aug 31, 2025
@wenyihu6 wenyihu6 force-pushed the callbackwithmetrics branch from 1ef144c to 344ecc2 Compare August 31, 2025 17:08
@wenyihu6 wenyihu6 marked this pull request as ready for review August 31, 2025 18:53
@wenyihu6 wenyihu6 requested a review from a team as a code owner August 31, 2025 18:53
@wenyihu6 wenyihu6 requested review from sumeerbhola and tbg and removed request for a team August 31, 2025 18:53
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to clarify the metric semantics better, but directionally 👍

@tbg reviewed 1 of 3 files at r2, 4 of 5 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/metrics.go line 2181 at r6 (raw file):

		Unit:        metric.Unit_COUNT,
	}
	metaReplicateQueueEnqueueFailures = metric.Metadata{

what does it mean to fail to be enqueued? In particular, where does the case where shouldQueue returns false land? That shouldn't really be called a "failure"
This avoids the distinction between failed and skipped which I haven't been able to understand even after reviewing the code.

In my understanding, we are interested in distinguishing three cases:

  • the replica was added to the queue.
  • it was not added to the queue because there was nothing to do for this replica (shouldQueue returned false)
  • it was not added to the queue because one of the many preconditions (zone config available, holds lease, etc, did not hold).

So maybe

  • *.enqueue.accepted
  • *.enqueue.no_action
  • *.enqueue.failed_precondition

I'm not totally in love with these names either, but it might be clearer.


pkg/kv/kvserver/metrics.go line 2190 at r6 (raw file):

	metaReplicateQueueEnqueueSkipped = metric.Metadata{
		Name: "queue.replicate.enqueue.skipped",
		Help: "Number of replicas which didn't attempt to be enqueued but returned " +

what does it mean to "not attempt to be enqueued" but then to be skipped?


pkg/kv/kvserver/queue.go line 829 at r6 (raw file):

) (queued bool, err error) {
	defer func() {
		bq.updateMetricsOnEnqueueResult(queued)

Could you add a note that queued => err == nil, which seems to be true based on reading the method.

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/metrics.go line 2181 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what does it mean to fail to be enqueued? In particular, where does the case where shouldQueue returns false land? That shouldn't really be called a "failure"
This avoids the distinction between failed and skipped which I haven't been able to understand even after reviewing the code.

In my understanding, we are interested in distinguishing three cases:

  • the replica was added to the queue.
  • it was not added to the queue because there was nothing to do for this replica (shouldQueue returned false)
  • it was not added to the queue because one of the many preconditions (zone config available, holds lease, etc, did not hold).

So maybe

  • *.enqueue.accepted
  • *.enqueue.no_action
  • *.enqueue.failed_precondition

I'm not totally in love with these names either, but it might be clearer.

Good points, semantics were unclear. I’ve reverted the commit and added four metrics with definitions to clarify their semantics. Lmk if this aligns with what you had in mind (the only new case here is unexpected error)

- queue.replicate.enqueue.add: counts replicas successfully added to the queue  
- queue.replicate.enqueue.failedprecondition: counts replicas that failed the  
  replicaCanBeProcessed precondition check  
- queue.replicate.enqueue.noaction: counts replicas skipped because ShouldQueue  
  determined no action was needed  
- queue.replicate.enqueue.unexpectederror: counts replicas that were expected to  
  be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue)  
  but failed due to unexpected errors

pkg/kv/kvserver/metrics.go line 2190 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what does it mean to "not attempt to be enqueued" but then to be skipped?

Agree that the wording was confusing. Hope it's clearer now.


pkg/kv/kvserver/queue.go line 829 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you add a note that queued => err == nil, which seems to be true based on reading the method.

Added.

@wenyihu6 wenyihu6 requested a review from tbg September 3, 2025 00:32
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbg reviewed 4 of 4 files at r7, 3 of 3 files at r8, 4 of 4 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):

// 7. processing: the replica is already being processed and not enqueued again.
// 8. full queue: the queue is full and the replica is not enqueued again.
func TestBaseQueueCallbackOnEnqueueResult(t *testing.T) {

Can the new metrics be sanity-checked in this test as well?

Copy link
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can the new metrics be sanity-checked in this test as well?

I found a double counting flaw in my metrics updates while adding the tests - we might return queued = true from addInternal for a processing replica that was marked as requeued. I pushed a fix for it as a follow up if you wanna have a look.

@wenyihu6 wenyihu6 requested a review from tbg September 3, 2025 18:18
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 3, 2025

pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I found a double counting flaw in my metrics updates while adding the tests - we might return queued = true from addInternal for a processing replica that was marked as requeued. I pushed a fix for it as a follow up if you wanna have a look.

Perhaps we shouldn't return queued = true in this case.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbg reviewed 1 of 1 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)

@wenyihu6 wenyihu6 force-pushed the callbackwithmetrics branch 2 times, most recently from 5efbfb3 to fd61f9d Compare September 4, 2025 10:56
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 4, 2025

Rebased on master.

@wenyihu6 wenyihu6 force-pushed the callbackwithmetrics branch 2 times, most recently from c66ef31 to 121195d Compare September 4, 2025 12:18
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 4, 2025

Last two pushes fixed a linter failure and removed a commit along with its revert commit to make backports cleaner.

Previously, we added the decommissioning nudger which nudges the leaseholder
replica of decommissioning ranges to enqueue themselves into the replicate queue
for decommissioning. However, we are still observing extended decommission stall
with the nudger enabled. Observability was limited, and we could not easily tell
whether replicas were successfully enqueued or processed.

This commit improves observability by adding four metrics to track the enqueue
and processing results of the decommissioning nudger:
ranges.decommissioning.nudger.{enqueue,process}.{success,failure}.
Previously, observability into base queue enqueuing was limited to pending queue
length and process results. This commit adds enqueue-specific metrics for the
replicate queue:

- queue.replicate.enqueue.add: counts replicas successfully added to the queue
- queue.replicate.enqueue.failedprecondition: counts replicas that failed the
  replicaCanBeProcessed precondition check
- queue.replicate.enqueue.noaction: counts replicas skipped because ShouldQueue
  determined no action was needed
- queue.replicate.enqueue.unexpectederror: counts replicas that were expected to
  be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue)
  but failed due to unexpected errors
Previously, we updated bq.enqueueAdd inside the defer statement of addInternal.
This was incorrect because we may return queued = true for a replica already
processing and was marked for requeue. That replica would later be requeued in
finishProcessingReplica, incrementing the metric again, lead to double counting.
…ueDecommissionScannerDisabled

his commit extends TestBaseQueueCallback* and
TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.
@wenyihu6 wenyihu6 force-pushed the callbackwithmetrics branch from 121195d to 28e1dc1 Compare September 4, 2025 15:58
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 4, 2025

Rebased on master to pick up #152967 which caused flakes on CI.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Sep 4, 2025

TFTR!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Sep 4, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: improve observability around replicate queue

3 participants