-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kvserver: improve observability with decommission nudger #152787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
e4392c3 to
1ef144c
Compare
1ef144c to
344ecc2
Compare
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to clarify the metric semantics better, but directionally 👍
@tbg reviewed 1 of 3 files at r2, 4 of 5 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/metrics.go line 2181 at r6 (raw file):
Unit: metric.Unit_COUNT, } metaReplicateQueueEnqueueFailures = metric.Metadata{
what does it mean to fail to be enqueued? In particular, where does the case where shouldQueue returns false land? That shouldn't really be called a "failure"
This avoids the distinction between failed and skipped which I haven't been able to understand even after reviewing the code.
In my understanding, we are interested in distinguishing three cases:
- the replica was added to the queue.
- it was not added to the queue because there was nothing to do for this replica (shouldQueue returned false)
- it was not added to the queue because one of the many preconditions (zone config available, holds lease, etc, did not hold).
So maybe
*.enqueue.accepted*.enqueue.no_action*.enqueue.failed_precondition
I'm not totally in love with these names either, but it might be clearer.
pkg/kv/kvserver/metrics.go line 2190 at r6 (raw file):
metaReplicateQueueEnqueueSkipped = metric.Metadata{ Name: "queue.replicate.enqueue.skipped", Help: "Number of replicas which didn't attempt to be enqueued but returned " +
what does it mean to "not attempt to be enqueued" but then to be skipped?
pkg/kv/kvserver/queue.go line 829 at r6 (raw file):
) (queued bool, err error) { defer func() { bq.updateMetricsOnEnqueueResult(queued)
Could you add a note that queued => err == nil, which seems to be true based on reading the method.
wenyihu6
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/metrics.go line 2181 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
what does it mean to fail to be enqueued? In particular, where does the case where
shouldQueuereturns false land? That shouldn't really be called a "failure"
This avoids the distinction betweenfailedandskippedwhich I haven't been able to understand even after reviewing the code.In my understanding, we are interested in distinguishing three cases:
- the replica was added to the queue.
- it was not added to the queue because there was nothing to do for this replica (shouldQueue returned false)
- it was not added to the queue because one of the many preconditions (zone config available, holds lease, etc, did not hold).
So maybe
*.enqueue.accepted*.enqueue.no_action*.enqueue.failed_preconditionI'm not totally in love with these names either, but it might be clearer.
Good points, semantics were unclear. I’ve reverted the commit and added four metrics with definitions to clarify their semantics. Lmk if this aligns with what you had in mind (the only new case here is unexpected error)
- queue.replicate.enqueue.add: counts replicas successfully added to the queue
- queue.replicate.enqueue.failedprecondition: counts replicas that failed the
replicaCanBeProcessed precondition check
- queue.replicate.enqueue.noaction: counts replicas skipped because ShouldQueue
determined no action was needed
- queue.replicate.enqueue.unexpectederror: counts replicas that were expected to
be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue)
but failed due to unexpected errors
pkg/kv/kvserver/metrics.go line 2190 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
what does it mean to "not attempt to be enqueued" but then to be skipped?
Agree that the wording was confusing. Hope it's clearer now.
pkg/kv/kvserver/queue.go line 829 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you add a note that
queued => err == nil, which seems to be true based on reading the method.
Added.
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg reviewed 4 of 4 files at r7, 3 of 3 files at r8, 4 of 4 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):
// 7. processing: the replica is already being processed and not enqueued again. // 8. full queue: the queue is full and the replica is not enqueued again. func TestBaseQueueCallbackOnEnqueueResult(t *testing.T) {
Can the new metrics be sanity-checked in this test as well?
wenyihu6
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/kv/kvserver/queue_test.go line 1301 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can the new metrics be sanity-checked in this test as well?
I found a double counting flaw in my metrics updates while adding the tests - we might return queued = true from addInternal for a processing replica that was marked as requeued. I pushed a fix for it as a follow up if you wanna have a look.
|
Previously, wenyihu6 (Wenyi Hu) wrote…
Perhaps we shouldn't return queued = true in this case. |
tbg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg reviewed 1 of 1 files at r11, 2 of 2 files at r12, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
5efbfb3 to
fd61f9d
Compare
|
Rebased on master. |
c66ef31 to
121195d
Compare
|
Last two pushes fixed a linter failure and removed a commit along with its revert commit to make backports cleaner. |
Previously, we added the decommissioning nudger which nudges the leaseholder
replica of decommissioning ranges to enqueue themselves into the replicate queue
for decommissioning. However, we are still observing extended decommission stall
with the nudger enabled. Observability was limited, and we could not easily tell
whether replicas were successfully enqueued or processed.
This commit improves observability by adding four metrics to track the enqueue
and processing results of the decommissioning nudger:
ranges.decommissioning.nudger.{enqueue,process}.{success,failure}.
Previously, observability into base queue enqueuing was limited to pending queue length and process results. This commit adds enqueue-specific metrics for the replicate queue: - queue.replicate.enqueue.add: counts replicas successfully added to the queue - queue.replicate.enqueue.failedprecondition: counts replicas that failed the replicaCanBeProcessed precondition check - queue.replicate.enqueue.noaction: counts replicas skipped because ShouldQueue determined no action was needed - queue.replicate.enqueue.unexpectederror: counts replicas that were expected to be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue) but failed due to unexpected errors
Previously, we updated bq.enqueueAdd inside the defer statement of addInternal. This was incorrect because we may return queued = true for a replica already processing and was marked for requeue. That replica would later be requeued in finishProcessingReplica, incrementing the metric again, lead to double counting.
…ueDecommissionScannerDisabled his commit extends TestBaseQueueCallback* and TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.
121195d to
28e1dc1
Compare
|
Rebased on master to pick up #152967 which caused flakes on CI. |
|
TFTR! bors r=tbg |
Stacked on top of #152792
Resolves: #151847
Epic: none
kvserver: improve observability with decommission nudger
Previously, we added the decommissioning nudger which nudges the leaseholder
replica of decommissioning ranges to enqueue themselves into the replicate queue
for decommissioning. However, we are still observing extended decommission stall
with the nudger enabled. Observability was limited, and we could not easily tell
whether replicas were successfully enqueued or processed.
This commit improves observability by adding four metrics to track the enqueue
and processing results of the decommissioning nudger:
ranges.decommissioning.nudger.{enqueue,process}.{success,failure}.
kvserver: add enqueue metrics to base queue
Previously, observability into base queue enqueuing was limited to pending queue
length and process results. This commit adds enqueue-specific metrics for the
replicate queue:
replicaCanBeProcessed precondition check
determined no action was needed
be enqueued (ShouldQueue returned true or the caller attempted a direct enqueue)
but failed due to unexpected errors
kvserver: move bq.enqueueAdd update to be outside of defer
Previously, we updated bq.enqueueAdd inside the defer statement of addInternal.
This was incorrect because we may return queued = true for a replica already
processing and was marked for requeue. That replica would later be requeued in
finishProcessingReplica, incrementing the metric again, lead to double counting.
kvserver: test metrics in TestBaseQueueCallback and TestReplicateQueueDecommissionScannerDisabled*
his commit extends TestBaseQueueCallback* and
TestReplicateQueueDecommissionScannerDisabled to also verify metric updates.