Skip to content

Conversation

@alyshanjahani-crl
Copy link
Collaborator

@alyshanjahani-crl alyshanjahani-crl commented Nov 18, 2025

This commit creates a new kvpb.AdmittedWorkDoneEvent which the kvadmission controller uses to record total CPU time consumed by KV.

This is done via span.RecordStructured.

The cpu time is added to the execstats.QueryLevelStats object and reported in explain analyze. Note that CPUTime has been renamed to SQLCPUTime to be more accurate and better distinguish the various timings recorded on this object.

Fixes: https://cockroachlabs.atlassian.net/browse/CRDB-55927
Release note (sql change): KV cpu time is now included in traces and EXPLAIN ANALYZE.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@alyshanjahani-crl alyshanjahani-crl marked this pull request as ready for review November 19, 2025 15:23
@alyshanjahani-crl alyshanjahani-crl requested review from a team as code owners November 19, 2025 15:23
@alyshanjahani-crl alyshanjahani-crl requested review from DrewKimball and angles-n-daemons and removed request for a team November 19, 2025 15:23
@alyshanjahani-crl
Copy link
Collaborator Author

The idea for this PR is from this commit sumeerbhola@82402f7

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: for the kv changes

@sumeerbhola reviewed 4 of 13 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @angles-n-daemons and @DrewKimball)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :lgtm:

@yuzefovich reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @angles-n-daemons and @DrewKimball)


pkg/kv/kvserver/kvadmission/kvadmission.go line 316 at r1 (raw file):

			kvflowHandle, found := n.kvflowHandles.LookupReplicationAdmissionHandle(ba.RangeID)
			if !found {
				return Handle{}, nil

nit: should we populate cpuStart here as well, for completeness?


pkg/sql/execstats/traceanalyzer.go line 196 at r1 (raw file):

	s.ContentionEvents = append(s.ContentionEvents, other.ContentionEvents...)
	s.RUEstimate += other.RUEstimate
	s.SQLCPUTime += other.SQLCPUTime

nit: would be good to move this line further up to KVCPUTime.


pkg/sql/opt/exec/explain/output.go line 454 at r1 (raw file):

// independent of platform because the grunning library isn't currently
// supported on all platforms.
func (ob *OutputBuilder) AddCPUTime(cpuTime time.Duration) {

nit: we should rename this method to AddSQLCPUTime.

Copy link
Contributor

@angles-n-daemons angles-n-daemons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this looks good on the obs front - just a couple questions about the cpuStart measurement.

cpuStart time.Duration
// Used for measuring post-admission CPU, even for cases that were not
// subject to admission control.
cpuStart time.Duration
Copy link
Contributor

@angles-n-daemons angles-n-daemons Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand better, can you spot check my understanding of this field?

There's some work from: kvBatchServer.Batch -> KV Admitted, which we want to omit from our KV work time reporting, this work is stored as cpuStart on the Handler, the time spent on the cpu by the goroutine up to the point where this field is set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, i'm actually not too sure, but i've asked in KV [1] what the difference is between measuring cpu from the potential call sites (node.Batch , ReplicaSend , and finally this here AdmitKVWork -> AdmittedKVWorkDone)

[1] https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1763667326272419

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, nice thanks for the context

) (_ Handle, retErr error) {
if n.kvAdmissionQ == nil {
return Handle{}, nil
return Handle{cpuStart: grunning.Time()}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set cpuStart on line 316 as well (valid return condition)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now that we're using a defer.

@yuzefovich
Copy link
Member

Oh, how does this change interact with #156895? That change collects KV CPU time for the purposes of top-level query stats reporting (i.e. SQL execution stats) unconditionally whereas here we include things into the trace and include into EXPLAIN ANALYZE output when Structured recording is enabled? Can we not use the collected things from #156895 to do the same though? (BatchResponse fields maybe already show up in the trace (?), and we could use data from topLevelQueryStats to populate EXPLAIN ANALYZE, so this change seems somewhat redundant with that change - am I missing something?)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @DrewKimball, and @yuzefovich)


pkg/kv/kvserver/kvadmission/kvadmission.go line 316 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: should we populate cpuStart here as well, for completeness?

good catch. We could lift the setting of cpuStart to a defer statement.

@alyshanjahani-crl
Copy link
Collaborator Author

alyshanjahani-crl commented Nov 20, 2025

so this change seems somewhat redundant with that change - am I missing something?)

Nope, your understanding is correct!

I'm asking in KV [1] if they'd want this AdmittedWorkDone to contain more info, and if in the case of tracing being enabled, they'd prefer seeing the value from this measurement rather than the stat computed in #156895

Potentially we could close this PR and just put the kvCPUTime from the topLevelQueryStats in EXPLAIN ANALYZE instead as you noted. Let's see what the KV team says.

[1] https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1763667326272419

Copy link
Collaborator Author

@alyshanjahani-crl alyshanjahani-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @angles-n-daemons, @DrewKimball, @sumeerbhola, and @yuzefovich)


pkg/kv/kvserver/kvadmission/kvadmission.go line 316 at r1 (raw file):

Previously, sumeerbhola wrote…

good catch. We could lift the setting of cpuStart to a defer statement.

Ack, now using a defer to set cpuStart when AdmitKVWork is returning successfully (nil error)

) (_ Handle, retErr error) {
if n.kvAdmissionQ == nil {
return Handle{}, nil
return Handle{cpuStart: grunning.Time()}, nil
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now that we're using a defer.

@github-actions
Copy link

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Nov 20, 2025
@alyshanjahani-crl alyshanjahani-crl added the O-AI-Review-Real-Issue-Found AI reviewer found real issue label Nov 21, 2025
@alyshanjahani-crl
Copy link
Collaborator Author

alyshanjahani-crl commented Nov 21, 2025

oh dang nice, AI review caught the bug after i modified to use defer, didn't catch the named return value!

From the AI review output:
In pkg/kv/kvserver/kvadmission/kvadmission.go:297-305, the code attempts to set ah.cpuStart in a deferred function:

func (n *controllerImpl) AdmitKVWork(
	ctx context.Context,
	requestTenantID roachpb.TenantID,
	rangeTenantID roachpb.TenantID,
	ba *kvpb.BatchRequest,
) (_ Handle, retErr error) {
	ah := Handle{}
	defer func() {
		if retErr == nil {
			ah.cpuStart = grunning.Time()
		}
	}()
	// ... rest of function
	return ah, nil
}

This doesn't work as intended because:

Handle is returned by value (not as a named return value - note the _ in the signature)
When return ah, nil executes, Go evaluates ah at that moment and copies its value to the return
The defer runs after this evaluation, modifying the local ah variable
The modification to ah.cpuStart never makes it into the returned Handle

This commit creates a new kvpb.AdmittedWorkDoneEvent which the kvadmission
controller uses to record total CPU time consumed by KV.

This is done via span.RecordStructured.

The cpu time is added to the execstats.QueryLevelStats object and reported in
explain analyze. Note that CPUTime has been renamed to SQLCPUTime to be more
accurate and better distinguish the various timings recorded on this object.

Fixes: https://cockroachlabs.atlassian.net/browse/CRDB-55927
Release note (sql change): KV cpu time is now included in traces and
EXPLAIN ANALYZE.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. O-AI-Review-Real-Issue-Found AI reviewer found real issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants