Contributor

@mbissa mbissa commented Dec 2, 2025

Addresses: https://github.com/grpc/proposal/blob/master/A94-subchannel-otel-metrics.md

This PR adds subchannel metrics, with the applicable labels, as described in the gRFC.

RELEASE NOTES:

  • stats/otel: add subchannel metrics to eventually replace the pickfirst metrics

@mbissa mbissa added this to the 1.78 Release milestone Dec 2, 2025
@mbissa mbissa requested a review from easwars December 2, 2025 19:58
@mbissa mbissa added the Type: Feature (New features or improvements in behavior) and Area: Observability (Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability) labels Dec 2, 2025

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.31%. Comparing base (432bda3) to head (6441dc1).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8738      +/-   ##
==========================================
+ Coverage   83.22%   83.31%   +0.08%     
==========================================
  Files         419      418       -1     
  Lines       32454    32418      -36     
==========================================
- Hits        27009    27008       -1     
+ Misses       4057     4024      -33     
+ Partials     1388     1386       -2     
Files with missing lines Coverage Δ
clientconn.go 90.51% <100.00%> (+<0.01%) ⬆️
internal/internal.go 100.00% <ø> (ø)
internal/transport/http2_client.go 92.23% <100.00%> (-0.39%) ⬇️
internal/transport/transport.go 88.70% <ø> (ø)
internal/xds/xds.go 80.55% <100.00%> (+5.55%) ⬆️

... and 34 files with indirect coverage changes


Contributor

easwars commented Dec 2, 2025

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Contributor

easwars commented Dec 2, 2025

@mbissa

A94 states the following:

Implementations that have already implemented the pick-first metrics should give 
enough time for users to transition to the new metrics. For example, implementations 
should report both the old pick-first metrics and the new subchannel metrics for
2 releases, and then remove the old pick-first metrics.

Can you please ensure that we have an issue filed to track the removal of the old metrics, and that it captures the correct release in which they need to be removed?

@arjan-bal
Contributor

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Here is how pickfirst handles this:

  1. When a subchannel connects, all remaining subchannels are removed from pickfirst's subchannel map.
    if newState.ConnectivityState == connectivity.Ready {
        connectionAttemptsSucceededMetric.Record(b.metricsRecorder, 1, b.target)
        b.shutdownRemainingLocked(sd)
  2. The updateSubConnState method ignores any updates from subchannels that are not in the subchannel map.
    // Previously relevant SubConns can still callback with state updates.
    // To prevent pickers from returning these obsolete SubConns, this logic
    // is included to check if the current list of active SubConns includes this
    // SubConn.
    if !b.isActiveSCData(sd) {
        return
    }
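
For readers following along, the two steps above can be sketched in isolation. This is a minimal model, not the pickfirst implementation: the `balancer` type, its `active` map, and the two counters are hypothetical stand-ins for the real subchannel map and metric handles.

```go
package main

import "fmt"

// State is a simplified stand-in for gRPC's connectivity.State.
type State int

const (
	Connecting State = iota
	Ready
	TransientFailure
)

// balancer is a hypothetical, heavily simplified pick-first-style policy.
type balancer struct {
	active    map[string]bool // subchannels still tracked, keyed by address
	succeeded int             // stand-in for the "attempts succeeded" metric
	failed    int             // stand-in for the "attempts failed" metric
}

func newBalancer(addrs ...string) *balancer {
	b := &balancer{active: make(map[string]bool)}
	for _, a := range addrs {
		b.active[a] = true
	}
	return b
}

// updateState handles a state update from the subchannel for addr.
func (b *balancer) updateState(addr string, s State) {
	if !b.active[addr] {
		// Obsolete subchannel: drop the update without recording anything.
		return
	}
	switch s {
	case Ready:
		b.succeeded++
		// The winner stays; every other subchannel is forgotten.
		for a := range b.active {
			if a != addr {
				delete(b.active, a)
			}
		}
	case TransientFailure:
		b.failed++
	}
}

func main() {
	b := newBalancer("10.0.0.1:443", "10.0.0.2:443")
	b.updateState("10.0.0.2:443", Ready)            // second address wins
	b.updateState("10.0.0.1:443", TransientFailure) // late update from the loser
	fmt.Println(b.succeeded, b.failed)              // prints: 1 0
}
```

Because the losing subchannel is removed before its late update arrives, neither a failure nor a second success is counted for it.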

Comment on lines 55 to 58
labels := make(map[string]string)
labels["grpc.lb.locality"] = locality
labels["grpc.lb.backend_service"] = cluster
return labels
Contributor

Nit: this can be shortened using Go's map literal syntax:

	return map[string]string{
		"grpc.lb.locality":        locality,
		"grpc.lb.backend_service": cluster,
	}

Contributor Author

done.

clientconn.go Outdated
Comment on lines 1375 to 1382
var locality, backendService string
labelsFromAddress, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string)
if len(ac.addrs) > 0 && internal.AddressToTelemetryLabels != nil && ok {
	labels := labelsFromAddress(ac.addrs[0])
	locality = labels["grpc.lb.locality"]
	backendService = labels["grpc.lb.backend_service"]
}
return locality, backendService
Contributor

This could also be shortened as follows:

Suggested change (replacing the lines quoted above):
labelsFunc, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string)
if !ok || len(ac.addrs) == 0 {
	return "", ""
}
labels := labelsFunc(ac.addrs[0])
return labels["grpc.lb.locality"], labels["grpc.lb.backend_service"]

This follows from the principle of handling errors first, before proceeding to the rest of the code. See: go/go-style/decisions#indent-error-flow
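
A small standalone sketch of why the shortened form can drop the explicit nil check: in Go, a type assertion on a nil interface value reports `ok == false`, so the early return covers the "hook not set" case as well. The `addressToLabels` variable and the string-based address are hypothetical stand-ins for `internal.AddressToTelemetryLabels` and `resolver.Address`.

```go
package main

import "fmt"

// addressToLabels is a hypothetical hook stored as an empty interface, in the
// same way the snippet above treats internal.AddressToTelemetryLabels.
var addressToLabels any

// fetchLabels returns the locality and backend service labels for the first
// address, or empty strings if the hook is unset or there are no addresses.
func fetchLabels(addrs []string) (locality, backendService string) {
	// On a nil interface value the assertion yields ok == false, so no
	// separate nil check is needed before the early return.
	labelsFunc, ok := addressToLabels.(func(string) map[string]string)
	if !ok || len(addrs) == 0 {
		return "", ""
	}
	labels := labelsFunc(addrs[0])
	return labels["grpc.lb.locality"], labels["grpc.lb.backend_service"]
}

func main() {
	l, s := fetchLabels([]string{"10.0.0.1:443"})
	fmt.Printf("hook unset: %q %q\n", l, s)

	addressToLabels = func(addr string) map[string]string {
		return map[string]string{
			"grpc.lb.locality":        "zone-a",
			"grpc.lb.backend_service": "cluster-1",
		}
	}
	l, s = fetchLabels([]string{"10.0.0.1:443"})
	fmt.Printf("hook set:   %q %q\n", l, s)
}
```

One caveat of this pattern: a typed nil function stored in the interface would still pass the assertion, so the sketch assumes the hook is either unset or a valid function.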

Contributor Author

done.

clientconn.go Outdated
}
locality, backendService := fetchLabels(ac)
if ac.state == connectivity.Ready || (ac.state == connectivity.Connecting && s == connectivity.Idle) {
disconnectionsMetric.Record(ac.cc.metricsRecorderList, 1, ac.cc.target, backendService, locality, "unknown")
Contributor

Why is the disconnect_error label set to unknown here? The gRFC talks about a set of possible values for this label.

Contributor Author

That requires a lot of plumbing. I will do that in a follow-up PR.

clientconn.go Outdated
if ac.state == s {
return
}
locality, backendService := fetchLabels(ac)
Contributor

Instead of fetching the labels every time the connectivity state changes, would it make sense to fetch them when the addrConn is created as part of handling NewSubConn, and to update them when UpdateAddresses is invoked?
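
A rough sketch of the caching idea, under stated assumptions: `subConn`, `labelsFor`, and the `lookups` counter are hypothetical and only model the shape of the suggestion (compute once at creation, recompute on address updates, read the cached values everywhere else).

```go
package main

import (
	"fmt"
	"sync"
)

// labelsFor is a hypothetical lookup from an address to its telemetry labels.
func labelsFor(addr string) (locality, backendService string) {
	return "locality-of-" + addr, "service-of-" + addr
}

// subConn is a hypothetical, simplified stand-in for addrConn.
type subConn struct {
	mu             sync.Mutex
	addrs          []string
	locality       string
	backendService string
	lookups        int // how many times the labels were recomputed
}

// updateLabelsLocked recomputes the cached labels from the first address.
// The caller must hold sc.mu.
func (sc *subConn) updateLabelsLocked() {
	sc.locality, sc.backendService = "", ""
	if len(sc.addrs) == 0 {
		return
	}
	sc.lookups++
	sc.locality, sc.backendService = labelsFor(sc.addrs[0])
}

// newSubConn computes the labels once, at creation time.
func newSubConn(addrs []string) *subConn {
	sc := &subConn{addrs: addrs}
	sc.mu.Lock()
	defer sc.mu.Unlock()
	sc.updateLabelsLocked()
	return sc
}

// updateAddresses replaces the address list and refreshes the cached labels.
func (sc *subConn) updateAddresses(addrs []string) {
	sc.mu.Lock()
	defer sc.mu.Unlock()
	sc.addrs = addrs
	sc.updateLabelsLocked()
}

// metricLabels is what a state-change handler would read; it performs no lookup.
func (sc *subConn) metricLabels() string {
	sc.mu.Lock()
	defer sc.mu.Unlock()
	return sc.backendService + "/" + sc.locality
}

func main() {
	sc := newSubConn([]string{"a"})
	for i := 0; i < 3; i++ {
		_ = sc.metricLabels() // simulated state changes
	}
	sc.updateAddresses([]string{"b"})
	fmt.Println(sc.metricLabels(), sc.lookups)
}
```

With this shape, the cost of the label lookup is paid per address change rather than per connectivity state change.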

Contributor Author

done.

@easwars easwars removed their assignment Dec 5, 2025
Contributor Author

mbissa commented Dec 8, 2025

@mbissa

A94 states the following:

Implementations that have already implemented the pick-first metrics should give 
enough time for users to transition to the new metrics. For example, implementations 
should report both the old pick-first metrics and the new subchannel metrics for
2 releases, and then remove the old pick-first metrics.

Can you please ensure that we have an issue filed to track the removal of the old metrics, and that it captures the correct release in which they need to be removed?

Issue filed - #8752

Contributor Author

mbissa commented Dec 8, 2025

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Handled as per PR

t.Errorf("Unexpected data for metric %v, got: %v, want: %v", "grpc.lb.pick_first.disconnections", got, 0)
}

//Checking for subchannel metrics as well
Contributor

Nit: Space between the // and the start of the comment, and please terminate comment sentences with a period.

Contributor Author

Done.

// subchannel won) while its dial is still in-flight, it records exactly one
// successful attempt.
func (s) TestPickFirstLeaf_HappyEyeballs_Ignore_Inflight_Cancellations(t *testing.T) {
t.Log("starting test")
Contributor

Delete this?

Contributor Author

Done.

rb.UpdateState(resolver.State{Addresses: addrs})
cc.Connect()

// Make sure we conncet to second subconn
Contributor

s/conncet/connect

Contributor Author

done.

Comment on lines +2172 to +2178
// Wait for the SUCCESS metric to ensure recording logic has processed.
waitForMetric(ctx, t, tmr, "grpc.subchannel.connection_attempts_succeeded")

// Verify Success: Exactly 1 (The Winner).
if got, _ := tmr.Metric("grpc.subchannel.connection_attempts_succeeded"); got != 1 {
	t.Errorf("Unexpected data for metric %v, got: %v, want: 1", "grpc.subchannel.connection_attempts_succeeded", got)
}
Contributor

How does this actually ensure that we check the value of the metric only after the first connection attempt has been completely processed? We do call holds[0].Resume(), but does that guarantee that the subchannel code sees the connection succeed and then drops it because the subchannel has been deleted by the LB policy?

Contributor Author

@mbissa mbissa Dec 9, 2025

We wait for the metric to be emitted. The connection-attempt success metric is emitted only when a connection is actually established. If the attempt is cancelled, it does not succeed; if the connection is established and then dropped, that is recorded as a disconnection. In both scenarios, the attempts-succeeded count is always 1.


// Verify Failure: Exactly 0 (The Loser was ignored).
// We poll briefly to ensure no delayed failure metric appears.
shortCtx, shortCancel := context.WithTimeout(ctx, 50*time.Millisecond)
Contributor

Nit: The commonly used variable names for this are sCtx and sCancel. Go generally prefers shorter variable names where appropriate. See: go/go-style/decisions#variable-names

Contributor Author

done.

ac.localityLabel = ""
ac.backendServiceLabel = ""
labelsFunc, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string)
if !ok || len(ac.addrs) == 0 {
Contributor

I wonder why the len(ac.addrs) == 0 check is required. I see that updateTelemetryLabelsLocked is called from two places:

  • NewSubConn
  • UpdateAddresses

NewSubConn actually verifies that we cannot have an empty address list. There is no such verification on the UpdateAddresses code path though. Let's start an internal discussion on this to see if we should actually allow the latter to set the addresses to an empty list.

Contributor Author

Added a TODO. This code will change when the API changes to allow exactly one address.

}

// updateTelemetryLabelsLocked calculates and caches the telemetry labels based on the
// current addresses.
Contributor

Only the first address is used. Maybe we can clarify that in the docstring.

Contributor Author

done.

return
}

if ac.state == connectivity.Ready || (ac.state == connectivity.Connecting && s == connectivity.Idle) {
Contributor

A comment here could be useful to clarify to the reader that any transition out of Ready means that we've seen a disconnection, and for the other special case of handling the missing Ready, we can probably include a link to the tracking issue.
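
To make the condition under review easier to reason about, here is a small standalone sketch of it as a pure function. The `State` type and `isDisconnection` are hypothetical; the predicate mirrors the quoted `if` statement, and the reading of the Connecting-to-Idle case as "a Ready that was never reported" is taken from this comment, not verified against the implementation.

```go
package main

import "fmt"

// State is a simplified stand-in for gRPC's connectivity.State.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
	Shutdown
)

// isDisconnection reports whether a transition from oldState to newState
// should be counted as a disconnection.
func isDisconnection(oldState, newState State) bool {
	if oldState == newState {
		return false // not a transition
	}
	// Any transition out of Ready means an established connection was lost.
	if oldState == Ready {
		return true
	}
	// Special case from the condition under review: Connecting -> Idle,
	// described in the review as handling a Ready that was never reported.
	return oldState == Connecting && newState == Idle
}

func main() {
	fmt.Println(isDisconnection(Ready, Idle))                  // true
	fmt.Println(isDisconnection(Connecting, Idle))             // true
	fmt.Println(isDisconnection(Connecting, TransientFailure)) // false
	fmt.Println(isDisconnection(Idle, Connecting))             // false
}
```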

Contributor Author

Done.


type securityLevelKey struct{}

func (ac *addrConn) securityLevel() string {
Contributor

Should this be called securityLevelLocked() instead since this assumes that the lock is being held?
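
For context, a minimal sketch of the naming convention being referred to: a method with the `Locked` suffix assumes the caller already holds the mutex, while the unsuffixed method acquires it. The `conn` type and its fields are hypothetical, not the actual addrConn.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for addrConn.
type conn struct {
	mu       sync.Mutex
	secLevel string
}

// securityLevelLocked returns the security level. The caller must hold c.mu.
func (c *conn) securityLevelLocked() string {
	return c.secLevel
}

// securityLevel acquires c.mu and delegates to the Locked variant.
func (c *conn) securityLevel() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.securityLevelLocked()
}

// setSecurityLevel already holds the lock when it reads the old value, so it
// must call the Locked variant; calling securityLevel here would deadlock.
func (c *conn) setSecurityLevel(level string) (previous string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	previous = c.securityLevelLocked()
	c.secLevel = level
	return previous
}

func main() {
	c := &conn{}
	c.setSecurityLevel("PrivacyAndIntegrity")
	fmt.Println(c.securityLevel())
}
```

The suffix makes the locking precondition visible at every call site, which is the point of the rename.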

Contributor Author

Done.

Comment on lines +1332 to +1333
// This records cancelled connection attempts which can be later replaced by a metric.
channelz.Infof(logger, ac.channelz, "Received context cancelled on subconn, not recording this as a failed connection attempt.")
Contributor

Do we really need to add a trace event for this as well? channelz.Infof does that. If we just want to log this, then logger.Infof should be good enough, and we should guard it with a V(2) check.

Also, the log message could be improved. How about something like "Context cancellation detected; not recording this as a failed connection attempt."?
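
A sketch of the guarded-logging shape suggested here. The `verboseLogger` type is a hypothetical stand-in built on the standard `log` package; it only imitates the `V(level)` and `Infof` shape of a leveled logger and is not the grpclog API itself.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// verboseLogger is a hypothetical leveled logger with a V(level) guard.
type verboseLogger struct {
	verbosity int
	out       *log.Logger
	lines     int // number of messages emitted, for illustration
}

func newVerboseLogger(verbosity int) *verboseLogger {
	return &verboseLogger{verbosity: verbosity, out: log.New(os.Stderr, "INFO: ", 0)}
}

// V reports whether messages at the given level should be logged.
func (l *verboseLogger) V(level int) bool { return level <= l.verbosity }

// Infof logs a formatted message unconditionally.
func (l *verboseLogger) Infof(format string, args ...any) {
	l.lines++
	l.out.Printf(format, args...)
}

// logCancellation logs the cancelled attempt only at verbosity 2 or higher,
// and adds no trace event.
func logCancellation(l *verboseLogger) {
	if l.V(2) {
		l.Infof("Context cancellation detected; not recording this as a failed connection attempt.")
	}
}

func main() {
	quiet, verbose := newVerboseLogger(0), newVerboseLogger(2)
	logCancellation(quiet)
	logCancellation(verbose)
	fmt.Println(quiet.lines, verbose.lines) // prints: 0 1
}
```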

Contributor Author

done.

@easwars easwars removed their assignment Dec 9, 2025