-
Notifications
You must be signed in to change notification settings - Fork 4.6k
stats/otel: Add subchannel metrics (A94) #8738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #8738 +/- ##
==========================================
+ Coverage 83.22% 83.31% +0.08%
==========================================
Files 419 418 -1
Lines 32454 32418 -36
==========================================
- Hits 27009 27008 -1
+ Misses 4057 4024 -33
+ Partials 1388 1386 -2
🚀 New features to boost your workflow:
|
|
A94 states the following: How are we currently handling this in the pickfirst metrics? |
|
A94 states the following: Can you please ensure that we have an issue filed to track the removal of the old metrics and that it captures the correct release where it needs to be removed. |
Here is how pickfirst handles this:
|
internal/xds/xds.go
Outdated
| labels := make(map[string]string) | ||
| labels["grpc.lb.locality"] = locality | ||
| labels["grpc.lb.backend_service"] = cluster | ||
| return labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this can be shorted using the literal initialization syntax in Go for maps:
return map[string]string {
"grpc.lb.locality": locality,
"grpc.lb.backend_service": cluster,
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
clientconn.go
Outdated
| var locality, backendService string | ||
| labelsFromAddress, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string) | ||
| if len(ac.addrs) > 0 && internal.AddressToTelemetryLabels != nil && ok { | ||
| labels := labelsFromAddress(ac.addrs[0]) | ||
| locality = labels["grpc.lb.locality"] | ||
| backendService = labels["grpc.lb.backend_service"] | ||
| } | ||
| return locality, backendService |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also be shorted as follows:
| var locality, backendService string | |
| labelsFromAddress, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string) | |
| if len(ac.addrs) > 0 && internal.AddressToTelemetryLabels != nil && ok { | |
| labels := labelsFromAddress(ac.addrs[0]) | |
| locality = labels["grpc.lb.locality"] | |
| backendService = labels["grpc.lb.backend_service"] | |
| } | |
| return locality, backendService | |
| labelsFunc, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string) | |
| if !ok || len(ac.addrs) == 0 { | |
| return "", "" | |
| } | |
| labels := labelsFunc(ac.addrs[0]) | |
| return labels["grpc.lb.locality"], labels["grpc.lb.backend_service"] |
Follows from the principle of handling errors before proceeding to the rest of the code: See: go/go-style/decisions#indent-error-flow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
clientconn.go
Outdated
| } | ||
| locality, backendService := fetchLabels(ac) | ||
| if ac.state == connectivity.Ready || (ac.state == connectivity.Connecting && s == connectivity.Idle) { | ||
| disconnectionsMetric.Record(ac.cc.metricsRecorderList, 1, ac.cc.target, backendService, locality, "unknown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the disconnect_error label set to unknown here? The gRFC talks about a set of possible values for this label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That requires a lot of plumbing. I will be doing a follow up PR for that.
clientconn.go
Outdated
| if ac.state == s { | ||
| return | ||
| } | ||
| locality, backendService := fetchLabels(ac) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of fetching the labels everytime the connectivity state changes, would it make sense to fetch them when the addrConn is created as part of handling NewSubConn and updating the labels when UpdateAddresses is invoked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Issue filed - #8752 |
Handled as per PR |
| t.Errorf("Unexpected data for metric %v, got: %v, want: %v", "grpc.lb.pick_first.disconnections", got, 0) | ||
| } | ||
|
|
||
| //Checking for subchannel metrics as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Space between the // and the start of the comment, and please terminate comment sentences with a period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| // subchannel won) while its dial is still in-flight, it records exactly one | ||
| // successful attempt. | ||
| func (s) TestPickFirstLeaf_HappyEyeballs_Ignore_Inflight_Cancellations(t *testing.T) { | ||
| t.Log("starting test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| rb.UpdateState(resolver.State{Addresses: addrs}) | ||
| cc.Connect() | ||
|
|
||
| // Make sure we conncet to second subconn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/conncet/connect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| // Wait for the SUCCESS metric to ensure recording logic has processed. | ||
| waitForMetric(ctx, t, tmr, "grpc.subchannel.connection_attempts_succeeded") | ||
|
|
||
| // Verify Success: Exactly 1 (The Winner). | ||
| if got, _ := tmr.Metric("grpc.subchannel.connection_attempts_succeeded"); got != 1 { | ||
| t.Errorf("Unexpected data for metric %v, got: %v, want: 1", "grpc.subchannel.connection_attempts_succeeded", got) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this actually ensure that we check the value of the metric after the first connection attempt is completely processed? We do call holds[0].Resume(), but does that guarantee that the subchannel code sees the connection being successful, but drops it since the subchannel has been deleted by the LB policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are waiting for the metric to be emitted. Connection attempt success will only be emitted if there is a successful connection. In case of cancellation of attempt - it will not be successful and in case of disconnection after establishing connection, it will still be recorded as a disconnection. In both scenarios, the attempts succeeded will always be 1.
|
|
||
| // Verify Failure: Exactly 0 (The Loser was ignored). | ||
| // We poll briefly to ensure no delayed failure metric appears. | ||
| shortCtx, shortCancel := context.WithTimeout(ctx, 50*time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The commonly used variable names for this are sCtx and sCancel. Go generally prefers shorter variable names where appropriate. See: go/go-style/decisions#variable-names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| ac.localityLabel = "" | ||
| ac.backendServiceLabel = "" | ||
| labelsFunc, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string) | ||
| if !ok || len(ac.addrs) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why the len(ac.addrs) == 0 check is required. I see that updateTelemetryLabelsLocked is called from two places:
- NewSubConn
- UpdateAddresses
NewSubConn actually verifies that we cannot have an empty address list. There is no such verification on the UpdateAddresses code path though. Let's start an internal discussion on this to see if we should actually allow the latter to set the addresses to an empty list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO. Some code change will happen when the API changes to allow one address exactly.
| } | ||
|
|
||
| // updateTelemetryLabelsLocked calculates and caches the telemetry labels based on the | ||
| // current addresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the first address is used. Maybe we can clarify that in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| return | ||
| } | ||
|
|
||
| if ac.state == connectivity.Ready || (ac.state == connectivity.Connecting && s == connectivity.Idle) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment here could be useful to clarify to the reader that any transition out of Ready means that we've seen a disconnection, and for the other special case of handling the missing Ready, we can probably include a link to the tracking issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| type securityLevelKey struct{} | ||
|
|
||
| func (ac *addrConn) securityLevel() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be called securityLevelLocked() instead since this assumes that the lock is being held?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| // This records cancelled connection attempts which can be later replaced by a metric. | ||
| channelz.Infof(logger, ac.channelz, "Received context cancelled on subconn, not recording this as a failed connection attempt.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to add a trace event for this as well? channelz.Infof does that. If we just want to log this, then logger.Infof should be good enough and we should guard this with V(2) check.
Also, the log message could be improved. How about something like "Context cancellation detected; not recording this as a failed connection attempt."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Addresses : https://github.com/grpc/proposal/blob/master/A94-subchannel-otel-metrics.md
this PR adds sub channel metrics with applicable labels as per the RFC proposal.
RELEASE NOTES: