-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
promhttp: count hard request failures in roundtripper metrics #943
promhttp: count hard request failures in roundtripper metrics #943
Conversation
Right now InstrumentRoundTripper{Counter,Duration} don't increase any metrics in case of hard request failures (e.g., TCP, TLS failures). This makes these functions unattractive to use, as those failures are typically among the most interesting ones to alert on. This change extends these functions to count such requests properly. To distinguish them from requests that did yield an actual response, we leave the "code" label empty. This seems the correct thing to do, because hard request failures don't yield a code from the remote side. Signed-off-by: Ed Schouten <[email protected]>
5ba3bfd
to
ca86423
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution!
Hm, we somehow never needed that as such error would usually surface in other means.
Think about the abstraction of round tripper. We have res, err := r.RoundTrip(...)
. If we have no error we have kind of successful roundtrip from point of view of this method, thus we increment that request was made with such code.
If error was returned, technically no request was made. So if our function observes results, there was no result too:
InstrumentRoundTripperCounter is a middleware that wraps the provided
// http.RoundTripper to observe the request result with the provided CounterVec.
If no result was returned error
will be returned. Then users handle errors, usually by incrementing some metric due. If we increment our metric here we risk the anti-pattern of handling error multiple times, especially for existing users of our library.
It really depends on the semantics the user have on this. I am afraid this change might surprise many. Is there away to provide optionality for this feature? 🤔
@@ -64,6 +84,8 @@ func InstrumentRoundTripperCounter(counter *prometheus.CounterVec, next http.Rou | |||
resp, err := next.RoundTrip(r) | |||
if err == nil { | |||
counter.With(labels(code, method, r.Method, resp.StatusCode)).Inc() | |||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just return resp, err
here and avoid using else (opinionated nit).
@@ -93,6 +115,8 @@ func InstrumentRoundTripperDuration(obs prometheus.ObserverVec, next http.RoundT | |||
resp, err := next.RoundTrip(r) | |||
if err == nil { | |||
obs.With(labels(code, method, r.Method, resp.StatusCode)).Observe(time.Since(start).Seconds()) | |||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just return resp, err
here and avoid using else (opinionated nit).
// for the case where a HTTP request failed without getting a server | ||
// response. The "code" label is left empty to distinguish from cases | ||
// where a server response was obtained. | ||
func labelsForError(code, method bool, reqMethod string) prometheus.Labels { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not reusing label
and injecting custom code (e.g -1) that would be a sentinel for us to put empty label ""? Less deduplicate functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"code" suppose to be a valid HTTP code after the latest #962 change. I totally agree that we need to make it more sentinel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the need behind this. And I think it would be nice to have a facility to track underlying transport layer errors. However, there are several things we need to account here:
- First of all, as of know
RoundTripper
description and API suggests it only meant to used for HTTP, so having errors from underlying layers could be confusing. As we already discussed we assume metrics would have HTTP status code and method labels, not having them or overloading their usage would be confusing. - As Bartek suggested this is changing existing behaviour and this could confuse people. Even changing and implicit behaviour could be problematic for the dependants of the library.
Those being said, I think if we steer the direction of the design we can achieve the same goal.
My suggestion would be to provide wrapper instrumented transport layers in client_golang. Such as instrumentedTCPTransport
. So that we can have transport specific metrics and more over users can opt-in to usages of those.
What do you think?
// for the case where a HTTP request failed without getting a server | ||
// response. The "code" label is left empty to distinguish from cases | ||
// where a server response was obtained. | ||
func labelsForError(code, method bool, reqMethod string) prometheus.Labels { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"code" suppose to be a valid HTTP code after the latest #962 change. I totally agree that we need to make it more sentinel.
@EdSchouten Do you want to continue working on this? |
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Right now InstrumentRoundTripper{Counter,Duration} don't increase any
metrics in case of hard request failures (e.g., TCP, TLS failures). This
makes these functions unattractive to use, as those failures are
typically among the most interesting ones to alert on.
This change extends these functions to count such requests properly. To
distinguish them from requests that did yield an actual response, we
leave the "code" label empty. This seems the correct thing to do,
because hard request failures don't yield a code from the remote side.