Add error counter for k8s client #11774

Merged
merged 2 commits into from
Dec 20, 2023

@adleong adleong commented Dec 15, 2023

We add an http_client_errors_total counter metric to control plane components to count requests to the k8s API that fail without receiving a response.

@adleong adleong requested a review from a team as a code owner December 15, 2023 22:30
@@ -62,6 +62,14 @@ var (
[]string{"client", "code", "method"},
)

clientErrorCounter = prometheus.NewCounterVec(

I think clientErrorCounter needs to be registered in the init() function in order to be accounted for.

Signed-off-by: Alex Leong <[email protected]>

adleong commented Dec 19, 2023

With some manual testing, I can see this counter increase when the connection to the k8s API is interrupted.

Surprisingly, the logs are somewhat light in terms of information when this happens. At debug level we see messages like:

INFO[2023-12-18T16:33:33-08:00] GET https://0.0.0.0:32929/apis/discovery.k8s.io/v1/endpointslices?allowWatchBookmarks=true&resourceVersion=753732&timeout=8m14s&timeoutSeconds=494&watch=true  in 0 milliseconds

showing that API responses are returning in 0 milliseconds, but not indicating that this is because of a failed connection.

We have to increase the log level to trace before we see:

INFO[2023-12-18T16:34:54-08:00] curl -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: main/v0.0.0 (darwin/amd64) kubernetes/$Format" 'https://0.0.0.0:32929/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=753707&timeout=7m5s&timeoutSeconds=425&watch=true'
INFO[2023-12-18T16:34:54-08:00] HTTP Trace: Dial to tcp:0.0.0.0:32929 failed: dial tcp 0.0.0.0:32929: connect: connection refused
INFO[2023-12-18T16:34:54-08:00] GET https://0.0.0.0:32929/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=753707&timeout=7m5s&timeoutSeconds=425&watch=true  in 0 milliseconds

showing that the connection was refused.

@alpeb alpeb left a comment

LGTM 👍

@adleong adleong merged commit e59ae0f into main Dec 20, 2023
33 checks passed
@adleong adleong deleted the alex/k8s-client-errors branch December 20, 2023 17:34
@adleong adleong mentioned this pull request Dec 20, 2023
adleong added a commit that referenced this pull request Dec 20, 2023
This edge release contains improvements to the logging and diagnostics of the
destination controller.

* Added a control plane metric to count errors talking to the Kubernetes API
  ([#11774])
* Fixed an issue causing spurious destination controller error messages for
  profile lookups on unmeshed pods with port in default opaque list ([#11550])

[#11774]: #11774
[#11550]: #11550

Signed-off-by: Alex Leong <[email protected]>