Metric timer stops publishing and doesn't recover #148

Open
tom59 opened this issue Sep 13, 2024 · 0 comments

What happened: When kubelet returns a 401 or a 500, the timer does not recover from the resulting exception and stops reporting Kubernetes metrics.

What you expected to happen: The metrics should still be publishing on schedule when an error occurs

How to reproduce it (as minimally and precisely as possible):

Stop kubelet for a few seconds on the Kubernetes node and start it again: `systemctl stop kubelet && sleep 30 && systemctl start kubelet`
See the errors in the splunk-metrics-splunk-kubernetes-metrics logs, where the exception is thrown and the timer is detached:

2024-09-10 03:09:27 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:cadvisor_metric_scraper error_class=RestClient::Unauthorized error="401 Unauthorized"
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/abstract_response.rb:249:in `exception_with_response'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/abstract_response.rb:129:in `return!'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:836:in `process_result'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:743:in `block in transmit'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/ruby/net/http.rb:966:in `start'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:727:in `transmit'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:163:in `execute'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:63:in `execute'
  2024-09-10 03:09:27 +0000 [error]: #0 /opt/app-root/src/gem/fluent-plugin-kubernetes-metrics-1.2.3/lib/fluent/plugin/in_kubernetes_metrics.rb:728:in `scrape_cadvisor_metrics'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/fluentd-1.15.3/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run_once'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/fluentd-1.15.3/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/fluentd-1.15.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-09-10 03:09:27 +0000 [error]: #0 Timer detached. title=:cadvisor_metric_scraper

Anything else we need to know?:
This issue is similar to splunk/splunk-connect-for-kubernetes#493.
That ticket worked around the problem by adding a health check on the pod, but the right solution would be for the fluent plugin to recover from that exception in the HTTP client.
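
A minimal sketch of the kind of recovery I have in mind, not the plugin's actual code. The stack trace shows the exception escaping `scrape_cadvisor_metrics` into Fluentd's `on_timer`, which then detaches the timer. Rescuing inside the callback would keep the timer alive so the next tick retries. The helper name `scrape_safely` is hypothetical, and `IOError` stands in for `RestClient::Unauthorized` so the example runs without extra gems.

```ruby
require 'logger'

# Hypothetical helper: run one scrape attempt and swallow the error so it
# never reaches the timer that invoked the callback.
# Returns true on success, false when the attempt failed.
def scrape_safely(title, log)
  yield
  true
rescue StandardError => e
  # Log and return; the next scheduled tick simply tries again.
  log.error("#{title} failed, retrying on next interval: #{e.class}: #{e.message}")
  false
end

# Simulate three timer ticks; the second one hits a kubelet failure.
log = Logger.new($stderr)
attempts = 0
results = 3.times.map do
  scrape_safely(:cadvisor_metric_scraper, log) do
    attempts += 1
    # Stand-in for the failing HTTP call (assumption: any StandardError).
    raise IOError, '401 Unauthorized' if attempts == 2
  end
end
```

In the plugin, the same rescue would presumably wrap the body of each scraper callback registered with the timer helper, so a 401 or 500 from kubelet costs one missed interval instead of stopping metrics until the pod restarts.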

Environment:

  • Kubernetes version (use kubectl version): v1.27.13
  • Ruby version (use ruby --version): 2.6.10p210
  • OS (e.g: cat /etc/os-release): RHEL 9.2
  • Splunk version: splunk-connect-for-kubernetes 1.5.4 and fluent-plugin-kubernetes-metrics 1.2.3
  • Others: