node-feature-discovery sends excessive LIST requests to the API server #1891

Open · jslouisyou opened this issue Sep 30, 2024 · 0 comments · Label: kind/bug (Categorizes issue or PR as related to a bug.)

What happened: node-feature-discovery, deployed by gpu-operator, sends excessive LIST requests to the API server.

What you expected to happen:
Recently I received several alerts from my Kubernetes cluster indicating that the API server takes a very long time to serve LIST requests from gpu-operator. Here are the alert and the rule that I'm using:

  • Alert: Long API server 99%-tile Latency (LIST: 29.90 seconds for an nfd.k8s-sigs.io/v1alpha1/nodefeatures request)
  • Rule: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!~"(log|exec|portforward|proxy)",verb!~"^(?:CONNECT|WATCHLIST|WATCH)$"} [10m])) WITHOUT (instance)) > 10
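
To see where the load is actually coming from, a per-verb request-rate breakdown for the NFD API group can help. This is only a sketch against the standard apiserver_request_total metric; the exact label values (group, resource, verb) are assumptions and should be checked against what your API server exposes:

  # Request rate per verb/resource for the nfd.k8s-sigs.io group over the last 10 minutes
  sum by (verb, resource) (
    rate(apiserver_request_total{group="nfd.k8s-sigs.io", resource="nodefeatures"}[10m])
  )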

I also found that all gpu-operator-node-feature-discovery-worker pods send GET requests to the API server to query the nodefeatures resource (I assume each pod needs this to fetch information about node labels). Here's the relevant part of the audit log:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"df926f36-8c1f-488e-ac88-11690e24660a","stage":"ResponseComplete","requestURI":"/apis/nfd.k8s-sigs.io/v1alpha1/namespaces/gpu-operator/nodefeatures/sra100-033","verb":"get","user":{"username":"system:serviceaccount:gpu-operator:node-feature-discovery","uid":"da2306ea-536f-455d-bf18-817299dd5489","groups":["system:serviceaccounts","system:serviceaccounts:gpu-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["gpu-operator-node-feature-discovery-worker-49qq6"],"authentication.kubernetes.io/pod-uid":["65dfb997-221e-4a5c-92df-7ff111ea6137"]}},"sourceIPs":["75.17.103.53"],"userAgent":"nfd-worker/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"nodefeatures","namespace":"gpu-operator","name":"sra100-033","apiGroup":"nfd.k8s-sigs.io","apiVersion":"v1alpha1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2024-08-07T01:35:20.355504Z","stageTimestamp":"2024-08-07T01:35:20.676700Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"gpu-operator-node-feature-discovery\" of ClusterRole \"gpu-operator-node-feature-discovery\" to ServiceAccount \"node-feature-discovery/gpu-operator\""}}

It seems strange to me that processing LIST requests takes this long when my k8s cluster has only 300 GPU nodes, and I also don't understand why the node-feature-discovery-worker pods send a GET request every minute.

Do you have any information about this problem?
If there are any parameters that can be tuned, or if you could share any ideas, I would be very grateful.
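
For reference, this is the kind of tuning I have in mind: a minimal sketch of gpu-operator Helm values, assuming the chart forwards worker.config to the bundled NFD subchart and that core.sleepInterval controls how often each worker re-scans and updates its NodeFeature object (both are assumptions on my side, not verified against v23.3.2):

  # values.yaml snippet (hypothetical key layout; depends on the gpu-operator chart version)
  node-feature-discovery:
    worker:
      config:
        core:
          sleepInterval: 300s   # assumed default is 60s; a longer interval should mean fewer nodefeatures updates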

Thanks!

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: node-feature-discovery was deployed as part of the NVIDIA gpu-operator installation. I used gpu-operator v23.3.2.

Environment:

  • Kubernetes version (use kubectl version): k8s v1.21.6, v1.29.5
  • Cloud provider or hardware configuration: On-premise
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.4 LTS
  • Kernel (e.g. uname -a): 5.4.0-113-generic
  • Install tools: nfd was deployed by gpu-operator from NVIDIA
  • Network plugin and version (if this is a network-related bug): calico