NFD Master memory leak #1614

Closed
tatodorov opened this issue Mar 14, 2024 · 10 comments
Labels: kind/bug, priority/critical-urgent

tatodorov commented Mar 14, 2024

What happened:
Node Feature Discovery Master continuously consumes more memory and never releases it.
I am running NVIDIA GPU Operator 23.9.2 and Node Feature Discovery 0.14.4 on Kubernetes 1.24.6.
Over a period of one hour I watched the NFD master reach 2 GB of memory.
I had to set a memory limit, since a couple of times it exhausted the entire memory of the host.
I also configured NFD GC to run garbage collection every minute, but this did not release any memory.
I have since removed the GPU Operator, the NFD workers, and NFD GC, yet the memory usage of NFD Master keeps increasing.
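For reference, the memory limit was applied roughly along these lines on the nfd-master container (a minimal sketch only; the container name and sizes below are illustrative, not the exact values rendered by the GPU Operator chart):

# Illustrative snippet: memory request/limit on the nfd-master container.
# The container name and sizes here are assumptions for this example.
spec:
  template:
    spec:
      containers:
        - name: master
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "2Gi"
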
Every minute, I can see the following in NFD Master's log:

I0314 12:21:30.594297       1 nfd-master.go:280] "reloading configuration"
I0314 12:21:30.594616       1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0314 12:21:30.594887       1 nfd-master.go:1274] "configuration successfully updated" configuration=<
        DenyLabelNs: {}
        EnableTaints: false
        ExtraLabelNs:
          nvidia.com: {}
        Klog: {}
        LabelWhiteList: {}
        LeaderElection:
          LeaseDuration:
            Duration: 15000000000
          RenewDeadline:
            Duration: 10000000000
          RetryPeriod:
            Duration: 2000000000
        NfdApiParallelism: 10
        NoPublish: false
        ResourceLabels: {}
        ResyncPeriod:
          Duration: 3600000000000
 >
I0314 12:21:30.594905       1 nfd-master.go:287] "stopping the nfd api controller"
I0314 12:21:30.594915       1 nfd-master.go:1338] "starting the nfd api controller"
I0314 12:21:30.595201       1 node-updater-pool.go:106] "stopping the NFD master node updater pool"
I0314 12:21:30.596167       1 node-updater-pool.go:81] "starting the NFD master node updater pool" parallelism=10
I0314 12:21:31.931613       1 nfd-master.go:694] "will process all nodes in the cluster"

This is the content of the ConfigMap mounted to NFD Master:

data:
  nfd-master.conf: |-
    extraLabelNs:
    - nvidia.com
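
For completeness, the full ConfigMap manifest would look roughly like this (the metadata name and namespace below are only placeholders for whatever the GPU Operator chart actually creates):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nfd-master-conf        # placeholder name
  namespace: gpu-operator      # placeholder namespace
data:
  nfd-master.conf: |-
    extraLabelNs:
    - nvidia.com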

What you expected to happen:
NFD Master to maintain steady memory usage.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.24.6
  • Cloud provider or hardware configuration: 4x AMD EPYC 7543 - 128 cores in total
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.3 LTS (Jammy Jellyfish)
  • Kernel (e.g. uname -a): 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others: NVIDIA GPU Operator 23.9.2
tatodorov added the kind/bug label on Mar 14, 2024
ArangoGutierrez (Contributor) commented:

Thanks for the report @tatodorov, @marquiz is looking into it. We will provide updates as soon as possible.

ArangoGutierrez added the priority/critical-urgent label on Mar 14, 2024

marquiz commented Mar 14, 2024

Thanks @tatodorov for reporting this (with a detailed description). On a quick analysis/testing, #1615 should fix this.

This is probably not commonly encountered, as most deployments do not have frequent nfd-master config file updates (which hides the problem).

ArangoGutierrez (Contributor) commented:

Hi @tatodorov, thanks again for reporting this issue. A fix has been merged into the release branch; would you help us test it in your environment before we cut a patch release?


ArangoGutierrez commented Mar 14, 2024

The image should be gcr.io/k8s-staging-nfd/node-feature-discovery:release-0.14 once https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-node-feature-discovery-push-images/1768301977449533440 runs to completion.
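
To try it in your environment, you could point the nfd-master Deployment at that staging image, roughly like this (a sketch only; the container name is an assumption, so adjust it to match your deployment):

# Illustrative Deployment snippet: swap in the staging image for testing.
# The container name "nfd-master" is assumed here.
spec:
  template:
    spec:
      containers:
        - name: nfd-master
          image: gcr.io/k8s-staging-nfd/node-feature-discovery:release-0.14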

tatodorov (Author) commented:

@ArangoGutierrez and @marquiz, thank you very much for your prompt reply!
I applied the patch from PR #1622 and built a new 0.14.4 binary and container image.
I am currently running the patched build and, so far, I don't observe a memory leak.
I will be able to confirm that for sure after running it for a couple of hours.

ArangoGutierrez (Contributor) commented:

Looking forward to your overnight report @tatodorov


marquiz commented Mar 15, 2024

Looking forward to your overnight report @tatodorov

Yeah. @tatodorov we're prepared to cut new patch release(s) quickly once we get the good-to-go signal. Based on my own testing, the issue looks to be fixed.

tatodorov (Author) commented:

@ArangoGutierrez, @marquiz I haven't observed the memory leak anymore.
So, from my point of view, PR #1617 resolves this issue.

Thank you very much for your assistance!


marquiz commented Mar 15, 2024

@tatodorov NFD v0.14.5 (and v0.15.3), containing the fix, has been released. I'm closing this issue now. Please re-open (or create a new issue) if you encounter any further issues.
/close

k8s-ci-robot (Contributor) commented:

@marquiz: Closing this issue.

In response to this:

@tatodorov NFD v0.14.5 (and v0.15.3), containing the fix, has been released. I'm closing this issue now. Please re-open (or create a new issue) if you encounter any further issues.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
