NFD Master memory leak #1614

Closed
tatodorov opened this issue Mar 14, 2024 · 10 comments
Labels: kind/bug, priority/critical-urgent

tatodorov commented Mar 14, 2024

What happened:
Node Feature Discovery Master continuously consumes more memory and never releases it.
I am running NVIDIA GPU Operator 23.9.2 and Node Feature Discovery 0.14.4 on Kubernetes 1.24.6.
Over a period of one hour I watched the NFD master reach 2 GB of memory.
I had to set a memory limit, since a couple of times it exhausted the entire memory of the host.
I also configured NFD GC to run garbage collection every minute, but this did not release any memory.
I have since removed the GPU Operator, the NFD workers, and NFD GC, yet the memory usage of NFD Master keeps increasing.
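For reference, the memory limit was applied roughly along these lines on the nfd-master container (a minimal sketch only; the container name and sizes below are illustrative, not the exact values rendered by the GPU Operator chart):

# Illustrative snippet: memory request/limit on the nfd-master container.
# The container name and sizes here are assumptions for this example.
spec:
  template:
    spec:
      containers:
        - name: master
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "2Gi"
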
Every minute, I can see the following in NFD Master's log:

I0314 12:21:30.594297       1 nfd-master.go:280] "reloading configuration"
I0314 12:21:30.594616       1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0314 12:21:30.594887       1 nfd-master.go:1274] "configuration successfully updated" configuration=<
        DenyLabelNs: {}
        EnableTaints: false
        ExtraLabelNs:
          nvidia.com: {}
        Klog: {}
        LabelWhiteList: {}
        LeaderElection:
          LeaseDuration:
            Duration: 15000000000
          RenewDeadline:
            Duration: 10000000000
          RetryPeriod:
            Duration: 2000000000
        NfdApiParallelism: 10
        NoPublish: false
        ResourceLabels: {}
        ResyncPeriod:
          Duration: 3600000000000
 >
I0314 12:21:30.594905       1 nfd-master.go:287] "stopping the nfd api controller"
I0314 12:21:30.594915       1 nfd-master.go:1338] "starting the nfd api controller"
I0314 12:21:30.595201       1 node-updater-pool.go:106] "stopping the NFD master node updater pool"
I0314 12:21:30.596167       1 node-updater-pool.go:81] "starting the NFD master node updater pool" parallelism=10
I0314 12:21:31.931613       1 nfd-master.go:694] "will process all nodes in the cluster"

This is the content of the ConfigMap mounted to NFD Master:

data:
  nfd-master.conf: |-
    extraLabelNs:
    - nvidia.com
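
For completeness, the full ConfigMap manifest would look roughly like this (the metadata name and namespace below are only placeholders for whatever the GPU Operator chart actually creates):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nfd-master-conf        # placeholder name
  namespace: gpu-operator      # placeholder namespace
data:
  nfd-master.conf: |-
    extraLabelNs:
    - nvidia.com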

What you expected to happen:
NFD Master to maintain steady memory usage.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.24.6
  • Cloud provider or hardware configuration: 4x AMD EPYC 7543 - 128 cores in total
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.3 LTS (Jammy Jellyfish)
  • Kernel (e.g. uname -a): 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others: NVIDIA GPU Operator 23.9.2
tatodorov added the kind/bug label on Mar 14, 2024
ArangoGutierrez (Contributor) commented:

Thanks for the report @tatodorov, @marquiz is looking into it. We will provide updates as soon as possible.

ArangoGutierrez added the priority/critical-urgent label on Mar 14, 2024

marquiz commented Mar 14, 2024

Thanks @tatodorov for reporting this (with a detailed description). On a quick analysis/testing, #1615 should fix this.

This is probably not commonly encountered, as most deployments do not have frequent nfd-master config file updates (which hides the problem).

ArangoGutierrez (Contributor) commented:

Hi @tatodorov, thanks again for reporting this issue. A fix has been merged into the release branch; would you help us test it in your environment before we cut a patch release?


ArangoGutierrez commented Mar 14, 2024

The image should be gcr.io/k8s-staging-nfd/node-feature-discovery:release-0.14 once https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-node-feature-discovery-push-images/1768301977449533440 runs to completion.
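
To try it in your environment, you could point the nfd-master Deployment at that staging image, roughly like this (a sketch only; the container name is an assumption, so adjust it to match your deployment):

# Illustrative Deployment snippet: swap in the staging image for testing.
# The container name "nfd-master" is assumed here.
spec:
  template:
    spec:
      containers:
        - name: nfd-master
          image: gcr.io/k8s-staging-nfd/node-feature-discovery:release-0.14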

tatodorov (Author) commented:

@ArangoGutierrez and @marquiz, thank you very much for your prompt reply!
I applied the patch from PR #1622 and built a new 0.14.4 binary and container image.
I am currently running the patched build and, so far, I don't observe a memory leak.
I will be able to confirm that for sure after running it for a couple of hours.

ArangoGutierrez (Contributor) commented:

Looking forward to your overnight report @tatodorov


marquiz commented Mar 15, 2024

Looking forward to your overnight report @tatodorov

Yeah. @tatodorov we're prepared to cut new patch release(s) quickly once we get the good-to-go signal. Based on my own testing, the issue looks to be fixed.

tatodorov (Author) commented:

@ArangoGutierrez, @marquiz I haven't observed the memory leak anymore.
So, from my point of view, PR #1617 resolves this issue.

Thank you very much for your assistance!


marquiz commented Mar 15, 2024

@tatodorov NFD v0.14.5 (and v0.15.3), containing the fix, has been released. I'm closing this issue now. Please re-open (or create a new issue) if you encounter any further issues.
/close

k8s-ci-robot (Contributor) commented:

@marquiz: Closing this issue.

In response to this:

@tatodorov NFD v0.14.5 (and v0.15.3), containing the fix, has been released. I'm closing this issue now. Please re-open (or create a new issue) if you encounter any further issues.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
