gpu-operator-nfd-worker fails to read net interface attribute speed #658

Open
6 tasks
blackliner opened this issue Jan 19, 2024 · 8 comments


@blackliner

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 6.5.0-14
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd v1.7.5
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): bare metal k8s via kubespray/kubeadm v1.27.7
  • GPU Operator Version: nvcr.io/nvidia/gpu-operator:v23.9.1

2. Issue or feature description

On Supermicro motherboards from the X12/H12 series with RoT (Root of Trust) function, an additional, virtual network interface appears in the operating system. Under Linux, its device name is enx+MAC (e.g. enxb03af2b6059f).

It looks like these USB ethernet gadget devices do not support reading out the speed property:

E0119 17:06:12.398921       1 network.go:143] "failed to read net iface attribute" err="read /host-sys/class/net/enxbe3af2b6059f/speed: invalid argument" attributeName="speed"

On the host

cat /sys/class/net/enxbe3af2b6059f/speed
cat: /sys/class/net/enxbe3af2b6059f/speed: Invalid argument
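For illustration, here is a minimal sketch of how a reader of these sysfs attributes could treat "Invalid argument" as "value not available" instead of a hard failure. It is written in Python for brevity and is not the NFD code (NFD is written in Go); the function name `read_iface_attr` and its parameters are hypothetical.

```python
import errno
from pathlib import Path
from typing import Optional


def read_iface_attr(iface: str, attr: str,
                    sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the stripped value of a sysfs net attribute, or None if unavailable.

    Hypothetical helper for illustration only; not part of NFD.
    """
    path = Path(sysfs_net) / iface / attr
    try:
        return path.read_text().strip()
    except OSError as exc:
        # Some interfaces (the rndis_host device in this report) return
        # EINVAL when 'speed' is read. Treat that, and a missing attribute
        # file, as "no value" rather than as an error.
        if exc.errno in (errno.EINVAL, errno.ENOENT):
            return None
        # Anything else (for example a permission error) is still raised.
        raise


if __name__ == "__main__":
    # Interface name taken from this report; adjust for your host.
    print(read_iface_attr("enxbe3af2b6059f", "speed"))
```

Returning `None` lets the caller skip the one attribute and still report everything else about the interface.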

Network details

ip a show enxbe3af2b6059f
4: enxbe3af2b6059f: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether be:3a:f2:b6:05:9f brd ff:ff:ff:ff:ff:ff

Ethtool also doesn't show any relevant info

sudo ethtool -i enxbe3af2b6059f
driver: rndis_host
version: 6.5.0-14-generic
firmware-version: RNDIS device
expansion-rom-version:
bus-info: usb-0000:03:00.4-1.2
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
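To find affected interfaces across many nodes without running ethtool by hand, the driver name can also be read from sysfs: for device-backed interfaces, `/sys/class/net/<iface>/device/driver` is a symlink whose last path component is the driver name. A rough sketch, with hypothetical helper names:

```python
import os
from typing import List, Optional


def iface_driver(iface: str, sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the kernel driver name bound to an interface, or None.

    Purely virtual interfaces such as lo have no 'device' entry, so the
    readlink call fails and None is returned.
    """
    link = os.path.join(sysfs_net, iface, "device", "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        return None


def ifaces_with_driver(driver: str, sysfs_net: str = "/sys/class/net") -> List[str]:
    """List interface names whose bound driver matches `driver`."""
    return sorted(
        name for name in os.listdir(sysfs_net)
        if iface_driver(name, sysfs_net) == driver
    )


if __name__ == "__main__":
    # 'rndis_host' is the driver reported by ethtool in this issue.
    print(ifaces_with_driver("rndis_host"))
```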

Relevant links:

3. Steps to reproduce the issue

Get one of the newest Supermicro products with this "feature", install the GPU Operator, and observe that NFD fails to register any labels on that node.

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

@cdesiniotis
Contributor

@blackliner thanks for reporting this. This was also reported to NFD upstream: kubernetes-sigs/node-feature-discovery#1556 and a fix has been merged: kubernetes-sigs/node-feature-discovery#1557. The fix is not in a released version of NFD yet, but we will pick it up as soon as it is out.

cc @ArangoGutierrez

@hex2f

hex2f commented May 16, 2024

getting this exact issue but on the loopback device instead

E0516 11:54:30.862625       1 network.go:152] "failed to read net iface attribute" err="read /host-sys/class/net/lo/speed: invalid argument" attributeName="speed"

any updates on when a fix for this might be merged / if there are any workarounds? would be massively appreciated 🥲🙏

@Parth-Sabhadiya-Imprivata

I am getting the same error
1 network.go:152] "failed to read net iface attribute" err="read /host-sys/class/net/lo/speed: invalid argument" attributeName="speed"

@zzorica

zzorica commented May 29, 2024

Same error on GKE, ubuntu_containerd nodes with gpu-operator version v24.3.0 and nfd registry.k8s.io/nfd/node-feature-discovery:v0.15.4

@vglukhik

vglukhik commented Jun 5, 2024

same error

@DavraYoung

DavraYoung commented Jun 22, 2024

same error on k3s agent. The master node works fine

@ZYWNB666

Same error on GKE, ubuntu_containerd nodes with gpu-operator version v24.3.0 and nfd registry.k8s.io/nfd/node-feature-discovery:v0.15.4

Use the latest NFD version (0.16.* or 0.17) to fix this problem

@cdesiniotis
Contributor

@blackliner can you verify if this issue is fixed in the latest GPU Operator version? GPU Operator 24.6.1 uses NFD v0.16.3, which should contain the fix for this issue.
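When checking a cluster, it may help to compare the NFD image tag that is actually deployed against the versions mentioned in this thread. The sketch below parses the tag from an image reference and compares it to a threshold. The threshold of v0.16.0 is an assumption inferred from the comments above ("0.16.*", "v0.16.3 ... should contain the fix") and has not been checked against the NFD release notes; the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption based only on the comments in this thread, not on NFD release notes.
ASSUMED_FIRST_FIXED: Tuple[int, int, int] = (0, 16, 0)

_TAG_RE = re.compile(r":v?(\d+)\.(\d+)\.(\d+)(?:[-+].*)?$")


def nfd_version(image: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from an image reference, or None."""
    match = _TAG_RE.search(image)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def likely_has_speed_fix(image: str,
                         first_fixed: Tuple[int, int, int] = ASSUMED_FIRST_FIXED) -> bool:
    """True if the image tag is at or above the assumed first fixed release."""
    version = nfd_version(image)
    return version is not None and version >= first_fixed


if __name__ == "__main__":
    for img in (
        "registry.k8s.io/nfd/node-feature-discovery:v0.15.4",
        "registry.k8s.io/nfd/node-feature-discovery:v0.16.3",
    ):
        print(img, likely_has_speed_fix(img))
```

A tag that cannot be parsed (for example `latest`) is reported as not fixed, so it gets a manual check instead of being silently trusted.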


8 participants