
RDMA allocatable resources changed to 0 after kubelet restart #74

Open
WulixuanS opened this issue Jul 11, 2023 · 6 comments


WulixuanS commented Jul 11, 2023

Version: v1.3.2

RDMA device plugin log:
[screenshot: RDMA device plugin log]

As can be seen from the log, when the kubelet restarts it triggers a context cancellation, and the subsequent restart blocks because the channel size is 0. The context listener was added in issue #51.

When the kubelet restarts, ListAndWatch will receive the event from the stop channel, so there is no need to watch the context. I fixed the bug by removing the context listener; if necessary, I can submit a PR.

func (rs *resourceServer) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
        resp := new(pluginapi.ListAndWatchResponse)

        // Send initial list of devices
        if err := rs.sendDevices(resp, s); err != nil {
                return err
        }

        for {
                select {
                case <-s.Context().Done():
                        log.Printf("ListAndWatch stream close: %v", s.Context().Err())
                        return nil
                case <-rs.stop:
                        return nil
                case d := <-rs.health:
                        // FIXME: there is no way to recover from the Unhealthy state.
                        d.Health = pluginapi.Unhealthy
                        _ = s.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devs})
                case <-rs.updateResource:
                        if err := rs.sendDevices(resp, s); err != nil {
                                // The old stream may not be closed properly, return to close it
                                // and pass the update event to the new stream for processing
                                rs.updateResource <- true
                                return err
                        }
                }
        }
}

func (rs *resourceServer) Restart() error {
        log.Printf("restarting %s device plugin server...", rs.resourceName)
        if rs.rsConnector == nil || rs.rsConnector.GetServer() == nil {
                return fmt.Errorf("grpc server instance not found for %s", rs.resourceName)
        }

        rs.rsConnector.Stop()
        rs.rsConnector.DeleteServer()

        // Send terminate signal to ListAndWatch()
        rs.stop <- true

        return rs.Start()
}
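The blocking described above can be reproduced in isolation: a send on an unbuffered Go channel blocks until a receiver is ready, so once ListAndWatch has returned, a send like `rs.updateResource <- true` or `rs.stop <- true` never completes. A minimal sketch of that channel behavior (the channel name is illustrative; it assumes the plugin's channels are created as `make(chan bool)` with no buffer):

```go
package main

import "fmt"

func main() {
	// Unbuffered channel, analogous to rs.updateResource above
	// (assumption: the plugin creates it without a buffer size).
	update := make(chan bool)

	// With nothing receiving, a plain `update <- true` would block forever.
	// A select with a default branch shows the would-block condition safely.
	select {
	case update <- true:
		fmt.Println("sent")
	default:
		fmt.Println("send would block: no receiver on unbuffered channel")
	}

	// Once a receiver is running, the same send completes.
	done := make(chan struct{})
	go func() {
		<-update
		close(done)
	}()
	update <- true // unblocks as soon as the goroutine receives
	<-done
	fmt.Println("send completed once a receiver was ready")
}
```

Giving the channel a buffer of 1 (or removing the second sender, as proposed above) is what prevents the restart path from parking forever.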

WulixuanS commented Jul 11, 2023

cc @adrianchiris


adrianchiris commented Aug 13, 2023

What is the K8s version you are using?

I see in the logs:

Using Deprecated Device Plugin Registry path

Does the following path exist on your system: /var/lib/kubelet/plugins_registry?

adrianchiris commented

Please check #82, it should solve the issue.

adrianchiris commented

v1.4.0 is out please check :)


hvp4 commented Jan 8, 2024

@adrianchiris The v1.4.0 release seems to be broken; the image cannot be found:

        docker pull nvcr.io/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.4.0
        Error response from daemon: manifest for nvcr.io/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.4.0 not found: manifest unknown: manifest unknown

Nor can it be seen here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/k8s-rdma-shared-dev-plugin/tags
