frequent operator crashes #476

Closed
mythi opened this issue Oct 16, 2020 · 6 comments

@mythi
Contributor

mythi commented Oct 16, 2020

@uniemimu reported frequent crashes with the operator.

error snippet:

I1016 08:30:50.380630       1 request.go:621] Throttling request took 4.500242089s, request: GET:https://10.96.0.1:443/api/v1/namespaces/inteldeviceplugins-system/configmaps/d1c7b6d5.intel.com
2020-10-16T08:30:51.083Z	INFO	intel-device-plugins-manager.controller	Starting Controller	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "QatDevicePlugin", "controller": "qatdeviceplugin"}
2020-10-16T08:30:51.086Z	INFO	intel-device-plugins-manager.controller	Starting Controller	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "GpuDevicePlugin", "controller": "gpudeviceplugin"}
2020-10-16T08:30:51.886Z	INFO	intel-device-plugins-manager.controller	Starting Controller	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "SgxDevicePlugin", "controller": "sgxdeviceplugin"}
2020-10-16T08:30:53.280Z	INFO	intel-device-plugins-manager.controller	Starting workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "GpuDevicePlugin", "controller": "gpudeviceplugin", "worker count": 1}
2020-10-16T08:30:53.280Z	INFO	intel-device-plugins-manager.controller	Starting workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "SgxDevicePlugin", "controller": "sgxdeviceplugin", "worker count": 1}
2020-10-16T08:30:53.186Z	INFO	intel-device-plugins-manager.controller	Starting workers	{"reconcilerGroup": "fpga.intel.com", "reconcilerKind": "AcceleratorFunction", "controller": "acceleratorfunction", "worker count": 1}
2020-10-16T08:30:53.186Z	INFO	intel-device-plugins-manager.controller	Starting workers	{"reconcilerGroup": "fpga.intel.com", "reconcilerKind": "FpgaRegion", "controller": "fpgaregion", "worker count": 1}
2020-10-16T08:30:53.280Z	INFO	intel-device-plugins-manager.controller	Starting workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "QatDevicePlugin", "controller": "qatdeviceplugin", "worker count": 1}
2020-10-16T08:30:53.280Z	INFO	intel-device-plugins-manager.controller	Starting EventSource	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "FpgaDevicePlugin", "controller": "fpgadeviceplugin", "source": "kind source: /, Kind="}
2020-10-16T08:30:53.682Z	INFO	intel-device-plugins-manager.controller	Starting Controller	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "FpgaDevicePlugin", "controller": "fpgadeviceplugin"}
2020-10-16T08:30:53.985Z	INFO	intel-device-plugins-manager.controller	Starting workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "FpgaDevicePlugin", "controller": "fpgadeviceplugin", "worker count": 1}
E1016 08:30:55.194944       1 leaderelection.go:320] error retrieving resource lock inteldeviceplugins-system/d1c7b6d5.intel.com: Get https://10.96.0.1:443/api/v1/namespaces/inteldeviceplugins-system/configmaps/d1c7b6d5.intel.com: context deadline exceeded
I1016 08:30:55.289333       1 leaderelection.go:277] failed to renew lease inteldeviceplugins-system/d1c7b6d5.intel.com: timed out waiting for the condition
2020-10-16T08:30:55.297Z	INFO	intel-device-plugins-manager.controller	Stopping workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "GpuDevicePlugin", "controller": "gpudeviceplugin"}
2020-10-16T08:30:55.299Z	INFO	intel-device-plugins-manager.controller	Stopping workers	{"reconcilerGroup": "fpga.intel.com", "reconcilerKind": "FpgaRegion", "controller": "fpgaregion"}
2020-10-16T08:30:55.299Z	INFO	intel-device-plugins-manager.controller	Stopping workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "SgxDevicePlugin", "controller": "sgxdeviceplugin"}
2020-10-16T08:30:55.299Z	INFO	intel-device-plugins-manager.controller	Stopping workers	{"reconcilerGroup": "fpga.intel.com", "reconcilerKind": "AcceleratorFunction", "controller": "acceleratorfunction"}
2020-10-16T08:30:55.300Z	INFO	intel-device-plugins-manager.controller	Stopping workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "FpgaDevicePlugin", "controller": "fpgadeviceplugin"}
2020-10-16T08:30:55.300Z	INFO	intel-device-plugins-manager.controller	Stopping workers	{"reconcilerGroup": "deviceplugin.intel.com", "reconcilerKind": "QatDevicePlugin", "controller": "qatdeviceplugin"}
2020-10-16T08:30:55.787Z	INFO	controller-runtime.webhook	shutting down webhook server
2020-10-16T08:30:55.380Z	ERROR	setup	problem running manager	{"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
main.main
	/intel-device-plugins-for-kubernetes/cmd/operator/main.go:145
runtime.main
	/usr/lib/golang/src/runtime/proc.go:203
@mythi mythi added the operator Device operator related issue label Oct 16, 2020
@rojkov
Contributor

rojkov commented Oct 16, 2020

Why does this request to the API server take so long?

I1016 08:30:50.380630       1 request.go:621] Throttling request took 4.500242089s, request: GET:https://10.96.0.1:443/api/v1/namespaces/inteldeviceplugins-system/configmaps/d1c7b6d5.intel.com

Is it under heavy load?

We could probably increase the context timeout for the leader election part, but the root cause does not seem to be in the operator.

@uniemimu
Contributor

Why does this request to the API server take so long?

I1016 08:30:50.380630       1 request.go:621] Throttling request took 4.500242089s, request: GET:https://10.96.0.1:443/api/v1/namespaces/inteldeviceplugins-system/configmaps/d1c7b6d5.intel.com

Is it under heavy load?

We could probably increase the context timeout for the leader election part, but the root cause does not seem to be in the operator.

Some of the tests being run in the cluster involve deploying an unusually large number of Pods onto the same machine, which in this particular case also happened to be the one running the operator. There is heavy load from scheduling and from the gpu-plugin allocating and releasing resources.

"Large number" here means hundreds of Pods. The node Pod count limit has been increased from the standard 110.

@pohly

pohly commented Oct 16, 2020

Why does this request to the API server take so long?

Recent client-go voluntarily limits the rate at which it sends requests: https://github.com/kubernetes/client-go/blob/5521967004d84d9e6f89df86dfeb5977f993bcbd/rest/request.go#L847-L852

If the operator is doing lots of requests during reconciliation, then this will potentially delay the requests that leadership election needs to do periodically.

The solution is to use two clients, one for leader election and one for the actual work: https://github.com/kubernetes-csi/external-provisioner/blob/080d35df20983a57cc0d1da514b1654822998e94/cmd/csi-provisioner/csi-provisioner.go#L367-L371

@rojkov
Contributor

rojkov commented Oct 16, 2020

The solution is to use two clients, one for leader election and one for the actual work

Thank you! We need to do the same.

@mythi
Contributor Author

mythi commented Oct 19, 2020

If the operator is doing lots of requests during reconciliation

This still made me wonder. Our reconciler is not that complex and should not be making lots of requests. Is something sub-optimal?

@msivosuo

Not reproducible anymore -> closing. Please reopen if it happens again.
