frequent operator crashes #476
Comments
Why does this request to the API server take so long?
Is the API server under heavy load? We could probably increase the context timeout for the leader election part, but the root cause does not seem to be in the operator.
Some of the tests run in the cluster deploy an unusually large number of Pods onto a single machine, which in this case also happened to be the one running the operator. This creates heavy load from scheduling and from the gpu-plugin allocating and releasing resources. "Large number" here means hundreds of Pods; the node's Pod count limit has been raised from the default of 110.
Recent client-go voluntarily limits the rate at which it sends requests: https://github.com/kubernetes/client-go/blob/5521967004d84d9e6f89df86dfeb5977f993bcbd/rest/request.go#L847-L852 If the operator issues many requests during reconciliation, this can delay the requests that leader election needs to make periodically. The solution is to use two clients, one for leader election and one for the actual work: https://github.com/kubernetes-csi/external-provisioner/blob/080d35df20983a57cc0d1da514b1654822998e94/cmd/csi-provisioner/csi-provisioner.go#L367-L371
Thank you! We need to do the same.
This still makes me wonder, though. Our reconciler is not that complex and should not be making many requests. Is something sub-optimal?
Not reproducible anymore -> closing. Please reopen if it happens again.
@uniemimu reported frequent crashes with the operator.
error snippet: