Modify Agent to reduce frequency of Pods getting UnexpectedAdmissionError
#450
Comments
Issue has been automatically marked as stale due to inactivity for 90 days. Update the issue to remove label, otherwise it will be automatically closed.
This is an adoption blocker for me. I'm using Akri for its device plugin capabilities and I'm getting UnexpectedAdmissionErrors with every new deployment.
@meibensteiner this may be because |
This does not provide any fairness for unshared devices, which is another reason to change the behavior, particularly for those devices. That said, we may want to be consistent and change it for all devices. The best option is probably to make this configurable in the Agent.
Wouldn't this allow multiple pods to access a single device? Feels very hacky.
Yes, it would allow that. When you are getting the UnexpectedAdmissionErrors, are they eventually retrying and succeeding?
They are, after a few minutes. But still, this triggers alerts in our monitoring.
@meibensteiner makes sense. Then, I'd say this is higher priority. I'll try to look into it this week.
The Akri Agent creates device plugins for each device it discovers. The device plugin framework allows for specifying how many containers can use a device plugin at a time. The kubelet will then ensure that no Pod is scheduled to use that resource if all usage has been taken up.
If the kubelet believes that a resource/device plugin should be available but the device plugin/Agent responds that it is not available, the Pod will show an UnexpectedAdmissionError.
Currently, the Agent errors when kubelet requests to use a device usage slot it is already using. Even though the kubelet is the source of truth, the Agent errors here in order to free up the usage slot so that any node can take it. In short, the Agent is actively creating UnexpectedAdmissionErrors for the sake of Node fairness for shared devices.
This behavior should change to reduce UnexpectedAdmissionErrors. The Agent should trust kubelet as the source of truth. This will lead to fewer admission errors, especially in Jobs scenarios like the following:
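Independent of any particular scenario, the difference between the current and the proposed Agent behavior can be sketched roughly as follows. This is a minimal, self-contained Rust model and not Akri's actual code: `SlotTable`, `AllocationPolicy`, `AllocateError`, and the `<device>-<index>` slot naming are all hypothetical names invented for illustration. It assumes the Agent keeps a per-device table recording which node holds each usage slot.

```rust
use std::collections::HashMap;

/// Hypothetical policy switch: what the Agent does when kubelet requests
/// a usage slot that the Agent already records as taken.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllocationPolicy {
    /// Behavior described in this issue: refuse the request, which the
    /// Pod surfaces as an UnexpectedAdmissionError.
    RejectIfInUse,
    /// Proposed behavior: treat kubelet as the source of truth and accept.
    TrustKubelet,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AllocateError {
    UnknownSlot(String),
    SlotInUse { slot: String, holder: String },
}

/// Hypothetical in-memory view of one device's usage slots.
pub struct SlotTable {
    policy: AllocationPolicy,
    /// Slot id -> node currently recorded as holder (`None` means free).
    slots: HashMap<String, Option<String>>,
}

impl SlotTable {
    /// Creates `capacity` free slots named `<device>-<index>` (assumed naming).
    pub fn new(device: &str, capacity: usize, policy: AllocationPolicy) -> Self {
        let slots = (0..capacity)
            .map(|i| (format!("{}-{}", device, i), None))
            .collect();
        SlotTable { policy, slots }
    }

    /// Handles a kubelet request to use `slot` on behalf of `node`.
    pub fn allocate(&mut self, slot: &str, node: &str) -> Result<(), AllocateError> {
        let policy = self.policy;
        let entry = self
            .slots
            .get_mut(slot)
            .ok_or_else(|| AllocateError::UnknownSlot(slot.to_string()))?;

        if policy == AllocationPolicy::RejectIfInUse {
            if let Some(holder) = &*entry {
                // The slot is already recorded as taken: error out.
                return Err(AllocateError::SlotInUse {
                    slot: slot.to_string(),
                    holder: holder.clone(),
                });
            }
        }

        // Either the slot was free, or kubelet is trusted: record the holder.
        *entry = Some(node.to_string());
        Ok(())
    }
}
```

Under `RejectIfInUse`, a second request for `camera-0` returns `SlotInUse`; under `TrustKubelet`, the same request succeeds. As noted in the comments above, the trusting policy can allow more than one Pod to reach the same device, which is one reason a configurable policy (as suggested for the Agent) may be preferable to a hard-coded one.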