
Modify Agent to reduce frequency of Pods getting UnexpectedAdmissionError #450

Closed
kate-goldenring opened this issue Feb 19, 2022 · 8 comments · Fixed by #556
Labels
enhancement (New feature or request), keep-alive

Comments


kate-goldenring commented Feb 19, 2022

The Akri Agent creates device plugins for each device it discovers. The device plugin framework allows for specifying how many containers can use a device plugin at a time. The kubelet will then ensure that no Pod is scheduled to use that resource if all usage has been taken up.

If the kubelet believes that a resource/device plugin slot should be available but the device plugin (the Agent) responds that it is not, the Pod will show an UnexpectedAdmissionError.

Currently, the Agent errors when the kubelet requests a device usage slot that is already in use. Even though the kubelet is the source of truth, the Agent was erroring here in order to free up the usage slot so that any node could take it. In short, the Agent is actively creating UnexpectedAdmissionErrors in the name of Node fairness for shared devices.
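The decision described above can be sketched as a small state check. This is an illustrative sketch only, not Akri's actual API: the names `SlotState`, `AllocateResult`, and `allocate_slot` are hypothetical. It contrasts the old behavior (error when the slot is already held by this node) with the proposed behavior (trust the kubelet and accept):

```rust
// Hypothetical sketch of the Agent's slot-allocation decision.
// All type and function names here are illustrative, not Akri's real code.

#[derive(Debug, PartialEq)]
enum SlotState {
    Free,
    UsedByThisNode,
    UsedByOtherNode,
}

#[derive(Debug, PartialEq)]
enum AllocateResult {
    Ok,
    Error(&'static str),
}

// Proposed behavior: trust the kubelet as the source of truth. If the kubelet
// asks to use a slot this node already holds, accept it instead of erroring.
// (The old behavior returned an error here to free the slot for other nodes,
// which surfaced as an UnexpectedAdmissionError on the requesting Pod.)
fn allocate_slot(state: &SlotState) -> AllocateResult {
    match state {
        SlotState::Free => AllocateResult::Ok,
        SlotState::UsedByThisNode => AllocateResult::Ok, // previously: an error
        SlotState::UsedByOtherNode => AllocateResult::Error("slot taken by another node"),
    }
}

fn main() {
    assert_eq!(allocate_slot(&SlotState::Free), AllocateResult::Ok);
    assert_eq!(allocate_slot(&SlotState::UsedByThisNode), AllocateResult::Ok);
    assert!(matches!(
        allocate_slot(&SlotState::UsedByOtherNode),
        AllocateResult::Error(_)
    ));
}
```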

This behavior should change to reduce UnexpectedAdmissionErrors. The Agent should trust the kubelet as the source of truth. This will lead to fewer admission errors, especially in Jobs scenarios like the following:

helm upgrade akri akri-helm-charts/akri-dev  \
    $AKRI_HELM_CRICTL_CONFIGURATION  \
    --set agent.allowDebugEcho=true  \
    --set debugEcho.discovery.enabled=true  \
    --set debugEcho.configuration.brokerJob.image.repository=busybox  \
    --set debugEcho.configuration.brokerJob.command[0]="sh"  \
    --set debugEcho.configuration.brokerJob.command[1]="-c"  \
    --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello Amazing World'"  \
    --set debugEcho.configuration.brokerJob.command[3]="sleep 5"  \
    --set debugEcho.configuration.brokerJob.parallelism=2  \
    --set debugEcho.configuration.brokerJob.completions=2  \
    --set debugEcho.configuration.enabled=true  \
    --set debugEcho.configuration.capacity=1 \
    --set debugEcho.configuration.shared=true 
@bfjelds added the enhancement label Mar 1, 2022
@github-actions

This issue has been automatically marked as stale due to 90 days of inactivity. Update the issue to remove the label; otherwise it will be automatically closed.

@meibensteiner

This is an adoption blocker for me. I'm using Akri for its device plugin capabilities, and I'm getting UnexpectedAdmissionErrors with every new deployment.

@kate-goldenring
Contributor Author

@meibensteiner this may be because capacity is at its default (of 1). I'd recommend increasing it: --set udev.configuration.capacity=5
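For context, a full invocation with a raised capacity might look like the following. This is a sketch only: the release name, chart, and the `udev.discovery.enabled`/`udev.configuration.enabled` flags are assumed by analogy with the debugEcho example above; only `udev.configuration.capacity` comes from this comment.

```shell
# Illustrative sketch: raise capacity so up to 5 Pods can use each udev
# device at once, reducing contention-driven UnexpectedAdmissionErrors.
# Release name, chart, and the enabled flags are assumed placeholders.
helm upgrade akri akri-helm-charts/akri \
    --set udev.discovery.enabled=true \
    --set udev.configuration.enabled=true \
    --set udev.configuration.capacity=5
```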

@kate-goldenring
Contributor Author

> Currently, the Agent errors when kubelet requests to use a device usage slot it is already using. Even though the kubelet is the source of truth, the Agent was erroring here in order to free up the usage slot so that any node could take it. In short, the Agent is actively creating UnexpectedAdmissionErrors on behalf of Node fairness for shared devices.

This provides no fairness benefit for unshared devices, which is another reason to change the behavior, at least for those devices. Though we may want to be consistent and do it for all devices; the best bet is probably to make this configurable in the Agent.

@meibensteiner

Wouldn't this allow multiple pods to access a single device? Feels very hacky.

@kate-goldenring
Contributor Author

Yes, it would allow that. When you get the UnexpectedAdmissionErrors, do they eventually retry and succeed?

@meibensteiner

They are, after a few minutes. But still, this triggers alerts in our monitoring.

@kate-goldenring
Contributor Author

@meibensteiner makes sense. Then, I'd say this is higher priority. I'll try to look into it this week.
