
Modify Agent to reduce frequency of Pods getting UnexpectedAdmissionError #450

Closed
kate-goldenring opened this issue Feb 19, 2022 · 8 comments · Fixed by #556
Labels
enhancement (New feature or request), keep-alive

Comments


kate-goldenring commented Feb 19, 2022

The Akri Agent creates device plugins for each device it discovers. The device plugin framework allows for specifying how many containers can use a device plugin at a time. The kubelet will then ensure that no Pod is scheduled to use that resource if all usage has been taken up.

If the kubelet believes that a resource/device plugin slot should be available but the device plugin (the Agent) responds that it is not, the Pod will show an UnexpectedAdmissionError.

Currently, the Agent errors when the kubelet requests a device usage slot that is already in use. Even though the kubelet is the source of truth, the Agent was erroring here in order to free up the usage slot so that any node could take it. In short, the Agent is actively creating UnexpectedAdmissionErrors in the name of Node fairness for shared devices.
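The decision described above can be sketched as a small state check. This is an illustrative sketch only, not Akri's actual API: the names `SlotState`, `AllocateResult`, and `allocate_slot` are hypothetical. It contrasts the old behavior (error when the slot is already held by this node) with the proposed behavior (trust the kubelet and accept):

```rust
// Hypothetical sketch of the Agent's slot-allocation decision.
// All type and function names here are illustrative, not Akri's real code.

#[derive(Debug, PartialEq)]
enum SlotState {
    Free,
    UsedByThisNode,
    UsedByOtherNode,
}

#[derive(Debug, PartialEq)]
enum AllocateResult {
    Ok,
    Error(&'static str),
}

// Proposed behavior: trust the kubelet as the source of truth. If the kubelet
// asks to use a slot this node already holds, accept it instead of erroring.
// (The old behavior returned an error here to free the slot for other nodes,
// which surfaced as an UnexpectedAdmissionError on the requesting Pod.)
fn allocate_slot(state: &SlotState) -> AllocateResult {
    match state {
        SlotState::Free => AllocateResult::Ok,
        SlotState::UsedByThisNode => AllocateResult::Ok, // previously: an error
        SlotState::UsedByOtherNode => AllocateResult::Error("slot taken by another node"),
    }
}

fn main() {
    assert_eq!(allocate_slot(&SlotState::Free), AllocateResult::Ok);
    assert_eq!(allocate_slot(&SlotState::UsedByThisNode), AllocateResult::Ok);
    assert!(matches!(
        allocate_slot(&SlotState::UsedByOtherNode),
        AllocateResult::Error(_)
    ));
}
```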

This behavior should change to reduce UnexpectedAdmissionErrors. The Agent should trust the kubelet as the source of truth. This will lead to fewer admission errors, especially in Jobs scenarios like the following:

helm upgrade akri akri-helm-charts/akri-dev  \
    $AKRI_HELM_CRICTL_CONFIGURATION  \
    --set agent.allowDebugEcho=true  \
    --set debugEcho.discovery.enabled=true  \
    --set debugEcho.configuration.brokerJob.image.repository=busybox  \
    --set debugEcho.configuration.brokerJob.command[0]="sh"  \
    --set debugEcho.configuration.brokerJob.command[1]="-c"  \
    --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello Amazing World'"  \
    --set debugEcho.configuration.brokerJob.command[3]="sleep 5"  \
    --set debugEcho.configuration.brokerJob.parallelism=2  \
    --set debugEcho.configuration.brokerJob.completions=2  \
    --set debugEcho.configuration.enabled=true  \
    --set debugEcho.configuration.capacity=1 \
    --set debugEcho.configuration.shared=true 
@bfjelds added the enhancement label Mar 1, 2022
@github-actions

This issue has been automatically marked as stale due to 90 days of inactivity. Update the issue to remove the label; otherwise it will be automatically closed.

@meibensteiner

This is an adoption blocker for me. I'm using Akri for its device plugin capabilities, and I'm getting UnexpectedAdmissionErrors with every new deployment.

@kate-goldenring
Contributor Author

@meibensteiner this may be because capacity is at its default (of 1). I'd recommend increasing it: --set udev.configuration.capacity=5
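For context, a full invocation with a raised capacity might look like the following. This is a sketch only: the release name, chart, and the `udev.discovery.enabled`/`udev.configuration.enabled` flags are assumed by analogy with the debugEcho example above; only `udev.configuration.capacity` comes from this comment.

```shell
# Illustrative sketch: raise capacity so up to 5 Pods can use each udev
# device at once, reducing contention-driven UnexpectedAdmissionErrors.
# Release name, chart, and the enabled flags are assumed placeholders.
helm upgrade akri akri-helm-charts/akri \
    --set udev.discovery.enabled=true \
    --set udev.configuration.enabled=true \
    --set udev.configuration.capacity=5
```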

@kate-goldenring
Contributor Author

> Currently, the Agent errors when kubelet requests to use a device usage slot it is already using. Even though the kubelet is the source of truth, the Agent was erroring here in order to free up the usage slot so that any node could take it. In short, the Agent is actively creating UnexpectedAdmissionErrors on behalf of Node fairness for shared devices.

This provides no fairness benefit for unshared devices, which is another reason to change the behavior, at least for those devices. Though we may want to be consistent and do it for all devices; the best bet is probably to make this configurable in the Agent.

@meibensteiner

Wouldn't this allow multiple pods to access a single device? Feels very hacky.

@kate-goldenring
Contributor Author

Yes, it would allow that. When you get the UnexpectedAdmissionErrors, do they eventually retry and succeed?

@meibensteiner

They are, after a few minutes. But still, this triggers alerts in our monitoring.

@kate-goldenring
Contributor Author

@meibensteiner makes sense. Then, I'd say this is higher priority. I'll try to look into it this week.
