Add Configuration device plugin #627

Conversation

johnsonshih
Contributor

@johnsonshih johnsonshih commented Jul 3, 2023

What this PR does / why we need it:
This PR implements the proposal for exposing resources at the Configuration level: https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md#configuration-level-resources
With Configuration-level resources, users can request resources at the Configuration level without knowing instance ids beforehand.
Special notes for your reviewer:
The implementation is described in the documentation PR: project-akri/akri-docs#76

Summary of changes in this PR

  • CL resources and IL resources share the capacity pool, i.e. (# of allocated CL virtual devices + # of allocated IL virtual devices) <= capacity.
  • The name of the CL device plugin is the Akri Configuration name and follows the same convention as the IL device plugin, i.e., '.' and '/' are replaced with "-".
  • The CL device plugin uses dynamic names for virtual device ids. The virtual device ids it reports look like "0", "1", "2", .... The actual device usage slot is selected when ConfigurationDevicePlugin::allocate() is called.
  • ConfigurationDevicePlugin implements the behavior of the CL device plugin; InstanceDevicePlugin implements the behavior of the IL device plugin.
  • DevicePluginService contains a list_and_watch_message_sender used internally to trigger a list_and_watch refresh; a copy of the sender is stored in the associated InstanceInfo so that external entities can refresh the InstanceDevicePlugin's list_and_watch.
  • ConfigurationDevicePlugin contains a list_and_watch_message_sender used internally to trigger a list_and_watch refresh; a copy of the sender is stored in the InstanceConfig so that external entities can refresh the ConfigurationDevicePlugin's list_and_watch.
  • When the IL DPS allocates a virtual device, it notifies the CL DPS to refresh list_and_watch, and vice versa: the CL DPS notifies the IL DPS to refresh list_and_watch when it allocates a virtual device (a minimal sketch of this cross-notification appears after this list).
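
A minimal sketch of the cross-notification between the two device plugin kinds, assuming tokio broadcast channels; the type and field names below are illustrative, not the actual Akri definitions:

```rust
use tokio::sync::broadcast;

/// Message asking a device plugin's list_and_watch loop to re-scan availability.
#[derive(Clone, Debug)]
enum ListAndWatchMessageKind {
    Continue,
    End,
}

/// Hypothetical stand-in for the Instance-level device plugin service.
struct InstanceDevicePluginService {
    /// Sender used internally; a copy is also shared with external entities.
    list_and_watch_message_sender: broadcast::Sender<ListAndWatchMessageKind>,
    /// Copy of the Configuration-level plugin's sender, so an IL allocation
    /// can ask the CL plugin to refresh its advertised devices.
    cl_list_and_watch_message_sender: broadcast::Sender<ListAndWatchMessageKind>,
}

impl InstanceDevicePluginService {
    fn on_allocate(&self) {
        // After reserving a usage slot at the Instance level, refresh our own
        // list_and_watch and poke the Configuration-level plugin as well,
        // since both plugins draw from the same capacity pool.
        let _ = self
            .list_and_watch_message_sender
            .send(ListAndWatchMessageKind::Continue);
        let _ = self
            .cl_list_and_watch_message_sender
            .send(ListAndWatchMessageKind::Continue);
    }
}

#[tokio::main]
async fn main() {
    let (il_tx, mut il_rx) = broadcast::channel(8);
    let (cl_tx, mut cl_rx) = broadcast::channel(8);
    let il = InstanceDevicePluginService {
        list_and_watch_message_sender: il_tx,
        cl_list_and_watch_message_sender: cl_tx,
    };
    il.on_allocate();
    // Both list_and_watch loops receive a refresh request.
    assert!(il_rx.recv().await.is_ok());
    assert!(cl_rx.recv().await.is_ok());
}
```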

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

Signed-off-by: Johnson Shih <[email protected]>
Signed-off-by: Johnson Shih <[email protected]>
@kate-goldenring
Contributor

For history of discussions around this work, see previous PR #565

@bfjelds
Collaborator

bfjelds commented Jul 18, 2023

this is so awesome to see!

@bfjelds
Collaborator

bfjelds commented Jul 19, 2023

Regarding "report already reserved virtual devices + 1 free device usage (if available) per instance", it seems to be a design decision to only allow CL to claim resources from differing nodes (so requesting 2 CL resources would only work if there were 2 nodes with 1 healthy slot ... 1 node with 2 healthy slots could not be scheduled to).

I assume the purpose of that is to ensure high availability. And certainly, that provides it from a node perspective.

I'm not sure I'd make that a hard requirement though ... or at least allow the user to pick if they want it to be a requirement or a preference. For example, if there was a configuration that specified NodeHA=preference (rather than requirement) and there was a single node with 2 healthy slots, the user could schedule their workload. This might provide HA from a device API perspective (maybe the camera has glitchy software) or from a network perspective (maybe the network connection to the camera isn't great).
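
A hypothetical shape for such a requirement-vs-preference knob, purely for illustration (nothing like NodeHaPolicy exists in this PR or in the Configuration CRD):

```rust
/// Hypothetical spread policy for Configuration-level requests.
/// `Requirement` only admits the request if enough distinct devices exist;
/// `Preference` prefers spreading but falls back to co-located slots.
enum NodeHaPolicy {
    Requirement,
    Preference,
}

fn admit(policy: &NodeHaPolicy, distinct_devices: usize, total_free_slots: usize, requested: usize) -> bool {
    match policy {
        NodeHaPolicy::Requirement => distinct_devices >= requested,
        NodeHaPolicy::Preference => total_free_slots >= requested,
    }
}

fn main() {
    // One device with 2 healthy slots, requesting 2 CL resources:
    assert!(!admit(&NodeHaPolicy::Requirement, 1, 2, 2)); // rejected
    assert!(admit(&NodeHaPolicy::Preference, 1, 2, 2));   // admitted
}
```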

@bfjelds
Collaborator

bfjelds commented Jul 19, 2023

sorry, i made a lot of nit comments about variable/function naming ... i think the comments can all boil down to making it easier to understand the code (for me) if:

  • variables that describe maps/vectors/collections/sets were plural
  • consistency in usage of words like slot vs device id, etc

@bfjelds
Collaborator

bfjelds commented Jul 19, 2023

overall, this looks super awesome and am very excited about you bringing this feature in!!!

Signed-off-by: Johnson Shih <[email protected]>
@johnsonshih
Contributor Author

Regarding "report already reserved virtual devices + 1 free device usage (if available) per instance", it seems to be a design decision to only allow CL to claim resources from differing nodes (so requesting 2 CL resources would only work if there were 2 nodes with 1 healthy slot ... 1 node with 2 healthy slots could not be scheduled to).

I assume the purpose of that is to ensure high availability. And certainly, that provides it from a node perspective.

I'm not sure I'd make that a hard requirement though ... or at least allow the user to pick if they want it to be a requirement or a preference. For example, if there was a configuration that specified NodeHA=preference (rather than requirement) and there was a single node with 2 healthy slots, the user could schedule their workload. This might provide HA from a device API perspective (maybe the camera has glitchy software) or from a network perspective (maybe the network connection to the camera isn't great).

The CL resource allocation algorithm ensures that slots for a container on a given node are allocated from different devices. For example, say 2 devices are discovered and the capacity is 3. On nodeA, a container can allocate 1 or 2 slots, and multiple containers can be scheduled to nodeA (e.g. 6 containers each requesting 1 slot, 3 containers each requesting 2 slots, or a combination of 1-slot and 2-slot containers). On nodeB the same algorithm is used, and both nodes share the total of 6 slots.

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.

In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.

If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.
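
A minimal sketch of this "reserved + 1 free per instance" counting, assuming a simple per-instance usage summary; the names and data structures below are illustrative, not the actual Agent code:

```rust
use std::collections::HashMap;

/// Hypothetical per-instance view: how many usage slots this node has already
/// reserved for Configuration-level requests, and whether any slot is still free.
struct InstanceUsage {
    reserved_by_this_node: usize,
    has_free_slot: bool,
}

/// Number of Configuration-level virtual devices to report to the kubelet:
/// all already-reserved slots, plus at most one free slot per instance.
fn advertised_cl_device_count(instances: &HashMap<String, InstanceUsage>) -> usize {
    instances
        .values()
        .map(|u| u.reserved_by_this_node + usize::from(u.has_free_slot))
        .sum()
}

fn main() {
    // Example from the discussion: 2 instances, capacity 3, nothing reserved yet.
    // Instead of reporting 2 * 3 = 6 slots, the plugin reports 2 (one free slot
    // per instance) and grows the count as reservations are made.
    let mut instances = HashMap::new();
    instances.insert(
        "camera-a".to_string(),
        InstanceUsage { reserved_by_this_node: 0, has_free_slot: true },
    );
    instances.insert(
        "camera-b".to_string(),
        InstanceUsage { reserved_by_this_node: 0, has_free_slot: true },
    );
    assert_eq!(advertised_cl_device_count(&instances), 2);
}
```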

@johnsonshih johnsonshih requested a review from bfjelds July 26, 2023 22:16
@johnsonshih johnsonshih requested a review from bfjelds July 31, 2023 20:36
Signed-off-by: Johnson Shih <[email protected]>
&DeviceUsageKind::Configuration(vdev_id.clone()),
&dps.node_name,
)
.unwrap();
Collaborator

i don't remember if there was some policy on when to return an error and when to unwrap, do we want to return an error here instead of unwrap call?

Collaborator

@bfjelds bfjelds Aug 1, 2023

mostly, i just want to make sure that unwrap is called in appropriate and intentional places vs returning an error.

Contributor Author

dps.list_and_watch_message_sender is for notifying the list_and_watch thread within the same device plugin service to re-scan device availability. Since it is always valid as long as the device plugin service is running, the unwrap() should always succeed; if it doesn't, something unexpected has happened and we fail fast.
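
As a minimal sketch of that fail-fast choice, assuming a tokio broadcast-style sender like the one used for list_and_watch notifications (the function name is illustrative, not the actual Akri code):

```rust
use tokio::sync::broadcast;

fn notify_list_and_watch(sender: &broadcast::Sender<()>) {
    // The receiving list_and_watch task lives as long as the device plugin
    // service itself, so a send failure means the service is in an unexpected
    // state; panicking (fail fast) is preferred over returning an error that
    // nothing can meaningfully handle.
    sender.send(()).unwrap();
}

fn main() {
    // Keep the receiver alive so the send succeeds.
    let (tx, _rx) = broadcast::channel(1);
    notify_list_and_watch(&tx);
}
```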

@bfjelds
Collaborator

bfjelds commented Aug 1, 2023

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.

In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.

If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.

i can't seem to wrap my brain around this. i'm not sure i get why we would fail if a container requests 4 slots and there are 6. i can understand why we'd prefer to succeed using slots from 4 different devices, but i don't understand why we'd fail to schedule if we had enough slots for the 2 devices.

@johnsonshih
Contributor Author

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.
In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.
If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.

i can't seem to wrap my brain around this. i'm not sure i get why we would fail if a container requests 4 slots and there are 6. i can understand why we'd prefer to succeed using slots from 4 different devices, but i don't understand why we'd fail to schedule if we had enough slots for the 2 devices.

There are actually only 2 devices even if we report 6 slots. When allocating virtual devices to a container, does the Agent treat each slot as a different virtual device, or does it map each slot back to the actual device and allocate distinct devices? If the Agent treats each slot as a different virtual device, then a container requesting 4 virtual devices should be allowed. If the Agent maps slots back to actual devices, then it should reject the allocation since only 2 devices are available.

We had a long discussion about what the behavior should be in this situation. I pasted the relevant paragraph from the CL-resource proposal (https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md) below. The implementation is based on the concept of allocation by "unique device". If we think allocation by "unique slot" is a valid scenario, we can add that support. (I do have that implemented; we could add a field "uniqueDevice" to the Configuration CRD to decide which allocation policy to use.)

There are two implementation options in the case where there are not enough unique Instances to meet the requested number of Configuration-level resources. One scenario where this could happen is if a Pod requests 3 cameras when only 2 exist with a capacity of 2 each. In this case, the Configuration-level camera resource shows as having a quantity of 4, despite there being two cameras. In this case, the kubelet will try to schedule the Pod, thinking there are enough resources. The Agent could either allocate 2 spots on one camera and one on the other or deny the allocation request. The latter is the preferred approach as it is more consistent and ensures the workload is getting the number of unique devices it expects. After failing an allocate request from the kubelet, the Pod will be in an UnexpectedAdmissionError state until another camera comes online and it can be successfully scheduled.
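
A rough sketch of the "unique device" allocation policy described above, which takes at most one slot per Instance and rejects the request otherwise; the names and data structures are assumptions for illustration, not the actual Agent implementation:

```rust
use std::collections::HashMap;

/// Hypothetical free-slot view: instance name -> number of free usage slots.
type FreeSlots = HashMap<String, usize>;

/// Allocate `requested` Configuration-level virtual devices, taking at most one
/// slot per instance. Returns the chosen instance names, or an error if there
/// are not enough distinct instances (the kubelet then sees a failed allocate
/// and the Pod stays in UnexpectedAdmissionError until more devices appear).
fn allocate_unique_devices(free: &FreeSlots, requested: usize) -> Result<Vec<String>, String> {
    let mut chosen: Vec<String> = free
        .iter()
        .filter(|(_, slots)| **slots > 0)
        .map(|(name, _)| name.clone())
        .collect();
    if chosen.len() < requested {
        return Err(format!(
            "requested {} unique devices but only {} are available",
            requested,
            chosen.len()
        ));
    }
    chosen.truncate(requested);
    Ok(chosen)
}

fn main() {
    // 2 cameras with capacity 2 each: 4 slots total, but only 2 unique devices,
    // so a request for 3 Configuration-level resources is rejected.
    let free: FreeSlots = [("camera-a".to_string(), 2), ("camera-b".to_string(), 2)]
        .into_iter()
        .collect();
    assert!(allocate_unique_devices(&free, 2).is_ok());
    assert!(allocate_unique_devices(&free, 3).is_err());
}
```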

Signed-off-by: Johnson Shih <[email protected]>
@bfjelds
Collaborator

bfjelds commented Aug 1, 2023

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.
In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.
If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.

i can't seem to wrap my brain around this. i'm not sure i get why we would fail if a container requests 4 slots and there are 6. i can understand why we'd prefer to succeed using slots from 4 different devices, but i don't understand why we'd fail to schedule if we had enough slots for the 2 devices.

There are actually only 2 devices even if we report 6 slots. When allocating virtual devices to a container, does the Agent treat each slot as a different virtual device, or does it map each slot back to the actual device and allocate distinct devices? If the Agent treats each slot as a different virtual device, then a container requesting 4 virtual devices should be allowed. If the Agent maps slots back to actual devices, then it should reject the allocation since only 2 devices are available.

We had a long discussion about what the behavior should be in this situation. I pasted the relevant paragraph from the CL-resource proposal (https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md) below. The implementation is based on the concept of allocation by "unique device". If we think allocation by "unique slot" is a valid scenario, we can add that support. (I do have that implemented; we could add a field "uniqueDevice" to the Configuration CRD to decide which allocation policy to use.)

There are two implementation options in the case where there are not enough unique Instances to meet the requested number of Configuration-level resources. One scenario where this could happen is if a Pod requests 3 cameras when only 2 exist with a capacity of 2 each. In this case, the Configuration-level camera resource shows as having a quantity of 4, despite there being two cameras. In this case, the kubelet will try to schedule the Pod, thinking there are enough resources. The Agent could either allocate 2 spots on one camera and one on the other or deny the allocation request. The latter is the preferred approach as it is more consistent and ensures the workload is getting the number of unique devices it expects. After failing an allocate request from the kubelet, the Pod will be in an UnexpectedAdmissionError state until another camera comes online and it can be successfully scheduled.

i guess i read preference in this line in the proposal: the Agent will preference unique Instances ... it is fine to preference unique instances, but if you don't have that, available access (capacity) should be acceptable.

@bfjelds
Collaborator

bfjelds commented Aug 2, 2023

i guess i read preference in this line in the proposal: the Agent will preference unique Instances ... it is fine to preference unique instances, but if you don't have that, available access (capacity) should be acceptable.

@kate-goldenring, I am way late to this party and I imagine you've done far more in-depth thinking on this than I have. My gut is that preferring unique instances is fine for CL, but that there should be a way to schedule to available slots (regardless of instance uniqueness). What is your take?

@diconico07
Contributor

i guess i read preference in this line in the proposal: the Agent will preference unique Instances ... it is fine to preference unique instances, but if you don't have that, available access (capacity) should be acceptable.

@kate-goldenring, I am way late to this party and I imagine you've done far more in-depth thinking on this than I have. My gut is that preferring unique instances is fine for CL, but that there should be a way to schedule to available slots (regardless of instance uniqueness). What is your take?

@bfjelds for me the main use case is "I want to deploy a workload with 2 (distinct) cameras": if I have a single camera available on a node, then the workload is not scheduled on that node; but if you expose all slots and a single camera has two slots, the workload will end up scheduled on that node even though that's not what you wanted.

The current implementation has a lot of unsolved corner cases here. As @johnsonshih mentioned during the community call, there is a "race window" when a workload gets deleted where two device plugin slots will still be "available" even though we only want one, and there is the case where two containers in the same pod each ask for the 2 cameras and can't be scheduled. But I don't really see any way to solve these with the Device Plugin API (maybe we could solve them with the new Dynamic Resource Allocation API, but I don't think we want to push for that right now) without sacrificing the "if I want 2 distinct cameras, I don't want to end up with two slots of the same camera" property.

@bfjelds
Collaborator

bfjelds commented Aug 3, 2023

@bfjelds for me the main use case is "I want to deploy a workload with 2 (distinct) cameras": if I have a single camera available on a node, then the workload is not scheduled on that node; but if you expose all slots and a single camera has two slots, the workload will end up scheduled on that node even though that's not what you wanted.

The current implementation has a lot of unsolved corner cases here. As @johnsonshih mentioned during the community call, there is a "race window" when a workload gets deleted where two device plugin slots will still be "available" even though we only want one, and there is the case where two containers in the same pod each ask for the 2 cameras and can't be scheduled. But I don't really see any way to solve these with the Device Plugin API (maybe we could solve them with the new Dynamic Resource Allocation API, but I don't think we want to push for that right now) without sacrificing the "if I want 2 distinct cameras, I don't want to end up with two slots of the same camera" property.

i think i've always understood this scenario to be: i want to schedule against slots without having to understand instance hashes. nothing about distinct or non-distinct entered into my mind.

@bfjelds
Collaborator

bfjelds commented Aug 4, 2023

i think i've always understood this scenario to be: i want to schedule against slots without having to understand instance hashes. nothing about distinct or non-distinct entered into my mind.

but i'll defer to the other folks. the code seems good to me, approved.

@johnsonshih
Contributor Author

i think i've always understood this scenario to be: i want to schedule against slots without having to understand instance hashes. nothing about distinct or non-distinct entered into my mind.

but i'll defer to the other folks. the code seems good to me, approved.

thanks for reviewing this PR.

@johnsonshih johnsonshih merged commit 42a7615 into project-akri:main Aug 4, 2023
50 checks passed
@johnsonshih johnsonshih deleted the user/jshih/configuration-dp-dynamic-vdev branch August 4, 2023 18:02