Add Configuration device plugin #627

Conversation

johnsonshih
Contributor

@johnsonshih johnsonshih commented Jul 3, 2023

What this PR does / why we need it:
This PR implements the proposal for exposing resources at the Configuration level: https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md#configuration-level-resources
With Configuration-level resources, users can request resources at the Configuration level without knowing instance ids beforehand.
Special notes for your reviewer:
The implementation is described in the documentation PR: project-akri/akri-docs#76

Summary of changes in this PR

  • CL resources and IL resources share the capacity pool, i.e. (# of allocated CL virtual devices + # of allocated IL virtual devices) <= capacity.
  • The name of the CL device plugin is the Akri Configuration name and follows the same convention as the IL device plugin, i.e., '.' and '/' are replaced with "-".
  • The CL device plugin uses dynamic names for virtual device ids. The virtual device ids it reports look like "0", "1", "2", .... The actual device usage slot is selected when ConfigurationDevicePlugin::allocate() is called.
  • ConfigurationDevicePlugin implements the behavior of the CL device plugin; InstanceDevicePlugin implements the behavior of the IL device plugin.
  • DevicePluginService contains a list_and_watch_message_sender used internally to trigger a list_and_watch refresh; a copy of the sender is stored in the associated InstanceInfo so that external entities can refresh the InstanceDevicePlugin's list_and_watch.
  • ConfigurationDevicePlugin contains a list_and_watch_message_sender used internally to trigger a list_and_watch refresh; a copy of the sender is stored in the InstanceConfig so that external entities can refresh the ConfigurationDevicePlugin's list_and_watch.
  • When the IL DPS allocates a virtual device, it notifies the CL DPS to refresh list_and_watch, and vice versa: the CL DPS notifies the IL DPS to refresh list_and_watch when it allocates a virtual device (a minimal sketch of this cross-notification appears after this list).
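
A minimal sketch of the cross-notification between the two device plugin kinds, assuming tokio broadcast channels; the type and field names below are illustrative, not the actual Akri definitions:

```rust
use tokio::sync::broadcast;

/// Message asking a device plugin's list_and_watch loop to re-scan availability.
#[derive(Clone, Debug)]
enum ListAndWatchMessageKind {
    Continue,
    End,
}

/// Hypothetical stand-in for the Instance-level device plugin service.
struct InstanceDevicePluginService {
    /// Sender used internally; a copy is also shared with external entities.
    list_and_watch_message_sender: broadcast::Sender<ListAndWatchMessageKind>,
    /// Copy of the Configuration-level plugin's sender, so an IL allocation
    /// can ask the CL plugin to refresh its advertised devices.
    cl_list_and_watch_message_sender: broadcast::Sender<ListAndWatchMessageKind>,
}

impl InstanceDevicePluginService {
    fn on_allocate(&self) {
        // After reserving a usage slot at the Instance level, refresh our own
        // list_and_watch and poke the Configuration-level plugin as well,
        // since both plugins draw from the same capacity pool.
        let _ = self
            .list_and_watch_message_sender
            .send(ListAndWatchMessageKind::Continue);
        let _ = self
            .cl_list_and_watch_message_sender
            .send(ListAndWatchMessageKind::Continue);
    }
}

#[tokio::main]
async fn main() {
    let (il_tx, mut il_rx) = broadcast::channel(8);
    let (cl_tx, mut cl_rx) = broadcast::channel(8);
    let il = InstanceDevicePluginService {
        list_and_watch_message_sender: il_tx,
        cl_list_and_watch_message_sender: cl_tx,
    };
    il.on_allocate();
    // Both list_and_watch loops receive a refresh request.
    assert!(il_rx.recv().await.is_ok());
    assert!(cl_rx.recv().await.is_ok());
}
```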

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

Signed-off-by: Johnson Shih <[email protected]>
Signed-off-by: Johnson Shih <[email protected]>
@kate-goldenring
Contributor

For history of discussions around this work, see previous PR #565

@bfjelds
Collaborator

bfjelds commented Jul 18, 2023

this is so awesome to see!

@bfjelds
Collaborator

bfjelds commented Jul 19, 2023

Regarding "report already reserved virtual devices + 1 free device usage (if available) per instance", it seems to be a design decision to only allow CL to claim resources from differing nodes (so requesting 2 CL resources would only work if there were 2 nodes with 1 healthy slot ... 1 node with 2 healthy slots could not be scheduled to).

I assume the purpose of that is to ensure high availability. And certainly, that provides it from a node perspective.

I'm not sure I'd make that a hard requirement though ... or at least allow the user to pick if they want it to be a requirement or a preference. For example, if there was a configuration that specified NodeHA=preference (rather than requirement) and there was a single node with 2 healthy slots, the user could schedule their workload. This might provide HA from a device API perspective (maybe the camera has glitchy software) or from a network perspective (maybe the network connection to the camera isn't great).
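
A hypothetical shape for such a requirement-vs-preference knob, purely for illustration (nothing like NodeHaPolicy exists in this PR or in the Configuration CRD):

```rust
/// Hypothetical spread policy for Configuration-level requests.
/// `Requirement` only admits the request if enough distinct devices exist;
/// `Preference` prefers spreading but falls back to co-located slots.
enum NodeHaPolicy {
    Requirement,
    Preference,
}

fn admit(policy: &NodeHaPolicy, distinct_devices: usize, total_free_slots: usize, requested: usize) -> bool {
    match policy {
        NodeHaPolicy::Requirement => distinct_devices >= requested,
        NodeHaPolicy::Preference => total_free_slots >= requested,
    }
}

fn main() {
    // One device with 2 healthy slots, requesting 2 CL resources:
    assert!(!admit(&NodeHaPolicy::Requirement, 1, 2, 2)); // rejected
    assert!(admit(&NodeHaPolicy::Preference, 1, 2, 2));   // admitted
}
```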

@bfjelds
Collaborator

bfjelds commented Jul 19, 2023

sorry, i made a lot of nit comments about variable/function naming ... i think the comments can all boil down to making it easier to understand the code (for me) if:

  • variables that describe maps/vectors/collections/sets were plural
  • consistency in usage of words like slot vs device id, etc

@bfjelds
Collaborator

bfjelds commented Jul 19, 2023

overall, this looks super awesome and am very excited about you bringing this feature in!!!

Signed-off-by: Johnson Shih <[email protected]>
@johnsonshih
Contributor Author

Regarding "report already reserved virtual devices + 1 free device usage (if available) per instance", it seems to be a design decision to only allow CL to claim resources from differing nodes (so requesting 2 CL resources would only work if there were 2 nodes with 1 healthy slot ... 1 node with 2 healthy slots could not be scheduled to).

I assume the purpose of that is to ensure high availability. And certainly, that provides it from a node perspective.

I'm not sure I'd make that a hard requirement though ... or at least allow the user to pick if they want it to be a requirement or a preference. For example, if there was a configuration that specified NodeHA=preference (rather than requirement) and there was a single node with 2 healthy slots, the user could schedule their workload. This might provide HA from a device API perspective (maybe the camera has glitchy software) or from a network perspective (maybe the network connection to the camera isn't great).

The CL resource allocation algorithm ensures that slots for a container on a given node are allocated from different devices. For example, say 2 devices are discovered and the capacity is 3. On nodeA, a container can allocate 1 or 2 slots, and multiple containers can be scheduled to nodeA (e.g. 6 containers each requesting 1 slot, 3 containers each requesting 2 slots, or a combination of 1-slot and 2-slot containers). On nodeB the same algorithm is used, and both nodes share the total of 6 slots.

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.

In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.

If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.
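
A minimal sketch of this "reserved + 1 free per instance" counting, assuming a simple per-instance usage summary; the names and data structures below are illustrative, not the actual Agent code:

```rust
use std::collections::HashMap;

/// Hypothetical per-instance view: how many usage slots this node has already
/// reserved for Configuration-level requests, and whether any slot is still free.
struct InstanceUsage {
    reserved_by_this_node: usize,
    has_free_slot: bool,
}

/// Number of Configuration-level virtual devices to report to the kubelet:
/// all already-reserved slots, plus at most one free slot per instance.
fn advertised_cl_device_count(instances: &HashMap<String, InstanceUsage>) -> usize {
    instances
        .values()
        .map(|u| u.reserved_by_this_node + usize::from(u.has_free_slot))
        .sum()
}

fn main() {
    // Example from the discussion: 2 instances, capacity 3, nothing reserved yet.
    // Instead of reporting 2 * 3 = 6 slots, the plugin reports 2 (one free slot
    // per instance) and grows the count as reservations are made.
    let mut instances = HashMap::new();
    instances.insert(
        "camera-a".to_string(),
        InstanceUsage { reserved_by_this_node: 0, has_free_slot: true },
    );
    instances.insert(
        "camera-b".to_string(),
        InstanceUsage { reserved_by_this_node: 0, has_free_slot: true },
    );
    assert_eq!(advertised_cl_device_count(&instances), 2);
}
```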

@johnsonshih johnsonshih requested a review from bfjelds July 26, 2023 22:16
@johnsonshih johnsonshih requested a review from bfjelds July 31, 2023 20:36
Signed-off-by: Johnson Shih <[email protected]>
&DeviceUsageKind::Configuration(vdev_id.clone()),
&dps.node_name,
)
.unwrap();
Collaborator

i don't remember if there was some policy on when to return an error and when to unwrap, do we want to return an error here instead of unwrap call?

Collaborator

@bfjelds bfjelds Aug 1, 2023

mostly, i just want to make sure that unwrap is called in appropriate and intentional places vs returning an error.

Contributor Author

dps.list_and_watch_message_sender is for notifying the list_and_watch thread within the same device plugin service to re-scan device availability. Since it is always valid as long as the device plugin service is running, the unwrap() should always succeed; if it doesn't, something unexpected has happened and we fail fast.
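
As a minimal sketch of that fail-fast choice, assuming a tokio broadcast-style sender like the one used for list_and_watch notifications (the function name is illustrative, not the actual Akri code):

```rust
use tokio::sync::broadcast;

fn notify_list_and_watch(sender: &broadcast::Sender<()>) {
    // The receiving list_and_watch task lives as long as the device plugin
    // service itself, so a send failure means the service is in an unexpected
    // state; panicking (fail fast) is preferred over returning an error that
    // nothing can meaningfully handle.
    sender.send(()).unwrap();
}

fn main() {
    // Keep the receiver alive so the send succeeds.
    let (tx, _rx) = broadcast::channel(1);
    notify_list_and_watch(&tx);
}
```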

@bfjelds
Collaborator

bfjelds commented Aug 1, 2023

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.

In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.

If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.

i can't seem to wrap my brain around this. i'm not sure i get why we would fail if a container requests 4 slots and there are 6. i can understand why we'd prefer to succeed using slots from 4 different devices, but i don't understand why we'd fail to schedule if we had enough slots for the 2 devices.

@johnsonshih
Contributor Author

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.
In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.
If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.

i can't seem to wrap my brain around this. i'm not sure i get why we would fail if a container requests 4 slots and there are 6. i can understand why we'd prefer to succeed using slots from 4 different devices, but i don't understand why we'd fail to schedule if we had enough slots for the 2 devices.

There are actually only 2 devices even if we report 6 slots. When allocating virtual devices to a container, does the Agent treat each slot as a different virtual device, or does it map each slot back to the actual device and allocate distinct devices? If the Agent treats each slot as a different virtual device, then a container requesting 4 virtual devices should be allowed. If the Agent maps slots back to actual devices, then it should reject the allocation since only 2 devices are available.

We had a long discussion about what the behavior should be in this situation. I pasted the relevant paragraph from the CL-resource proposal (https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md) below. The implementation is based on the concept of allocation by "unique device". If we think allocation by "unique slot" is a valid scenario, we can add that support. (I do have that implemented; we could add a field "uniqueDevice" to the Configuration CRD to decide which allocation policy to use.)

There are two implementation options in the case where there are not enough unique Instances to meet the requested number of Configuration-level resources. One scenario where this could happen is if a Pod requests 3 cameras when only 2 exist with a capacity of 2 each. In this case, the Configuration-level camera resource shows as having a quantity of 4, despite there being two cameras. In this case, the kubelet will try to schedule the Pod, thinking there are enough resources. The Agent could either allocate 2 spots on one camera and one on the other or deny the allocation request. The latter is the preferred approach as it is more consistent and ensures the workload is getting the number of unique devices it expects. After failing an allocate request from the kubelet, the Pod will be in an UnexpectedAdmissionError state until another camera comes online and it can be successfully scheduled.
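
A rough sketch of the "unique device" allocation policy described above, which takes at most one slot per Instance and rejects the request otherwise; the names and data structures are assumptions for illustration, not the actual Agent implementation:

```rust
use std::collections::HashMap;

/// Hypothetical free-slot view: instance name -> number of free usage slots.
type FreeSlots = HashMap<String, usize>;

/// Allocate `requested` Configuration-level virtual devices, taking at most one
/// slot per instance. Returns the chosen instance names, or an error if there
/// are not enough distinct instances (the kubelet then sees a failed allocate
/// and the Pod stays in UnexpectedAdmissionError until more devices appear).
fn allocate_unique_devices(free: &FreeSlots, requested: usize) -> Result<Vec<String>, String> {
    let mut chosen: Vec<String> = free
        .iter()
        .filter(|(_, slots)| **slots > 0)
        .map(|(name, _)| name.clone())
        .collect();
    if chosen.len() < requested {
        return Err(format!(
            "requested {} unique devices but only {} are available",
            requested,
            chosen.len()
        ));
    }
    chosen.truncate(requested);
    Ok(chosen)
}

fn main() {
    // 2 cameras with capacity 2 each: 4 slots total, but only 2 unique devices,
    // so a request for 3 Configuration-level resources is rejected.
    let free: FreeSlots = [("camera-a".to_string(), 2), ("camera-b".to_string(), 2)]
        .into_iter()
        .collect();
    assert!(allocate_unique_devices(&free, 2).is_ok());
    assert!(allocate_unique_devices(&free, 3).is_err());
}
```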

Signed-off-by: Johnson Shih <[email protected]>
@bfjelds
Collaborator

bfjelds commented Aug 1, 2023

The purpose of "report already reserved virtual devices + 1 free device usage (if available) per instance" is to reduce the chance that the kubelet allocates more slots than there are available devices. For the example above, where 2 devices are discovered and the capacity is 3, we can either report 2 * 3 = 6 slots up front, or report 2 slots first and increase the number of available slots later.
In the case of reporting 2 * 3 = 6 slots, if a container requests 4 CL resources, the kubelet issues an allocate request for the container and the Agent fails the request because only 2 devices are available. This ends up in an infinite loop where the kubelet keeps retrying the allocation of 4 CL resources until another 2 devices are discovered.
If we report 2 slots and increase the number of available slots later, then a container requesting 4 CL resources won't be scheduled until the reported available slots reach 4. This reduces the chance of the kubelet entering the infinite loop while allocating resources for the container.

i can't seem to wrap my brain around this. i'm not sure i get why we would fail if a container requests 4 slots and there are 6. i can understand why we'd prefer to succeed using slots from 4 different devices, but i don't understand why we'd fail to schedule if we had enough slots for the 2 devices.

There are actually only 2 devices even if we report 6 slots. When allocating virtual devices to a container, does the Agent treat each slot as a different virtual device, or does it map each slot back to the actual device and allocate distinct devices? If the Agent treats each slot as a different virtual device, then a container requesting 4 virtual devices should be allowed. If the Agent maps slots back to actual devices, then it should reject the allocation since only 2 devices are available.

We had a long discussion about what the behavior should be in this situation. I pasted the relevant paragraph from the CL-resource proposal (https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md) below. The implementation is based on the concept of allocation by "unique device". If we think allocation by "unique slot" is a valid scenario, we can add that support. (I do have that implemented; we could add a field "uniqueDevice" to the Configuration CRD to decide which allocation policy to use.)

There are two implementation options in the case where there are not enough unique Instances to meet the requested number of Configuration-level resources. One scenario where this could happen is if a Pod requests 3 cameras when only 2 exist with a capacity of 2 each. In this case, the Configuration-level camera resource shows as having a quantity of 4, despite there being two cameras. In this case, the kubelet will try to schedule the Pod, thinking there are enough resources. The Agent could either allocate 2 spots on one camera and one on the other or deny the allocation request. The latter is the preferred approach as it is more consistent and ensures the workload is getting the number of unique devices it expects. After failing an allocate request from the kubelet, the Pod will be in an UnexpectedAdmissionError state until another camera comes online and it can be successfully scheduled.

i guess i read preference in this line in the proposal: the Agent will preference unique Instances ... it is fine to preference unique instances, but if you don't have that, available access (capacity) should be acceptable.

@bfjelds
Collaborator

bfjelds commented Aug 2, 2023

i guess i read preference in this line in the proposal: the Agent will preference unique Instances ... it is fine to preference unique instances, but if you don't have that, available access (capacity) should be acceptable.

@kate-goldenring, I am way late to this party and I imagine you've done far more in-depth thinking on this than I have. My gut is that preferring unique instances is fine for CL, but that there should be a way to schedule to available slots (regardless of instance uniqueness). What is your take?

@diconico07
Contributor

i guess i read preference in this line in the proposal: the Agent will preference unique Instances ... it is fine to preference unique instances, but if you don't have that, available access (capacity) should be acceptable.

@kate-goldenring, I am way late to this party and I imagine you've done far more in-depth thinking on this than I have. My gut is that preferring unique instances is fine for CL, but that there should be a way to schedule to available slots (regardless of instance uniqueness). What is your take?

@bfjelds for me the main use case is "I want to deploy a workload with 2 (distinct) cameras": if I have a single camera available on a node, then the workload is not scheduled on that node; but if you expose all slots and a single camera has two slots, the workload will end up scheduled on that node even though that's not what you wanted.

The current implementation has a lot of unsolved corner cases here. As @johnsonshih mentioned during the community call, there is a "race window" when a workload gets deleted where two device plugin slots will still be "available" even though we only want one, and there is the case where two containers in the same pod each ask for the 2 cameras and can't be scheduled. But I don't really see any way to solve these with the Device Plugin API (maybe we could solve them with the new Dynamic Resource Allocation API, but I don't think we want to push for that right now) without sacrificing the "if I want 2 distinct cameras, I don't want to end up with two slots of the same camera" property.

@bfjelds
Collaborator

bfjelds commented Aug 3, 2023

@bfjelds for me the main use case is "I want to deploy a workload with 2 (distinct) cameras": if I have a single camera available on a node, then the workload is not scheduled on that node; but if you expose all slots and a single camera has two slots, the workload will end up scheduled on that node even though that's not what you wanted.

The current implementation has a lot of unsolved corner cases here. As @johnsonshih mentioned during the community call, there is a "race window" when a workload gets deleted where two device plugin slots will still be "available" even though we only want one, and there is the case where two containers in the same pod each ask for the 2 cameras and can't be scheduled. But I don't really see any way to solve these with the Device Plugin API (maybe we could solve them with the new Dynamic Resource Allocation API, but I don't think we want to push for that right now) without sacrificing the "if I want 2 distinct cameras, I don't want to end up with two slots of the same camera" property.

i think i've always understood this scenario to be: i want to schedule against slots without having to understand instance hashes. nothing about distinct or non-distinct entered into my mind.

@bfjelds
Collaborator

bfjelds commented Aug 4, 2023

i think i've always understood this scenario to be: i want to schedule against slots without having to understand instance hashes. nothing about distinct or non-distinct entered into my mind.

but i'll defer to the other folks. the code seems good to me, approved.

@johnsonshih
Contributor Author

i think i've always understood this scenario to be: i want to schedule against slots without having to understand instance hashes. nothing about distinct or non-distinct entered into my mind.

but i'll defer to the other folks. the code seems good to me, approved.

thanks for reviewing this PR.

@johnsonshih johnsonshih merged commit 42a7615 into project-akri:main Aug 4, 2023
50 checks passed
@johnsonshih johnsonshih deleted the user/jshih/configuration-dp-dynamic-vdev branch August 4, 2023 18:02