Add Resource Class proposal
Signed-off-by: vikaschoudhary16 <[email protected]>
vikaschoudhary16 committed Jul 6, 2017
1 parent 3ef47e3 commit a3d30c3
Showing 1 changed file with 333 additions and 0 deletions.
333 changes: 333 additions & 0 deletions contributors/design-proposals/resource-class.md
@@ -0,0 +1,333 @@
# Resource Classes Proposal

1. [Abstract](#abstract)
2. [Motivation](#motivation)
3. [Use Cases](#use-cases)
4. [Objectives](#objectives)
5. [Non Objectives](#non-objectives)
6. [Resource Class](#resource-class)
7. [API Changes](#api-changes)
8. [Scheduler Changes](#scheduler-changes)
9. [Kubelet Changes](#kubelet-changes)
10. [Opaque Integer Resources](#opaque-integer-resources)
11. [Future Scope](#future-scope)

_Authors:_

* @vikaschoudhary16 - Vikas Choudhary &lt;[email protected]&gt;
* @aveshagarwal - Avesh Agarwal &lt;[email protected]&gt;

## Abstract
In this document we describe *resource classes*, a new model for
representing compute resources in Kubernetes. This document should be seen as a
successor to the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files)
and depends on it.

## Motivation
Compute resources in Kubernetes are represented as a key-value map with the key
being a string and the value being a 'Quantity' which can (optionally) be
fractional. The current model is great for supporting simple compute resources
like CPU or Memory. The current model requires identity mapping between available
resources and requested resources. Since 'CPU' and 'Memory' are resources that
are available across all Kubernetes deployments, the current user-facing API
(Pod Specification) remains portable. However, the current model cannot support
new resources like GPUs, ASICs, NICs, local storage, etc., which can potentially
require non-identity mapping between available and requested resources, and
require additional metadata about each resource to support heterogeneity and
management at scale.

_GPU Integration Example:_
* [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
* [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join [email protected] for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

## Use Cases

* I want to have a compute resource type which can be created with meaningful
  and portable names. This compute resource can hold additional metadata as well
  that will justify its name, for example:
  * `nvidia.gpu.high.mem` is the name and metadata is memory greater than 'X' GB.
  * `fast.nic` is the name and associated metadata is bandwidth greater than
    'B' gbps.
* If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
  type device which has memory greater than or equal to 'X' GB, should be able
  to satisfy this request, independent of other device capabilities such as
  'version' or 'nvlink locality' etc.
* Similarly, if I request a resource `fast.nic`, any nic device with speed
  greater than 'B' gbps should be able to meet the request.
* I want a rich metadata selection interface where operators like ‘Less Than’,
  ‘Greater Than’ and ‘In’, are supported on the compute resource metadata.

## Objectives

1. Define and add support in the API for a new type, *Resource Class*.
2. Add support for *Resource Class* in the scheduler.

## Non Objectives
1. Discovery, advertisement, and allocation/deallocation of devices are expected to
be addressed by the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files).

## Resource Class
*Resource Class* is a new type whose objects provide an abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices using `matchExpressions`, a list of
(operator, key, value) requirements. A *Resource Class* object selects a device
if at least one of its `matchExpressions` lists matches the device details.
Within a `matchExpressions` list, all the (operator, key, value) requirements
are ANDed together to evaluate the result.

YAML example 1:
```yaml
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
    -
      matchExpressions:
        -
          key: "Kind"
          operator: "In"
          values:
            - "nvidia-gpu"
        -
          key: "memory"
          operator: "Gt"
          values:
            - "30G"
```
The resource class above selects all the nvidia-gpus which have memory greater
than 30 GB.

YAML example 2:
```yaml
kind: ResourceClass
metadata:
  name: hugepages-1gig
spec:
  resourceSelector:
    -
      matchExpressions:
        -
          key: "Kind"
          operator: "In"
          values:
            - "huge-pages"
        -
          key: "size"
          operator: "Gt"
          values:
            - "1G"
```
The resource class above selects all the hugepages with size greater than
1 GB, since `Gt` is a strict comparison.

YAML example 3:
```yaml
kind: ResourceClass
metadata:
  name: fast.nic
spec:
  resourceSelector:
    -
      matchExpressions:
        -
          key: "Kind"
          operator: "In"
          values:
            - "nic"
        -
          key: "speed"
          operator: "In"
          values:
            - "40GBPS"
```
The resource class above selects all the NICs whose speed is 40 GBPS, since
`In` matches only the listed values.

## API Changes
### ResourceClass
Internal representation of *Resource Class*:
```golang
// +nonNamespaced=true
// +genclient=true

type ResourceClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	// Spec defines resources required
	Spec ResourceClassSpec
	// +optional
	Status ResourceClassStatus
}

// ResourceClassSpec defines resources required
type ResourceClassSpec struct {
	// ResourceSelector selects resources
	ResourceSelector []ResourcePropertySelector
}

// A null or empty selector matches no resources
type ResourcePropertySelector struct {
	// A list of resource/device selector requirements.
	// The requirements are ANDed together.
	MatchExpressions []ResourceSelectorRequirement
}

// A resource selector requirement is a selector that contains values, a key, and an operator
// that relates the key and values
type ResourceSelectorRequirement struct {
	// The label key that the selector applies to
	// +patchMergeKey=key
	// +patchStrategy=merge
	Key string
	// The values that the operator compares the key against
	// +optional
	Values []string
	// The operator that relates the key to the values
	Operator ResourceSelectorOperator
}

type ResourceSelectorOperator string

const (
	ResourceSelectorOpIn           ResourceSelectorOperator = "In"
	ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
	ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
	ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
	ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
	ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
)
```
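
As an illustration only, the matching rules described earlier (requirements inside one `matchExpressions` list are ANDed, and a device is selected if at least one list matches) could be evaluated roughly as below. This is a minimal sketch, not the proposed implementation: it assumes device details arrive as a flat `map[string]string`, shortens the operator constant names, and uses a hypothetical `parseAmount` helper in place of real `Quantity` parsing. Treating a missing key as a match for `NotIn` is also an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type ResourceSelectorOperator string

const (
	OpIn           ResourceSelectorOperator = "In"
	OpNotIn        ResourceSelectorOperator = "NotIn"
	OpExists       ResourceSelectorOperator = "Exists"
	OpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
	OpGt           ResourceSelectorOperator = "Gt"
	OpLt           ResourceSelectorOperator = "Lt"
)

type ResourceSelectorRequirement struct {
	Key      string
	Operator ResourceSelectorOperator
	Values   []string
}

type ResourcePropertySelector struct {
	MatchExpressions []ResourceSelectorRequirement
}

// parseAmount is a hypothetical stand-in for Quantity parsing. It accepts a
// plain integer with an optional decimal "G" suffix, for example "30G".
func parseAmount(s string) (int64, error) {
	mult := int64(1)
	if strings.HasSuffix(s, "G") {
		s, mult = strings.TrimSuffix(s, "G"), 1000*1000*1000
	}
	n, err := strconv.ParseInt(s, 10, 64)
	return n * mult, err
}

// requirementMatches evaluates a single (operator, key, value) requirement
// against the properties of one device.
func requirementMatches(props map[string]string, r ResourceSelectorRequirement) bool {
	val, ok := props[r.Key]
	switch r.Operator {
	case OpExists:
		return ok
	case OpDoesNotExist:
		return !ok
	case OpIn, OpNotIn:
		in := false
		for _, v := range r.Values {
			in = in || (ok && v == val)
		}
		return in == (r.Operator == OpIn)
	case OpGt, OpLt:
		if !ok || len(r.Values) != 1 {
			return false
		}
		have, err1 := parseAmount(val)
		want, err2 := parseAmount(r.Values[0])
		if err1 != nil || err2 != nil {
			return false
		}
		return (r.Operator == OpGt && have > want) || (r.Operator == OpLt && have < want)
	}
	return false
}

// selects reports whether any selector matches the device: requirements
// inside one selector are ANDed, and the selectors themselves are ORed.
func selects(selectors []ResourcePropertySelector, props map[string]string) bool {
	for _, sel := range selectors {
		matched := len(sel.MatchExpressions) > 0 // an empty selector matches nothing
		for _, r := range sel.MatchExpressions {
			matched = matched && requirementMatches(props, r)
		}
		if matched {
			return true
		}
	}
	return false
}

func main() {
	highMem := []ResourcePropertySelector{{MatchExpressions: []ResourceSelectorRequirement{
		{Key: "Kind", Operator: OpIn, Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: OpGt, Values: []string{"30G"}},
	}}}
	fmt.Println(selects(highMem, map[string]string{"Kind": "nvidia-gpu", "memory": "32G"})) // true
	fmt.Println(selects(highMem, map[string]string{"Kind": "nvidia-gpu", "memory": "16G"})) // false
}
```

The selector in `main` mirrors YAML example 1, so a 32G GPU is selected and a 16G GPU is not.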
### ResourceClassStatus
```golang
type ResourceClassStatus struct {
	Allocatable resource.Quantity
	Request     resource.Quantity
}
```
ResourceClass status is updated by the scheduler at:
1. New *Resource Class* object creation.
2. Node addition to the cluster.
3. Node removal from the cluster.
4. Pod creation if pod requests a resource class.
5. Pod deletion if pod was consuming resource class.

`ResourceClassStatus` serves the following two purposes:
* Evaluation of scheduler predicates during pod creation. For details, please
refer to the sections that follow.
* Users can view the current usage/availability details of the resource class
using kubectl.

### User story
The administrator has deployed device plugins to support hardware present in the
cluster. Device plugins, running on nodes, will update node status indicating
the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
about the devices as selection criteria.

1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters the nodes which do not match the resource requests.
3. The scheduler selects a device for each resource class requested and annotates
the pod object with device selection info.
4. Kubelet reads the device request from the pod annotation and calls `Allocate` on
the matching Device Plugins.
5. The user deletes the pod or the pod terminates.
6. Kubelet reads the pod object annotation for devices consumed and calls `Deallocate`
on the matching Device Plugins.

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore, to know the availability of a
device, it would have to calculate the current total consumption by iterating
over all the admitted pods running on the node. This is already done today when
the kubelet runs predicates for each new incoming pod. Even if we assume that
the scheduler cache and the consumption state created at runtime for each pod
are exactly the same, the current API interfaces do not allow the selected
device to be passed to the container manager (from where the device plugin will
actually be invoked). This problem occurs because devices are determined
internally from resource classes, while other resource requests can be
determined from the pod object directly.

To summarize, device selection at the kubelet can be done in one of the following
two ways:
* Select the device at pod admission while applying predicates, and change all
API interfaces that are required to pass the selected device to the container
runtime manager.
* Create the resource consumption state again at the container manager and
select the device there.

Neither of the above approaches seems cleaner than doing device selection at the
scheduler, which helps to retain cleaner API interfaces between packages.

## Scheduler Changes
The scheduler already listens for changes in node and pod objects and maintains
their state in its cache. We will enhance the logic:
1. To listen for user-created *Resource Class* objects and maintain their state in the cache.
2. To look for device-related details in node objects and maintain accounting for
devices as well.

From the events perspective, handling for the following events will be added/updated:

### Resource Class Creation
1. Initialize the resource class info and add it to the local cache.
2. Iterate over all existing nodes in the cache to figure out if there are devices
on these nodes which are selectable by the resource class. If found, update the
resource class availability status in the local cache.
3. Patch the status of the resource class API object with the availability state
in the local cache.

### Resource Class Deletion
Delete the resource class info from the cache.

### Node Addition
Scheduler already caches `NodeInfo`. Now additionally update device state:
1. Check in the node status if any devices are present.
2. For each device found, iterate over all existing resource classes in the cache
to find resource classes which can select this particular device. For all
such resource classes, update the availability state in the local cache.
3. The status of the ResourceClass API object, `ResourceClassStatus`, will be
patched with the new “allocatable” value.

### Node Deletion
If node has devices which are selectable by existing resource classes:
1. Adjust resource class state in local cache.
2. Update resource class status by patching api object.
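
The cache bookkeeping for the four events above could look roughly like the following sketch. It is illustrative only: the `cache`, `classInfo` and `device` types are hypothetical, the class selector is reduced to a plain predicate function, availability is a simple device count rather than a `Quantity`, and patching the API object is left out.

```go
package main

import "fmt"

// device is a hypothetical view of one device advertised in node status.
type device struct {
	ID    string
	Props map[string]string
}

// classInfo is the cached state of one resource class. Selects stands in
// for evaluating the class's resourceSelector against device properties.
type classInfo struct {
	Selects     func(props map[string]string) bool
	Allocatable int64
}

type cache struct {
	nodes   map[string][]device   // node name -> devices on that node
	classes map[string]*classInfo // class name -> cached class state
}

// addClass registers a new class and scans every cached node for devices
// that the class selects.
func (c *cache) addClass(name string, selects func(map[string]string) bool) *classInfo {
	info := &classInfo{Selects: selects}
	c.classes[name] = info
	for _, devs := range c.nodes {
		for _, d := range devs {
			if selects(d.Props) {
				info.Allocatable++
			}
		}
	}
	return info // the caller would patch this into the API object's status
}

// removeClass drops the class from the cache.
func (c *cache) removeClass(name string) {
	delete(c.classes, name)
}

// addNode records the node's devices and credits every class that selects them.
func (c *cache) addNode(node string, devs []device) {
	c.nodes[node] = devs
	for _, d := range devs {
		for _, info := range c.classes {
			if info.Selects(d.Props) {
				info.Allocatable++
			}
		}
	}
}

// removeNode reverses addNode for the devices that lived on the node.
func (c *cache) removeNode(node string) {
	for _, d := range c.nodes[node] {
		for _, info := range c.classes {
			if info.Selects(d.Props) {
				info.Allocatable--
			}
		}
	}
	delete(c.nodes, node)
}

func main() {
	c := &cache{nodes: map[string][]device{}, classes: map[string]*classInfo{}}
	isNIC := func(p map[string]string) bool { return p["Kind"] == "nic" }

	c.addNode("node-a", []device{{ID: "eth0", Props: map[string]string{"Kind": "nic"}}})
	info := c.addClass("fast.nic", isNIC)
	fmt.Println(info.Allocatable) // 1

	c.addNode("node-b", []device{
		{ID: "eth1", Props: map[string]string{"Kind": "nic"}},
		{ID: "gpu0", Props: map[string]string{"Kind": "nvidia-gpu"}},
	})
	fmt.Println(info.Allocatable) // 2

	c.removeNode("node-a")
	fmt.Println(info.Allocatable) // 1
}
```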

### Pod Creation
1. Get the requested resource class name and quantity from the pod spec.
2. Select nodes by applying predicates according to the requested quantity and
the resource class's state present in the cache.
3. On the selected node, select a device from the device info stored in the cache
after matching the key and value from the requested resource class.
4. After device selection, update (decrease) 'Requested' for all the resource
classes in the cache which could select this device.
5. Patch the resource class objects with the new 'Requested' in the `ResourceClassStatus`.
6. Add the pod reference to the local DeviceToPod mapping structure in the cache.
7. Patch the pod object with the selected device annotation, using the prefix 'scheduler.alpha.kubernetes.io/resClass'.

### Pod Deletion
1. Iterate over all the devices on the node to which the pod was scheduled and
find the devices being used by the pod.
2. For each device consumed by the pod, update the availability state in the
cache of the resource classes which can select this device.
3. Patch `ResourceClassStatus` with the new availability state.
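
The device bookkeeping on pod creation and pod deletion could be sketched as follows. Several details are assumptions made for illustration and are not specified by this proposal: the first free matching device is chosen, device IDs are unique across the cluster, each class tracks a single `Free` counter instead of separate status quantities, and the annotation key is the prefix followed by `-<class name>` with the device ID as its value.

```go
package main

import "fmt"

const annotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// device is a hypothetical view of one device on the selected node.
type device struct {
	ID    string
	Props map[string]string
}

// classInfo holds the cached state of one resource class. Selects stands in
// for evaluating the class's resourceSelector.
type classInfo struct {
	Selects func(props map[string]string) bool
	Free    int64 // devices selectable by this class that are not in use
}

type cache struct {
	classes     map[string]*classInfo
	deviceToPod map[string]string // device ID -> name of the pod using it
}

// assign picks the first free device on the node that the class selects,
// records the assignment, and returns the annotation to patch onto the pod.
func (c *cache) assign(pod, class string, nodeDevices []device) (key, value string, ok bool) {
	info, found := c.classes[class]
	if !found {
		return "", "", false
	}
	for _, d := range nodeDevices {
		if _, used := c.deviceToPod[d.ID]; used || !info.Selects(d.Props) {
			continue
		}
		// The device is now unavailable to every class that could select it.
		for _, other := range c.classes {
			if other.Selects(d.Props) {
				other.Free--
			}
		}
		c.deviceToPod[d.ID] = pod
		return annotationPrefix + "-" + class, d.ID, true
	}
	return "", "", false
}

// release frees every device on the node that the pod was using.
func (c *cache) release(pod string, nodeDevices []device) {
	for _, d := range nodeDevices {
		if c.deviceToPod[d.ID] != pod {
			continue
		}
		delete(c.deviceToPod, d.ID)
		for _, info := range c.classes {
			if info.Selects(d.Props) {
				info.Free++
			}
		}
	}
}

func main() {
	gpu := func(p map[string]string) bool { return p["Kind"] == "nvidia-gpu" }
	c := &cache{
		classes:     map[string]*classInfo{"nvidia.high.mem": {Selects: gpu, Free: 1}},
		deviceToPod: map[string]string{},
	}
	devs := []device{{ID: "gpu0", Props: map[string]string{"Kind": "nvidia-gpu"}}}

	key, val, ok := c.assign("pod-1", "nvidia.high.mem", devs)
	fmt.Println(key, val, ok)

	_, _, ok = c.assign("pod-2", "nvidia.high.mem", devs)
	fmt.Println(ok) // false: gpu0 is already in use

	c.release("pod-1", devs)
	fmt.Println(c.classes["nvidia.high.mem"].Free) // 1
}
```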

## Kubelet Changes
Update logic at container runtime manager to look for device annotations,
prefixed by 'scheduler.alpha.kubernetes.io/resClass' and call matching device
plugins.
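
A minimal sketch of the annotation lookup is shown below. The proposal fixes only the annotation prefix. The key suffixes and the comma-separated list of device IDs used as the value here are assumptions for illustration, and `devicesFromAnnotations` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const annotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// devicesFromAnnotations collects device IDs from every annotation whose key
// starts with annotationPrefix. The value format (comma-separated IDs) is an
// assumption of this sketch.
func devicesFromAnnotations(annotations map[string]string) []string {
	var ids []string
	for k, v := range annotations {
		if !strings.HasPrefix(k, annotationPrefix) {
			continue
		}
		for _, id := range strings.Split(v, ",") {
			if id = strings.TrimSpace(id); id != "" {
				ids = append(ids, id)
			}
		}
	}
	sort.Strings(ids) // map iteration order is random, so sort for stable output
	return ids
}

func main() {
	podAnnotations := map[string]string{
		annotationPrefix + "-fast.nic":        "eth1",
		annotationPrefix + "-nvidia.high.mem": "gpu0, gpu3",
		"example.com/unrelated":               "x",
	}
	fmt.Println(devicesFromAnnotations(podAnnotations)) // [eth1 gpu0 gpu3]
}
```

The kubelet would then pass each collected device ID to the matching device plugin.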

## Opaque Integer Resources
This API will supersede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach
will no longer be necessary. Any existing resource discovery tool which updates
node objects with OIR will adapt to update node status with devices instead.


## Future Scope
* RBAC: How to tie resource classes to RBAC, like any other existing API
resource object, can be explored further.
* Nested Resource Classes: In the future, device plugins and resource classes can be
extended to support nested resource classes, where one resource
class is composed of a group of sub-resource classes. For example, a 'numa-node'
resource class composed of 'single-core' sub-resource classes.
