Add Resource Class proposal

Signed-off-by: vikaschoudhary16 <[email protected]>
kubernetes · Jul 6, 2017 · a3d30c3 · a3d30c3
1 parent 3ef47e3
Showing 1 changed file with 333 additions and 0 deletions.
diff --git a/contributors/design-proposals/resource-class.md b/contributors/design-proposals/resource-class.md
@@ -0,0 +1,333 @@
+# Resource Classes Proposal
+
+  1. [Abstract](#abstract)
+  2. [Motivation](#motivation)
+  3. [Use Cases](#use-cases)
+  4. [Objectives](#objectives)
+  5. [Non Objectives](#non-objectives)
+  6. [Resource Class](#resource-class)
+  7. [API Changes](#api-changes)
+  8. [Scheduler Changes](#scheduler-changes)
+  9. [Kubelet Changes](#kubelet-changes)
+ 10. [Opaque Integer Resources](#opaque-integer-resources)
+ 11. [Future Scope](#future-scope)
+
+_Authors:_
+
+* @vikaschoudhary16 - Vikas Choudhary &lt;[email protected]&gt;
+* @aveshagarwal - Avesh Agarwal &lt;[email protected]&gt;
+
+## Abstract
+This document describes *resource classes*, a new model for representing
+compute resources in Kubernetes. It should be read as a successor to the
+[device plugin proposal](https://github.com/kubernetes/community/pull/695/files),
+on which it depends.
+
+## Motivation
+Compute resources in Kubernetes are represented as a key-value map, with the key
+being a string and the value being a 'Quantity' which can (optionally) be
+fractional. This model works well for simple compute resources such as CPU and
+memory, but it requires an identity mapping between available resources and
+requested resources. Since 'CPU' and 'Memory' are available across all
+Kubernetes deployments, the current user-facing API (Pod Specification) remains
+portable. However, the current model cannot support new resources like GPUs,
+ASICs, NICs, local storage, etc., which can require a non-identity mapping
+between available and requested resources, as well as additional metadata about
+each resource to support heterogeneity and management at scale.
+
+_GPU Integration Example:_
+  * [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
+  * [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)
+
+_Kubernetes Meeting Notes On This:_
+  * [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
+  * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
+  * [Extensible support for hardware devices in Kubernetes (join [email protected] for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
+
+## Use Cases
+
+  * I want a compute resource type that can be created with a meaningful and
+    portable name. The resource type can also carry metadata that justifies its
+    name, for example:
+    * `nvidia.gpu.high.mem` is the name and the metadata is memory greater than 'X' GB.
+    * `fast.nic` is the name and the metadata is bandwidth greater than
+      'B' Gbps.
+  * If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
+    type device which has memory greater than or equal to 'X' GB should be able
+    to satisfy this request, independent of other device capabilities such as
+    'version' or 'nvlink locality'.
+  * Similarly, if I request a resource `fast.nic`, any NIC device with speed
+    greater than 'B' Gbps should be able to meet the request.
+  * I want a rich metadata selection interface where operators like ‘Less Than’,
+    ‘Greater Than’ and ‘In’ are supported on the compute resource metadata.
+
+## Objectives
+
+1. Define and add support in the API for a new type, *Resource Class*.
+2. Add support for *Resource Class* in the scheduler.
+
+## Non Objectives
+1. Discovery, advertisement, and allocation/deallocation of devices are expected
+   to be addressed by the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files).
+
+## Resource Class
+*Resource Class* is a new type whose objects provide an abstraction over
+[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
+A *Resource Class* object selects devices using `resourceSelector`, a list of
+selector terms, each of which holds `matchExpressions`, a list of
+(operator, key, values) requirements. A *Resource Class* object selects a device
+if at least one of the terms matches the device details. Within a term's
+`matchExpressions`, all the (operator, key, values) requirements are ANDed
+together to evaluate the result.
+
+YAML example 1:
+```yaml
+kind: ResourceClass
+metadata:
+  name: nvidia.high.mem
+spec:
+  resourceSelector:
+    -
+      matchExpressions:
+        -
+          key: "Kind"
+          operator: "In"
+          values:
+            - "nvidia-gpu"
+        -
+          key: "memory"
+          operator: "Gt"
+          values:
+            - "30G"
+```
+The above resource class will select all nvidia-gpu devices which have more
+than 30 GB of memory.
+
+YAML example 2:
+```yaml
+kind: ResourceClass
+metadata:
+  name: hugepages-1gig
+spec:
+  resourceSelector:
+    -
+      matchExpressions:
+        -
+          key: "Kind"
+          operator: "In"
+          values:
+            - "huge-pages"
+        -
+          key: "size"
+          operator: "Gt"
+          values:
+            - "1G"
+```
+The above resource class will select all huge pages with size greater than
+1 GB, since `Gt` is a strict comparison.
+
+YAML example 3:
+```yaml
+kind: ResourceClass
+metadata:
+  name: fast.nic
+spec:
+  resourceSelector:
+    -
+      matchExpressions:
+        -
+          key: "Kind"
+          operator: "In"
+          values:
+            - "nic"
+        -
+          key: "speed"
+          operator: "In"
+          values:
+            - "40GBPS"
+```
+The above resource class will select all NICs whose speed is 40 GBPS, since
+`In` matches only the listed values.
+
+
+## API Changes
+### ResourceClass
+
+Internal representation of *Resource Class*:
+
+```golang
+// +nonNamespaced=true
+// +genclient=true
+
+type ResourceClass struct {
+        metav1.TypeMeta
+        metav1.ObjectMeta
+        // Spec defines resources required
+        Spec ResourceClassSpec
+        // +optional
+        Status ResourceClassStatus
+}
+// Spec defines resources required
+type ResourceClassSpec struct {
+        // Resource Selector selects resources
+        ResourceSelector []ResourcePropertySelector
+}
+
+// A null or empty selector matches no resources
+type ResourcePropertySelector struct {
+        // A list of resource/device selector requirements. ANDed from each ResourceSelectorRequirement
+        MatchExpressions []ResourceSelectorRequirement
+}
+
+// A resource selector requirement is a selector that contains values, a key, and an operator
+// that relates the key and values
+type ResourceSelectorRequirement struct {
+        // The label key that the selector applies to
+        // +patchMergeKey=key
+        // +patchStrategy=merge
+        Key string
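+        // Values is the list of values that the operator compares against the key
+        // +optional
+        Values []string
+        // Operator represents the key's relationship to the set of values
+        Operator ResourceSelectorOperator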
+}
+type ResourceSelectorOperator string
+
+const (
+        ResourceSelectorOpIn           ResourceSelectorOperator = "In"
+        ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
+        ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
+        ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
+        ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
+        ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
+)
+```
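
A minimal sketch of how the selection semantics above could be evaluated, assuming device details are exposed as a flat `map[string]string` (the proposal does not fix their shape). The type definitions are trimmed copies of the ones above; `parseQuantity`, `matchesRequirement` and `selects` are hypothetical names introduced only for illustration. `Gt`/`Lt` are assumed to compare a single integer value with an optional decimal `K`/`M`/`G` suffix, which may differ from the final quantity format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type ResourceSelectorOperator string

const (
	ResourceSelectorOpIn           ResourceSelectorOperator = "In"
	ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
	ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
	ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
	ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
	ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
)

type ResourceSelectorRequirement struct {
	Key      string
	Values   []string
	Operator ResourceSelectorOperator
}

type ResourcePropertySelector struct {
	MatchExpressions []ResourceSelectorRequirement
}

// parseQuantity is a hypothetical helper: an integer with an optional
// decimal K/M/G suffix. It is an assumption, not the Kubernetes parser.
func parseQuantity(s string) (int64, bool) {
	mult := int64(1)
	for suffix, m := range map[string]int64{"K": 1e3, "M": 1e6, "G": 1e9} {
		if strings.HasSuffix(s, suffix) {
			s, mult = strings.TrimSuffix(s, suffix), m
			break
		}
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false
	}
	return n * mult, true
}

// matchesRequirement evaluates one (operator, key, values) requirement.
func matchesRequirement(attrs map[string]string, r ResourceSelectorRequirement) bool {
	got, present := attrs[r.Key]
	switch r.Operator {
	case ResourceSelectorOpExists:
		return present
	case ResourceSelectorOpDoesNotExist:
		return !present
	case ResourceSelectorOpIn, ResourceSelectorOpNotIn:
		found := false
		for _, v := range r.Values {
			if present && v == got {
				found = true
			}
		}
		return found == (r.Operator == ResourceSelectorOpIn)
	case ResourceSelectorOpGt, ResourceSelectorOpLt:
		if !present || len(r.Values) != 1 {
			return false
		}
		have, ok1 := parseQuantity(got)
		want, ok2 := parseQuantity(r.Values[0])
		if !ok1 || !ok2 {
			return false
		}
		if r.Operator == ResourceSelectorOpGt {
			return have > want
		}
		return have < want
	}
	return false
}

// selects ORs the selector terms and ANDs the requirements inside each term.
// A nil or empty selector matches no device.
func selects(terms []ResourcePropertySelector, attrs map[string]string) bool {
	for _, term := range terms {
		matched := len(term.MatchExpressions) > 0
		for _, r := range term.MatchExpressions {
			matched = matched && matchesRequirement(attrs, r)
		}
		if matched {
			return true
		}
	}
	return false
}

func main() {
	highMem := []ResourcePropertySelector{{MatchExpressions: []ResourceSelectorRequirement{
		{Key: "Kind", Operator: ResourceSelectorOpIn, Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: ResourceSelectorOpGt, Values: []string{"30G"}},
	}}}
	fmt.Println(selects(highMem, map[string]string{"Kind": "nvidia-gpu", "memory": "32G"}))
	fmt.Println(selects(highMem, map[string]string{"Kind": "nvidia-gpu", "memory": "16G"}))
}
```

With the selector from YAML example 1, the 32G device is selected and the 16G device is not.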
+### ResourceClassStatus
+```golang
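+type ResourceClassStatus struct {
+        // Allocatable is the quantity of devices that the resource class can select
+        Allocatable resource.Quantity
+        // Request is the quantity of those devices requested by pods
+        Request     resource.Quantity
+}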
+```
+`ResourceClassStatus` is updated by the scheduler on the following events:
+1. Creation of a new *Resource Class* object.
+2. Node addition to the cluster.
+3. Node removal from the cluster.
+4. Pod creation, if the pod requests a resource class.
+5. Pod deletion, if the pod was consuming a resource class.
+
+`ResourceClassStatus` serves the following two purposes:
+* The scheduler uses it to evaluate predicates during pod creation. For details,
+  see the [Scheduler Changes](#scheduler-changes) section.
+* Users can view the current usage and availability of the resource class
+  using kubectl.
+
+### User story
+The administrator has deployed device plugins to support hardware present in the
+cluster. Device plugins, running on nodes, will update the node status to
+indicate the presence of this hardware. To offer this hardware to applications
+deployed on Kubernetes in a portable way, the administrator creates a number of
+resource classes to represent that hardware. These resource classes will include
+metadata about the devices as selection criteria.
+
+1. A user submits a pod spec requesting 'X' resource classes.
+2. The scheduler filters out the nodes which do not match the resource requests.
+3. The scheduler selects a device for each resource class requested and annotates
+   the pod object with device selection info.
+4. Kubelet reads the device request from the pod annotation and calls `Allocate` on
+   the matching Device Plugins.
+5. The user deletes the pod or the pod terminates.
+6. Kubelet reads the pod object annotation for the devices consumed and calls
+   `Deallocate` on the matching Device Plugins.
+
+In addition to node selection, the scheduler is also responsible for selecting a
+device that matches the resource class requested by the user.
+
+### Reason for not preferring device selection at kubelet
+Kubelet does not maintain any cache. Therefore, to know the availability of a
+device, it would have to calculate the current total consumption by iterating
+over all the admitted pods running on the node. This is already done today when
+running predicates for each new incoming pod at the kubelet. Even if we assume
+that the scheduler cache and the consumption state created at runtime for each
+pod are exactly the same, the current API interfaces do not allow passing the
+selected device to the container manager (from where the device plugin will
+actually be invoked). This problem occurs because devices are determined
+internally from resource classes, while other resource requests can be
+determined directly from the pod object.
+
+To summarize, device selection at the kubelet can be done in one of the following
+two ways:
+* Select the device at pod admission while applying predicates, and change all
+  the API interfaces required to pass the selected device to the container
+  runtime manager.
+* Recreate the resource consumption state at the container manager and select
+  the device there.
+
+Neither approach seems cleaner than doing device selection at the scheduler,
+which helps to retain cleaner API interfaces between packages.
+
+## Scheduler Changes
+The scheduler already watches node and pod objects and maintains their state in
+its cache. We will enhance this logic:
+1. To watch user-created *Resource Class* objects and maintain their state in the cache.
+2. To look for device-related details in node objects and maintain accounting for
+   devices as well.
+
+From the events perspective, handling for the following events will be added/updated:
+
+### Resource Class Creation
+1. Initialize the resource class info and add it to the local cache.
+2. Iterate over all existing nodes in the cache to figure out if there are devices
+   on these nodes which are selectable by the resource class. If found, update the
+   resource class availability status in the local cache.
+3. Patch the status of the resource class API object with the availability state
+   in the local cache.
+
+### Resource Class Deletion
+Delete the resource class info from the cache.
+
+### Node Addition
+The scheduler already caches `NodeInfo`. It will now additionally update device state:
+1. Check the node status for any devices present.
+2. For each device found, iterate over all existing resource classes in the cache
+   to find resource classes which can select this particular device. For all
+   such resource classes, update the availability state in the local cache.
+3. Patch the `ResourceClassStatus` of each such ResourceClass API object with
+   the new “allocatable” value.
+
+### Node Deletion
+If the node has devices which are selectable by existing resource classes:
+1. Adjust resource class state in local cache.
+2. Update resource class status by patching api object.
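
The Resource Class creation and deletion, node addition and node deletion handlers described above can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's implementation: `Device`, `classInfo`, `cache` and all method names are hypothetical, a class's `resourceSelector` is reduced to a plain `selects` function, each device counts as one unit, and patching the API object's status is left out.

```go
package main

import "fmt"

// Device is a hypothetical, simplified view of one device reported in node status.
type Device struct {
	Name string
	Kind string
}

// classInfo is a hypothetical cache entry for one Resource Class.
type classInfo struct {
	selects     func(Device) bool // stands in for the class's resourceSelector
	allocatable int64             // devices in the cluster this class can select
}

// cache is a hypothetical slice of the scheduler cache: devices per node
// plus one entry per Resource Class.
type cache struct {
	nodes   map[string][]Device
	classes map[string]*classInfo
}

func newCache() *cache {
	return &cache{nodes: map[string][]Device{}, classes: map[string]*classInfo{}}
}

// countSelected returns how many of the devices the class can select.
func countSelected(info *classInfo, devices []Device) int64 {
	var n int64
	for _, d := range devices {
		if info.selects(d) {
			n++
		}
	}
	return n
}

// addResourceClass scans every cached node for devices the new class selects.
func (c *cache) addResourceClass(name string, selects func(Device) bool) {
	info := &classInfo{selects: selects}
	for _, devices := range c.nodes {
		info.allocatable += countSelected(info, devices)
	}
	c.classes[name] = info // the real handler would also patch the status
}

// deleteResourceClass drops the class info from the cache.
func (c *cache) deleteResourceClass(name string) { delete(c.classes, name) }

// addNode records a new node's devices and credits every class that selects
// them. It assumes the node is not already present in the cache.
func (c *cache) addNode(node string, devices []Device) {
	c.nodes[node] = devices
	for _, info := range c.classes {
		info.allocatable += countSelected(info, devices)
	}
}

// removeNode debits every class that selected the node's devices.
func (c *cache) removeNode(node string) {
	for _, info := range c.classes {
		info.allocatable -= countSelected(info, c.nodes[node])
	}
	delete(c.nodes, node)
}

func main() {
	c := newCache()
	c.addNode("node-1", []Device{{Name: "gpu0", Kind: "nvidia-gpu"}, {Name: "eth0", Kind: "nic"}})
	c.addResourceClass("nvidia.gpu", func(d Device) bool { return d.Kind == "nvidia-gpu" })
	c.addNode("node-2", []Device{{Name: "gpu0", Kind: "nvidia-gpu"}})
	fmt.Println("allocatable after adding two nodes:", c.classes["nvidia.gpu"].allocatable)
	c.removeNode("node-1")
	fmt.Println("allocatable after removing node-1:", c.classes["nvidia.gpu"].allocatable)
}
```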
+
+### Pod Creation
+1. Get the requested resource class name and quantity from the pod spec.
+2. Select nodes by applying predicates according to the requested quantity and the
+   resource class's state present in the cache.
+3. On the selected node, select a device from the device info stored in the cache
+   by matching the key and values of the requested resource class.
+4. After device selection, update (decrease) `Request` in the cache for all the
+   resource classes which could select this device.
+5. Patch the resource class objects with the new `Request` in the `ResourceClassStatus`.
+6. Add the pod reference to the local DeviceToPod mapping structure in the cache.
+7. Patch the pod object with the selected device annotation, prefixed with
+   'scheduler.alpha.kubernetes.io/resClass'.
+
+### Pod Delete
+1. Iterate over all the devices on the node to which the pod was scheduled and
+   find the devices being used by the pod.
+2. For each device consumed by the pod, update the availability state in the cache
+   of the resource classes which can select this device.
+3. Patch `ResourceClassStatus` with new availability state.
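
The pod creation and pod deletion accounting from the two sections above can be sketched as follows, for a single node. All names (`nodeState`, `bind`, `release`, `adjust`) are hypothetical. As a simplifying assumption, the sketch tracks one `free` counter per class instead of the `Allocatable`/`Request` pair, so it takes no position on how those two fields are encoded. Predicates, status patching and the pod annotation are omitted.

```go
package main

import "fmt"

// Device is a hypothetical, simplified view of one device on a node.
type Device struct {
	Name string
	Kind string
}

// classInfo is a hypothetical cache entry for one Resource Class.
type classInfo struct {
	selects func(Device) bool // stands in for the class's resourceSelector
	free    int64             // selectable devices not yet assigned to a pod
}

// nodeState holds the per-node accounting the scheduler cache would keep.
type nodeState struct {
	devices     []Device
	classes     map[string]*classInfo
	deviceToPod map[string]string // device name -> pod name
}

func newNodeState(devices []Device, selectors map[string]func(Device) bool) *nodeState {
	n := &nodeState{devices: devices, classes: map[string]*classInfo{}, deviceToPod: map[string]string{}}
	for name, sel := range selectors {
		info := &classInfo{selects: sel}
		for _, d := range devices {
			if sel(d) {
				info.free++
			}
		}
		n.classes[name] = info
	}
	return n
}

// adjust applies delta to every class that could select the device.
func (n *nodeState) adjust(d Device, delta int64) {
	for _, info := range n.classes {
		if info.selects(d) {
			info.free += delta
		}
	}
}

// bind picks the first unused device the class selects and records the pod.
func (n *nodeState) bind(pod, class string) (string, bool) {
	info, ok := n.classes[class]
	if !ok {
		return "", false
	}
	for _, d := range n.devices {
		if _, used := n.deviceToPod[d.Name]; used || !info.selects(d) {
			continue
		}
		n.deviceToPod[d.Name] = pod
		n.adjust(d, -1)
		return d.Name, true // the caller would write this into the pod annotation
	}
	return "", false
}

// release frees every device held by the pod.
func (n *nodeState) release(pod string) {
	for _, d := range n.devices {
		if owner, used := n.deviceToPod[d.Name]; used && owner == pod {
			delete(n.deviceToPod, d.Name)
			n.adjust(d, +1)
		}
	}
}

func main() {
	n := newNodeState(
		[]Device{{Name: "gpu0", Kind: "nvidia-gpu"}, {Name: "gpu1", Kind: "nvidia-gpu"}},
		map[string]func(Device) bool{"nvidia.gpu": func(d Device) bool { return d.Kind == "nvidia-gpu" }},
	)
	dev, ok := n.bind("pod-a", "nvidia.gpu")
	fmt.Println(dev, ok, n.classes["nvidia.gpu"].free)
	n.release("pod-a")
	fmt.Println(n.classes["nvidia.gpu"].free)
}
```

Because `adjust` walks every class, a device that is selectable by several classes is debited from all of them when it is bound, which mirrors step 4 of pod creation.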
+
+## Kubelet Changes
+Update the logic in the container runtime manager to look for device annotations
+prefixed by 'scheduler.alpha.kubernetes.io/resClass' and call the matching device
+plugins.
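
A minimal sketch of the annotation lookup described above. Only the prefix comes from this proposal. The key suffix and the annotation value shown in `main` are assumptions, because the proposal does not define the annotation format, and `deviceAnnotations` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Prefix taken from the proposal text.
const resClassAnnotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// deviceAnnotations returns, in sorted order, the keys of all pod
// annotations that carry the resource class prefix.
func deviceAnnotations(annotations map[string]string) []string {
	var keys []string
	for k := range annotations {
		if strings.HasPrefix(k, resClassAnnotationPrefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	annotations := map[string]string{
		// Hypothetical key suffix and value; the real format is not specified.
		resClassAnnotationPrefix + "-nvidia.high.mem": "gpu0",
		"example.com/unrelated":                       "x",
	}
	for _, k := range deviceAnnotations(annotations) {
		// A real kubelet would call the matching device plugin here.
		fmt.Println(k, "->", annotations[k])
	}
}
```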
+
+## Opaque Integer Resources
+This API will supersede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
+(OIR). External agents can continue to attach additional 'opaque' resources to
+nodes, but the special naming scheme that is part of the current OIR approach
+will no longer be necessary. Any existing resource discovery tool which updates
+node objects with OIR will adapt to update the node status with devices instead.
+
+
+## Future Scope
+* RBAC: How to tie resource classes to RBAC, as with any other existing API
+  resource object, can be explored further.
+* Nested Resource Classes: In the future, device plugins and resource classes can
+  be extended to support nested resource classes, where one resource class is
+  composed of a group of sub-resource classes. For example, a 'numa-node'
+  resource class composed of 'single-core' sub-resource classes.