From ef8916b54bec2a4adbb98c7e15183deae4fcb215 Mon Sep 17 00:00:00 2001 From: vikaschoudhary16 Date: Thu, 14 Jun 2018 03:05:59 -0400 Subject: [PATCH 1/5] Reserve KEP number for new resource api proposal --- keps/NEXT_KEP_NUMBER | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/NEXT_KEP_NUMBER b/keps/NEXT_KEP_NUMBER index 8351c19397f..60d3b2f4a4c 100644 --- a/keps/NEXT_KEP_NUMBER +++ b/keps/NEXT_KEP_NUMBER @@ -1 +1 @@ -14 +15 From a6af1636a8fc3cfdda6f533661172f70e621bdb0 Mon Sep 17 00:00:00 2001 From: vikaschoudhary16 Date: Thu, 14 Jun 2018 03:12:28 -0400 Subject: [PATCH 2/5] KEP: New Resource API --- keps/sig-node/00014-resource-api.md | 425 ++++++++++++++++++++++++++++ 1 file changed, 425 insertions(+) create mode 100644 keps/sig-node/00014-resource-api.md diff --git a/keps/sig-node/00014-resource-api.md b/keps/sig-node/00014-resource-api.md new file mode 100644 index 00000000000..679c999e537 --- /dev/null +++ b/keps/sig-node/00014-resource-api.md @@ -0,0 +1,425 @@ +--- +kep-number: 14 +title: New Resource API Proposal +authors: + - "@vikaschoudhary16" + - "@jiayingz" +owning-sig: sig-node +participating-sigs: + - sig-scheduling +reviewers: + - "@thockin" + - "@derekwaynecarr" + - "@dchen1107" + - "@bsalamat" + - "@vishh" +approvers: + - "@sig-node-leads" +editor: "@vikaschoudhary16" +creation-date: "2018-06-14" +last-updated: "2018-06-14" +status: provisional +--- +# New Resource API Proposal + +Table of Contents +================= +* [Abstract](#abstract) +* [Background](#background) +* [Use Stories](#user-stories) + * [As a cluster operator](#as-a-cluster-operator) + * [As a developer](#as-a-developer) + * [As a vendor](#as-a-vendor) +* [Objectives](#objectives) +* [Non Objectives](#non-objectives) +* [Components](#components) + * [ResourceClass API](#resourceclass-api) + * [Kubelet Extension](#kubelet-extension) + * [Scheduler Extension](#scheduler-extension) + * [Quota Extension](#quota-extension) +* [Roadmap](#roadmap) + +## Abstract +In this document we will describe a new resource API model to better support non-native compute resources on Kubernetes. + +## Background +We are seeing increasing needs to better support non-native compute resources on Kubernetes that cover a wide range of resources such as GPUs, High-performance NICs, Infiniband and FPGAs. Such resources often require vendor specific setup, and have rich sets of different properties even across devices of the same type. This brings new requirements for Kubernetes to better support non-native compute resources to allow vendor specific setup, dynamic resource exporting, flexible resource configuration, and portable resource specification. + +The device plugin support added in Kubernetes 1.8 makes it easy for vendors to dynamically export their resources through a plugin API without changing Kubernetes core code. Taking a further step, this document proposes a new resource abstraction API, ResourceClass, that can be used to describe, manage and consume such vendor specific and metadata rich resources in simple and portable ways. + +## Use Stories +### As a cluster operator: +- Nodes in my cluster has GPU HW from different generations. I want to classify GPU nodes into one of the three categories, silver, gold and platinum depending upon the launch timeline of the GPU family eg: Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.
**Motivation:** As time progresses in a cluster lifecycle, new advanced, high-performance, expensive variants of GPUs get added to the cluster nodes. At the same time, older variants continue to co-exist. There are workloads that strictly want the latest GPUs, and there are also workloads that are fine with older GPUs. But since there is a wide range of types, it would be hard to manage, and confusing at the same time, to have granularity at each GPU type. Grouping into a few broad categories is more convenient to manage.<br>
**Can this be solved without resource classes:** A unique taint can be used to represent a category like silver. Nodes can be tainted accordingly, depending upon the type of GPUs available. User pods can use tolerations to steer workloads to the appropriate nodes. But the problem is: how do we restrict a user pod from using a toleration that it should not be using?<br>
**How Resource classes can solve this:** I, the operator/admin, create three resource classes: GPU-Platinum, GPU-Gold, GPU-Silver. Now, since resource classes are quota controlled, end users will be able to request resource classes only if quota is allocated.

- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and shares a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.<br>
**Motivation:** Increased performance because of local access. Local access also helps make better use of the cache.<br>
**How Resource classes can solve this:** The property/attribute that forms the grouping can be advertised in the device attributes, and a resource class can then be created to form a grouped super-resource based on that property.<br>
**Can this be solved without resource classes:** No

- I want to have quota control on the devices at the granularity of device properties. For example, I want to have a separate quota for ECC-enabled GPUs. I do not want a specific user to be able to use more than ‘N’ ECC-enabled GPUs overall at the namespace level.<br>
**Motivation:** This will make it possible to charge users for specialized HW consumption. Since specialized HW is costly, as an operator I want to have this capability.<br>
+**How Resource classes can solve this:** Quota will be supported on resource class objects and by allowing resource request in user pods via resource class, charging policy can be linked with resource consumption.
**Can this be solved without resource classes:** No

- In my cluster, I have many different classes (different capabilities) of a device type (ex: NICs). End user’s expectations are met as long as the device has a very small subset of these capabilities. I want a mechanism where the end user can request devices that satisfy their minimum expectation.
A few nodes are connected to the data network over 40 Gig NICs and others are connected over normal 1 Gig NICs. I want end-user pods to be able to request
data network connectivity with high network performance while
in the default case, data network connectivity is offered via normal 1 Gbps NICs.<br>
+**Motivation:** If some workloads demand higher network bandwidth, it should be possible to run these workloads on selected nodes.
**Can this be solved without resource classes:** Taints and tolerations can help in steering pods, but the problem is that there is no way today to have access control over the use of tolerations; therefore, when there are multiple users, it is not possible to control which tolerations are allowed.<br>
**How Resource classes can solve this:** I can define a ResourceClass for the high-performance NIC with minimum bandwidth requirements, and make sure only users with proper quota can use such resources.

- I want to be able to utilize different 'types' of a HW resource (not necessarily from the same vendor) while not losing workload portability when moving from one cluster/cloud to another. There can be one type of Nvidia GPUs on one cluster and another type of Nvidia GPUs on another cluster. This is an example of different ‘types’ of a HW resource (GPU). I want to offer GPUs to be consumed under the same portable name, as long as their capabilities are almost the same. If pods are consuming these GPUs with a generic resource class name, workloads can be migrated from one cluster to another transparently.<br>
+ +**Quoting Henry Saputra (From Ebay) for the record:**
+>Currently we ask developers to submit resource specifications for GPU using name of the cards to our data center : +> +>"accelerator": { +>"type": "gpu", +>"quantity": "1", +>"labels": { +>"product": "nvidia", +>"family": "tesla", +>"model": "m40" +>} +>} +> +>But when we go to other cloud such as Google or AWS they may not have the same cards. +> +>So I was wondering if we could offer resource such as CUDA cores and memory as resource specifications rather actual name and type of the cards. +> + +**Motivation:** less downtime, optimal use of resources
+**How Resource classes can solve this:** Explained above
**Can this be solved without resource classes:** No

### As a developer:
- I want the ability to request devices that have specific capabilities, e.g., GPUs that are Volta or newer.<br>
+ **Motivation:** I want minimum guaranteed compute performance
+ **Can this be solved without resource classes:**
 - Yes, using node labels and NodeLabelSelectors.
 Problem: The same problem of lack of access control on using label selectors at the user level as with the use of tolerations.
 - OR, instead of using a resource class, provide the flexibility to query resource properties directly in pod container resource requests.
 Problem: In a large cluster, evaluating operators like “greater than” and “less than” at pod creation can be a very slow operation and is not scalable.

 **How Resource classes can solve this:**
The Kubernetes scheduler is the central place to map container resource requests expressed through ResourceClass names to the underlying qualified physical resources, which automatically supports metadata-aware resource scheduling.

- As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details. I want the same workload to run either on on-prem Kubernetes clusters or in the cloud, without changing its pod spec. When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.<br>
+**Motivation:** Separation of concerns between cluster operators and cluster users.
+**Can this be solved without resource classes:**
+Without the additional abstraction layer, consuming the non-standard, metadata-rich compute resources would be fragmented. More likely, we would see cluster providers implement their own solutions to address their user pains, and it would be hard to provide a consistent user experience for consuming extended resources in the future. + +### As a vendor: +- I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware.
+**Motivation:** enables more compute resources and their advanced features on Kubernetes
+**Can this be solved without resource classes:**
Yes, using node labels and NodeLabelSelectors.<br>
+Problem: Lack of access control and lack of the ability to differentiate between hardware properties on the same node. E.g., if on the same node, some GPU devices are connected through nvlink while others are connected through PCI-e, vendors don’t have ways to export such resource properties that can have very different performance impacts.
+**How Resource classes can solve this:**
+Vendors can use DevicePlugin API to propagate new hardware features, and provide best-practice ResourceClass spec to consume their new hardware or new hardware features on Kubernetes. Vendors don’t need to worry supporting this new hardware would break existing use cases on old hardware because the Kubernetes scheduler takes the resource metadata into account during pod scheduling, and so only pods that explicitly request this new hardware through the corresponding ResourceClass name will be allocated with such resources. + +## Objectives +Essentially, a ResourceClass object maps a non-native compute resource with a specific set of properties to a portable name. Cluster admins can create different ResourceClass objects with the same generic name on different clusters to match the underlying hardware configuration. Users can then use the portable names to consume the matching compute resources. Through this extra abstraction layer, we are hoping to achieve the following goals: +- **Allows workloads to request compute resources with wide range of properties in simple and standard way.** We propose to introduce a new `ComputeResource` API and a field, `ComputeResources` in the `Node.Status` to store a list of `ComputeResource` objects. Kubelet can use a `ComputeResource` object to encapsulate the resource metadata information associated with its underlying physical compute resources and propagate this information to the scheduler by appending it to the `ComputeResources` list in the `Node.Status`. With the resource metadata information, the Kubernetes scheduler can determine the fitness of a node for a container resource request expressed through ResourceClass name by evaluating whether the node has enough unallocated units of a ComputeResource matching the property constraints specified in the ResourceClass. This allows the Kubernetes scheduler to take resource property into account to schedule pods on the right node whose hardware configuration meets the specific resource metadata constraints. +- **Allows cluster admins to configure and manage different kinds of non-native compute resources in flexible and simple ways.** A cluster admin creates a ResourceClass object that specifies a portable ResourceClass name (e.g., `fast-nic-gold`), and list of property matching constraints (e.g., `resourceName in (solarflare.com/fast-nic, intel.com/fast-nic)`, or `type=XtremeScale-8000`, or `bandwidth=100G`, or `zone in (us-west1-b, us-west1-c)`). The property matching constraints follow the generic LabelSelector format, which allows us to cover a wide range of resource specific properties. The cluster admin can then define and manage resource quota with the created ResourceClass object. +- **Allows vendors to export their resources on Kubernetes more easily.** The device plugin that a vendor needs to implement to make their resource available on Kubernetes lives outside Kubernetes core repository. The device plugin API will be extended to pass device properties from device plugin to Kubelet. Kubelet will propagate this information to the scheduler through ComputeResource and the scheduler will match a ComputeResource with certain properties to the best matching ResourceClass, and support resource requests expressed through ResourceClass name. The device plugin only needs to retrieve device properties through some device-specific API or tool, without needing to watch or understand either ComputeResource objects or ResourceClass objects. 
+- **Provides a unified interface to interpret compute resources across various system components such as Quota and Container Resource Spec.** By introducing ResourceClass as a first-class API object, we provide a built-in solution for users to define their special resource constraints through this API, to request such resources through the existing Container Resource Spec, to limit access for such resources through the existing resource Quota component and to ensure their Pods land on the nodes with the matching physical resources through the default Kubernetes scheduling. +- **Supports node-level resources as well as cluster-level resources.** Certain types of compute resources are tied to single nodes and are only accessible locally on those nodes. On the other hand, some types of compute resources such as network attached resources, can be dynamically bound to a chosen node that doesn’t have the resource available till the binding finishes. For such resources, the metadata constraints specified through ResourceClass can be consumed by a standalone controller or a scheduler extender during their resource provisioning or scheduler filtering so that the resource can be provisioned properly to meet the specified metadata constraints. +- **Supports cluster auto scaling for extended resources.** We have seen challenges on how to make cluster autoscaler work seamlessly with dynamically exported extended resources. In particular, for node level extended resources that are exported by a device plugin, cluster autoscaler needs to know what resources will be exported on a newly created node, how much of such resources will be exported and how long it will take for the resource to be exported on the node. Otherwise, it would keep creating new nodes for the pending pod during this time gap. For cluster level extended resources, their resource provisionings are generally performed dynamically by a separate controller. Cluster autoscaler needs to be taught to filter out the resource requests for such resources for the pending pod so that it can create right type of node based on node level resource requests. Note that Kubelet and the scheduler have the similar need to ignore such resource requests during their `PodFitsResources` evaluation. As we are introducing the new resource API that can be used to export arbitrary resource metadata along with extended resources, we need to define a general mechanism for cluster autoscaler to learn the upcoming resource property and capacity on a new node and ensure a consistent resource evaluation policy among cluster autoscaler, scheduler and Kubelet. +- **Defines an easy and seamless migration path** for clusters to adopt ResourceClass even if they have existing pods requesting compute resources through raw resource name. In particular, suppose a cluster is running some workloads that have non native compute resources, such as `nvidia.com/gpu`, in their pod resource requirements. Such workloads should still be scheduled properly when the cluster admin creates a ResourceClass that matches the gpu resources in the cluster. Furthermore, we want to support the upgrade scenario that new resource properties can be added for a resource, e.g., through device plugin upgrade and cluster admins may define a new ResourceClass based on the newly exported resource properties without breaking the use of old ResourceClasses. + +## Non Objectives +- Extends the current resource requirement API of the container spec. 
The current resource requirement API is basically a “name:value” list. A commonly arising question is whether we should extend this API to support resource metadata requirements. We can consider this as a possible extension orthogonal to the ResourceClass proposal. A primary reason we propose to introduce ResourceClass API first is because non-native compute resources usually lack standard resource properties. Although there are benefits to allow users to directly express their resource metadata requirements in their container spec, it may also compromise workload portability if not used carefully. It is also hard to implment resource quota when users directly declare resource metadata requirements in their Container spec. By introducing ResourceClass as an additional resource abstraction layer, users can express their special resource requirements through a high-level portable name, and cluster admins can configure compute resources properly on different environments to meet such requirements. We feel this helps promote portability and separation of concerns, while still maintains API compatibility. +- Unifies with the StorageClass API. Although ResourceClass shares many similar motivations and requirements as the existing StorageClass API, they focus on different kinds of resources. StorageClass is used to represent storage resources that are stateful and contains special storage semantics. ResourceClass, on the other hand, focuses on stateless compute resources, whose usage is bound to container lifecycle and can’t be shared across multiple nodes at the same time. For these reasons, we don’t plan to unify the two APIs. +- Resource overcommitment, fractional resource requirements, native compute resource (i.e., cpu and memory) with special metadata requirements, and group compute resources. They are out of our current scope. + +## Components +### ResourceClass API +During the initial phase, we propose to start with the following ResourceClass API spec that defines the basic ResourceClass name to the underlying node level compute resource plus metadata matching. We can extend this API to better support cluster resources and group resources in the following phases of development. However, this document will mostly focus on the design to support initial phase requirements. + +```golang +// +nonNamespaced=true +// +genclient=true + +type ResourceClass struct { + metav1.TypeMeta + metav1.ObjectMeta + Spec ResourceClassSpec + // +optional +} + +type ResourceClassSpec struct { + // defines general resource property matching constraints. + // e.g.: zone in { us-west1-b, us-west1-c }; type: k80 + MetadataRequirements metav1.LabelSelector + // used to compare preference of two matching ResourceClasses + // The purpose to introduce this field is explained more later + Priority int +} +``` + +YAML example 1: +```yaml +kind: ResourceClass +metadata: + name: nvidia.high.mem +spec: + labelSelector: + - matchExpressions: + - key: "Kind" + operator: "In" + values: + - "nvidia-gpu" + - key: "memory" + operator: "GtEq" + values: + - "30G" + +kind: Pod +metadata: + name: example-pod +spec: + containers: + - name: example-container + resources: + limits: + nvidia.high.mem: 2 + +``` +Above resource class will select all the nvidia-gpus which have memory greater +than and equal to 30 GB. 
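
To make the selection semantics above concrete, the following is a minimal, illustrative Go sketch of how a matcher could evaluate a device's property map against the `nvidia.high.mem` constraints. The `Requirement` type, the `matches` helper, and the quantity-based handling of the `GtEq` operator are assumptions made for this sketch; the proposal itself only states that the constraints follow the generic LabelSelector format.

```golang
// Illustrative only: a simplified matcher for ResourceClass-style property
// constraints. The Requirement struct and the "GtEq" handling are assumptions
// made for this sketch, not a definitive implementation.
package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

type Requirement struct {
    Key      string
    Operator string // "In", "Eq", "GtEq" (GtEq assumed for quantity-valued properties)
    Values   []string
}

// matches reports whether a device's property map satisfies all requirements.
func matches(props map[string]string, reqs []Requirement) bool {
    for _, r := range reqs {
        val, ok := props[r.Key]
        if !ok {
            return false
        }
        switch r.Operator {
        case "In", "Eq":
            found := false
            for _, v := range r.Values {
                if v == val {
                    found = true
                    break
                }
            }
            if !found {
                return false
            }
        case "GtEq":
            // Compare quantity-valued properties such as "memory: 30G".
            have, err1 := resource.ParseQuantity(val)
            want, err2 := resource.ParseQuantity(r.Values[0])
            if err1 != nil || err2 != nil || have.Cmp(want) < 0 {
                return false
            }
        default:
            return false
        }
    }
    return true
}

func main() {
    // Properties as a device plugin might report them for one GPU.
    gpu := map[string]string{"Kind": "nvidia-gpu", "memory": "32G"}

    nvidiaHighMem := []Requirement{
        {Key: "Kind", Operator: "In", Values: []string{"nvidia-gpu"}},
        {Key: "memory", Operator: "GtEq", Values: []string{"30G"}},
    }
    fmt.Println(matches(gpu, nvidiaHighMem)) // true: 32G >= 30G
}
```

In this sketch, a device advertising `memory: 32G` satisfies the class, while one advertising `memory: 16G` would not.
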
+ +YAML example 2: +```yaml +kind: ResourceClass +metadata: + name: fast.nic +spec: + labelSelector: + - matchExpressions: + - key: "Kind" + operator: "In" + values: + - "nic" + - key: "speed" + operator: "GtEq" + values: + - "40GBPS" + +kind: Pod +metadata: + name: example-pod +spec: + containers: + - name: example-container + resources: + limits: + fast.nic: 1 +``` +Above resource class will select all the NICs with speed greater than equal to +40 GBPS. + +### Kubelet Extension +On node level, extended resources can be exported automatically by a third-party plugin through the Device Plugin API. We propose to extend the current Device Plugin API that allows device plugins to send to the Kubelet per-device properties during device listing. Exporting device properties at per-device level instead of per-resource level allows a device plugin to manage devices with heterogeneous properties. + +After receiving device availability and property information from a device plugin, Kubelet needs to propagate this information to the scheduler so that scheduler can take resource metadata into account when it is making the scheduling decision. We propose to add a new `ComputeResources` array field in NodeStatus API to represent a list of the `ComputeResource` instances where each represent a device resource and the associated resource properties. Once a node is configured to support ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields to avoid resource multiple counting. During the initial phase, we plan to start with exporting extended resources through the ComputeResource API but leaves primary resources in its current exporting model. We can extend the ComputeResource model to support primary resources later after getting more experience through the initial phase. Kubelet will update ComputeResources field upon any resource availability or property change for node-level resources. + +We propose to start with the following struct definition: + +```golang +type NodeStatus struct { + … + ComputeResources []ComputeResource + … +} +type ComputeResource struct { + // raw resource name. E.g.: nvidia.com/gpu + ResourceName string + // resource metadata received from device plugin. + // e.g., gpuType: k80, zone: us-west1-b + Properties map[string]string + // list of deviceIds received from device plugin. + // e.g., ["nvida0", "nvidia1"] + Devices []string +} +``` + +Possible fields we may consider to add later include: +- `DeviceUnits resource.Quantity`. This field can be used to support fractional + resource or infinite resource. In a more advanced use case, a device plugin may + even advertise a single Device with X DeviceUnits so that it can make its own + device allocation decisions, although this usually require the device plugin + to implement its own complex logic to track resource life cycle. +- `Owner string`. Can be Kubelet or some cluster-level controller to indicate + the ownership and scope of the resource. +- `IsolationGuarantee string`. Can map to "ContainerLevel", or "PodLevel", or + "NodeLevel" to support resource sharing with different levels of isolation + guarantees. + +Note we intentially leave these fields out of the initial design to limit the scope +of this proposal. 
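
Below is a minimal, illustrative Go sketch of how a device manager could fold per-device properties reported by a device plugin into `ComputeResource` entries, grouping devices that share an identical property set. The `pluginDevice` type and the grouping-key derivation are assumptions for this sketch; the exact device plugin API extension is not specified in this proposal.

```golang
// Illustrative sketch: fold per-device plugin reports into ComputeResource
// entries. pluginDevice and propertyKey are assumptions for this sketch.
package main

import (
    "fmt"
    "sort"
    "strings"
)

type pluginDevice struct {
    ID         string
    Properties map[string]string
}

type ComputeResource struct {
    ResourceName string
    Properties   map[string]string
    Devices      []string
}

// groupDevices collapses devices that share an identical property set into a
// single ComputeResource, so a heterogeneous plugin yields several entries.
func groupDevices(resourceName string, devs []pluginDevice) []ComputeResource {
    byKey := map[string]*ComputeResource{}
    for _, d := range devs {
        key := propertyKey(d.Properties)
        cr, ok := byKey[key]
        if !ok {
            cr = &ComputeResource{ResourceName: resourceName, Properties: d.Properties}
            byKey[key] = cr
        }
        cr.Devices = append(cr.Devices, d.ID)
    }
    out := make([]ComputeResource, 0, len(byKey))
    for _, cr := range byKey {
        out = append(out, *cr)
    }
    return out
}

// propertyKey builds a deterministic key from a property map.
func propertyKey(props map[string]string) string {
    keys := make([]string, 0, len(props))
    for k := range props {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    var b strings.Builder
    for _, k := range keys {
        fmt.Fprintf(&b, "%s=%s;", k, props[k])
    }
    return b.String()
}

func main() {
    devs := []pluginDevice{
        {ID: "nvidia0", Properties: map[string]string{"gpuType": "k80"}},
        {ID: "nvidia1", Properties: map[string]string{"gpuType": "k80"}},
        {ID: "nvidia2", Properties: map[string]string{"gpuType": "p100", "Nvlink": "true"}},
    }
    for _, cr := range groupDevices("nvidia.com/gpu", devs) {
        fmt.Printf("%s %v -> %v\n", cr.ResourceName, cr.Properties, cr.Devices)
    }
}
```

With heterogeneous devices (e.g., some GPUs exposing `Nvlink: true`), the plugin's device list naturally splits into several ComputeResource entries, which is what lets the scheduler match them against different ResourceClasses.
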
+ +### Scheduler Extension +The scheduler needs to watch for NodeStatus ComputeResources field changes and ResourceClass object updates and caches the binding information between the ResourceClass and the matchingComputeResources so that it can serve container resource request expressed through ResourceClass names. + +A natural question is how we should define the matching behavior. Suppose there are two ResourceClass objects. ResourceClass RC1 has metadata matching constraint “property1 = value1”, and ResourceClass RC2 has metadata matching constraint “property2 = value2”. Suppose a ComputeResource has both “property1: value1” and “property2: value2” properties, and so match both ResourceClasses. Now should the scheduler consider this ComputeResource as qualified for both ResourceClasses RC1 and RC2, or only one of them? + +We feel the desired answer to this question may vary across different types of resources, properties and use cases. To illustrate this, lets consider the following example: A GPU device plugin in a cluster with different types of GPUs may be configured to export a single property, "Type", at the beginning. To support per GPU type resource quota, cluster admins may define the following ResourceClasses: + +```yaml +kind: ResourceClass +metadata: + name: nvidia-k80 +spec: + labelSelector: + - matchExpressions: + - key: "Type" + operator: "Eq" + values: + - "nvidia-tesla-k80" + +kind: ResourceClass +metadata: + name: nvidia-p100 +spec: + labelSelector: + - matchExpressions: + - key: "Type" + operator: "Eq" + values: + - "nvidia-tesla-p100" +``` + +Later on, suppose the cluster admins add a new GPU node group with a new version of GPU device plugin that exports another resource property "Nvlink" which will be set true for nvidia-tesla-p100 GPUs connected through nvlinks. To utilize this new feature, the cluster admins define the following new ResourceClass with nvlink constraints: + +```yaml +kind: ResourceClass +metadata: + name: nvidia-p100-nvlink +spec: + labelSelector: + - matchExpressions: + - key: "Type" + operator: "Eq" + values: + - "nvidia-tesla-p100" + - key: "Nvlink" + operator: "Eq" + values: + - "true" +``` + +Now we face the question that whether the scheduler should allow Pods requesting +"nvidia-p100" to land on a node in this new GPU node groups. So far, we have +received different feedbacks on this question. In some use cases, users would +like to have minimum matching behavior that as long as the underlying hardware +matches the minimal requirements specified through ResourceClass contraints, +they want to allow Pods to be scheduled on the hardware. On the other hand, some users desire to reserve expensive hardware resources for users who explicitly request them. +We feel both use cases are valid requirements. Allowing a ComputeResource to match +multiple ResourceClasses as long as it matches their matching constraints +perhaps yields least surprising behavior to users and also simplies upgrade +scenario as new resource properties are introduced into the system. Therefore we +support this behavior by default. To also provide an easy way for cluster admins +to reserve expensive compute resources and control their access with resource +quota, we propose to include a Priority field in ResourceClass API. +By default, the value of this field is set to zero, but cluster admins can set +it to a higher value, which would prevent its matching compute resources from +being matched by lower priority ResourceClasses. 
i.e., +when a ComputeResource matches multiple ResourceClasses with different Priority values, the scheduler will choose those with the highest Priority. +Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through raw resource name can continue to be scheduled properly when administrators add ResourceClass in a cluster. To guarantee this, the scheduler may just consider raw resource as a special ResourceClass with empty resource metadata constraints. + +Because a ComputeResource can match multiple ResourceClasses, Scheduler and Kubelet need to ensure a consistent view on ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision. This way, when another Pod comes in that requests RC2, the scheduler can determine whether Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2. + +To maintain and propagate ResourceClass to ComputeResource binding information, the scheduler will need to record this information in a newly introduced ContainerSpec field, similar to the existing NodeName field, and Kubelet will need to consume this information. During the initial implementation, we propose to encode the ResourceClass to the underlying compute resource binding information in a new `AllocatedDeviceIDs map[v1.ResourceName][]types.UID` field in ContainerSpec. Adding this field has been discussed as a possible solution to support other use cases, such as third-party resource monitoring and network device plugins. For the purpose to support ResourceClass, we will extend the scheduler NodeInfo cache to store ResourceClass to the matching ComputeResource information on the node. For a given ComputeResource, its capacity will be reflected in NodeInfo.allocatableResource with all matching ResourceClass names. This way, the current node resource fitness evaluation will stay most the same. After a pod is bound to a node, the scheduler will choose the requested number of devices from the matching ComputeResource on the node, and record this information in the mentioned new field. After that, it increases the NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. Note that if the AllocatedDeviceIDs field is pre-specified, scheduler should honor this binding instead of overwriting it, similar to how it handles pre-specified NodeName. + +A main reason we propose to have the scheduler make and record device level +scheduling decision is so that the scheduler can maintain accurate resource acounting information. +The matching from a ResourceClass to the underlying compute resources may change +from two kinds of updates. First, cluster admins may want to add, delete, or modify a ResourceClass by adding or removing some metadata constraints or changing its priority. 
+Second, the properties of a physical resource may change, e.g., through a device plugin or node upgrade. +With the device level allocation information recorded in ContainerSpec, the scheduler can maintain and rebuild the NodeInfo.requestedResource cache information, even though ResourceClasses may be modified or ComputeResource properties may have changed during its offline time. + +We do notice that keeping track of ResourceClass to the underlying compute +resource binding may bring scaling concern on the scheduler. In particular, +during ResourceClass addition, deletion, and update, the scheduler needs to scan +through all the cached NodeInfo in the cluster to update the cached +ResourceClass to ComputeResource matching information. This can be an expensive +operation when the cluster has a lot of nodes, and during the time, pods can not +be scheduled as the scheduler needs to hold cache lock during its NodeInfo +traversal. + +Our proposed plan is to manage the ResourceClass to the underlying compute +resource matching at the scheduler during the initial implementation, so that we have a central place to track this +information. +With the initial implementation, we will add scaling tests to evaluate the +system scaling limits in different dimensions, such as the number of properties +a ComputeResource may expose, the number of devices a ComputeResource may have, the number of nodes in a cluster can have ComputeResource, and the number of ResourceClasses a cluster can have. Based on the performance results we get, we can explore further optimizations such as having a separate controller to watch NodeStatus ComputeResource updates and propagates the cluster-level aggregated compute resource information to the scheduler through a new ComputeResource API object, or limit the dynamic level of updates we allow for ComputeResource property changes (e.g., we may require that ComputeResource property updates on a node requires node drain) and ResourceClass changes (e.g., don't allow existing ResourceClass to be modified). + +### Quota Extension +As mentioned earlier, an important goal of introducing ResourceClass is to allow cluster admins to manage non-native compute resource quota more easily. With the current resource quota system, admins can define the hard limit on the overall quantity of pod resource requests through raw resource name. They don’t have a mechanism to express finer-granularity resource request constraints that map to resources with certain metadata constraints. Following the previous GPU example, admins may want to constraint P100 and K80 GPU requests in a given namespace, other than the total GPU resource requests. By allowing admins to express resource quota constraints at ResourceClass level, we can now support more flexible resource quota specification. + +This flexibility also brings an interesting question on how we can enable this benefit while also maintain backward compatibility. We have previously discussed the importance to maintain backward compatibility and how we may extend the scheduler so that pods requesting extended resources through raw resource name can still be scheduled properly. Now a natural question is whether ResourceClass quota constraints should also limit how pods’ raw resource requests are bound to available ResourceClass resources. + +Note that resource quota enforcement and resource request fitting validation are performed by different components. 
The former enforcement is implemented in the Quota admission controller, while the later task is performed by the scheduler. Enforcing quota constraints at scheduling time would require significant amount of change. We hope to avoid this complexity at the initial stage. Instead, we propose to keep resource quota at its current meaning, i.e., a resource quota limits how much resource can be requested by all of the pods in their container specs in a given namespace. We feel this still allows cluster admins to gradually take advantage of ResourceClass through a manageable migration path. Again, let us walk through an example migration scenario to illustrate how cluster admins can gradually use ResourceClass to express finer-granularity quota constraints: + +Step 1: Suppose a cluster already has the following quota constraint to limit total number of gpu devices that pods in a given namespace may request: + +```yaml +kind: ResourceQuota +metadata: + name: gpu-quota-example +spec: + Hard: + nvidia.com/gpu: “100” +``` +Step 2: Now suppose the cluster admin want to enforce finer-granularity resource constraints through ResourceClass. Following the previous example, the admin defines two resource classes, gpu-silver that maps to K80 gpu devices, and gpu-gold that maps to P100 gpu devices. The cluster admin then migrates a small percent, say 10%, of workloads in the namespace to request resources through the created ResourceClass names in e.g., 3:2 ratio. The admin can then modify the existing quota spec as follows to express the current resource constraints: + +```yaml +kind: ResourceQuota +metadata: + name: gpu-quota-example +spec: + Hard: + nvidia.com/gpu: “90” + gpu-silver: "6" + gpu-gold: “4” +``` +Step 3: Now suppose the experiment of using ResourceClass is successful, and the cluster admin have converted all the workloads running in the namespace to request resources through ResourClass name, the admin can then enforce the finer granularity resource quota across all workloads by modifying the quota spec as follows: + +```yaml +kind: ResourceQuota +metadata: + name: gpu-quota-example +spec: + Hard: + nvidia.com/gpu: “0” + gpu-silver: "60" + gpu-gold: “40” +``` + +It is easy to notice that we propose to avoid introducing any changes to the existing quota tracking system. It is possible that in the future, we may need to extend the scheduler to enforce certain quota constraints to support more advanced use cases like better batch job scheduling. When that time comes, the ResourceClass API can be extended to directly express quota constraints, but we will leave that discussion outside the current design. + +## Roadmap +### Phase 1: Support ResourceClass based scheduling for node level resources +Defines the ComputeResource API, the ResourceClass API, and extends Kubelet, scheduler, and device plugin API to support ResourceClass based resource requirements. + +### Phase 2: Support auto scaling and cluster resources +As we are offering more flexibility on introducing new types of compute resources into Kubernetes, we are also seeing more challenges on how to make cluster autoscaling work seamlessly with such wide range, dynamically exported, and probably dynamically bound resources. Here is the list of problems we have seen in the past related to auto scaling and auto provisioning in the presence of extended resources. 
+ +- For node level extended resources that are exported by a device plugin, cluster autoscaler needs to know what resources will be exported on a newly created node, how much of such resources will be exported, and how long it will take for the resource to be exported on the node. Otherwise, it would keep creating new nodes for the pending pod during this time gap. +- For cluster level extended resources, their resource provisionings are generally performed dynamically by a separate controller. Cluster autoscaler and auto provisioner need to be taught to filter out the resource requests for such resources for the pending pod so that it can create right type of node based on node level resource requests. Note that Kubelet and the scheduler have the similar need to ignore such resource requests during their PodFitsResources evaluation. By exporting this information through the ResourceClass API, we can ensure a consistent resource evaluation policy among these components. + +During the second phase, we will design and implement a general solution to +support both node level and cluster level extended compute resources with cluster autoscaler. +We may introduce a Scope field in the ResourceClass API to indicate whether it +maps to node level resource or cluster level resource. For node level extended resources, +template ComputeResources can be pre-defined at node group level for cluster +autoscaler to evaluate whether scaling up the node group can satisfy pending pod +resource requirements. + +### Phase 3: Support group resources +I.e., a ResourceClass can represent a group of resources. E.g., a gpu-super may include two gpu devices with high affinity, an infiniband-super may include a high-performance nic plus 1G memory, etc. From 93d0153dd1aec7a1d22ddd3fad1416b1d63054d8 Mon Sep 17 00:00:00 2001 From: vikaschoudhary16 Date: Tue, 10 Jul 2018 05:39:23 -0400 Subject: [PATCH 3/5] Capturing discussion/feedback so far --- keps/sig-node/00014-resource-api.md | 64 +++++++++++++---------------- 1 file changed, 29 insertions(+), 35 deletions(-) diff --git a/keps/sig-node/00014-resource-api.md b/keps/sig-node/00014-resource-api.md index 679c999e537..28eed04e144 100644 --- a/keps/sig-node/00014-resource-api.md +++ b/keps/sig-node/00014-resource-api.md @@ -15,6 +15,7 @@ reviewers: - "@vishh" approvers: - "@sig-node-leads" + - "@sig-scheduling-leads" editor: "@vikaschoudhary16" creation-date: "2018-06-14" last-updated: "2018-06-14" @@ -51,10 +52,13 @@ The device plugin support added in Kubernetes 1.8 makes it easy for vendors to d ### As a cluster operator: - Nodes in my cluster has GPU HW from different generations. I want to classify GPU nodes into one of the three categories, silver, gold and platinum depending upon the launch timeline of the GPU family eg: Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.
**Motivation:** As time progresses in a cluster lifecycle, new advanced, high-performance, expensive variants of GPUs get added to the cluster nodes. At the same time, older variants continue to co-exist. There are workloads that strictly want the latest GPUs, and there are also workloads that are fine with older GPUs. But since there is a wide range of types, it would be hard to manage, and confusing at the same time, to have granularity at each GPU type. Grouping into a few broad categories is more convenient to manage.<br>
-**Can this be solved without resource classes:** A unique taint can be used to represent a category like silver. Nodes can be tainted accordingly, depending upon the type of GPUs available. User pods can use tolerations to steer workloads to the appropriate nodes. But the problem is: how do we restrict a user pod from using a toleration that it should not be using?<br>
+**Can this be solved without resource classes:** A unique taint can be used to represent a category like silver. Nodes can be tainted accordingly, depending upon the type of GPUs available. User pods can use tolerations to steer workloads to the appropriate nodes. Now there are two problems: first, access control on tolerations, and second, a mechanism to apply quota to these resources. Though a new feature, [Pod Scheduling Policy][], is under design that can address the first problem of access control on tolerations, there is no solution for the second problem, i.e., quota control.<br>
+
+[Pod Scheduling Policy]: https://github.com/kubernetes/community/pull/1937
+
 **How Resource classes can solve this:** I, the operator/admin, create three resource classes: GPU-Platinum, GPU-Gold, GPU-Silver. Now, since resource classes are quota controlled, end users will be able to request resource classes only if quota is allocated.

-- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and shares a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.<br>
+- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.
**Motivation:** Increased performance because of local access. Local access also helps make better use of the cache.<br>
**How Resource classes can solve this:** The property/attribute that forms the grouping can be advertised in the device attributes, and a resource class can then be created to form a grouped super-resource based on that property.<br>
**Can this be solved without resource classes:** No @@ -64,10 +68,11 @@ The device plugin support added in Kubernetes 1.8 makes it easy for vendors to d **How Resource classes can solve this:** Quota will be supported on resource class objects and by allowing resource request in user pods via resource class, charging policy can be linked with resource consumption.
**Can this be solved without resource classes:** No

-- In my cluster, I have many different classes (different capabilities) of a device type (ex: NICs). End user’s expectations are met as long as the device has a very small subset of these capabilities. I want a mechanism where the end user can request devices that satisfy their minimum expectation.
+- In my cluster, I have many different classes (different capabilities shown as different resource attributes) of a device type (ex: NICs). End user’s expectations are met as long as the device has a very small subset of these capabilities. I want a mechanism where the end user can request devices that satisfy their minimum expectation.
A few nodes are connected to the data network over 40 Gig NICs and others are connected over normal 1 Gig NICs. I want end-user pods to be able to request
data network connectivity with high network performance while
in the default case, data network connectivity is offered via normal 1 Gbps NICs.<br>
+Another example is FPGA NICs with different capabilities: some might have embedded SDN control-plane functionality, while others might have embedded crypto logic. One workload may want a subset of these FPGA functionalities, advertised as resource attributes.<br>
**Motivation:** If some workloads demand higher network bandwidth, it should be possible to run these workloads on selected nodes.
**Can this be solved without resource classes:** Taints and tolerations can help in steering pods, but the problem is that there is no way today to have access control over the use of tolerations; therefore, when there are multiple users, it is not possible to control which tolerations are allowed.<br>
**How Resource classes can solve this:** I can define a ResourceClass for the high-performance NIC with minimum bandwidth requirements, and makes sure only users with proper quota can use such resources. @@ -125,7 +130,7 @@ Vendors can use DevicePlugin API to propagate new hardware features, and provide ## Objectives Essentially, a ResourceClass object maps a non-native compute resource with a specific set of properties to a portable name. Cluster admins can create different ResourceClass objects with the same generic name on different clusters to match the underlying hardware configuration. Users can then use the portable names to consume the matching compute resources. Through this extra abstraction layer, we are hoping to achieve the following goals: - **Allows workloads to request compute resources with wide range of properties in simple and standard way.** We propose to introduce a new `ComputeResource` API and a field, `ComputeResources` in the `Node.Status` to store a list of `ComputeResource` objects. Kubelet can use a `ComputeResource` object to encapsulate the resource metadata information associated with its underlying physical compute resources and propagate this information to the scheduler by appending it to the `ComputeResources` list in the `Node.Status`. With the resource metadata information, the Kubernetes scheduler can determine the fitness of a node for a container resource request expressed through ResourceClass name by evaluating whether the node has enough unallocated units of a ComputeResource matching the property constraints specified in the ResourceClass. This allows the Kubernetes scheduler to take resource property into account to schedule pods on the right node whose hardware configuration meets the specific resource metadata constraints. -- **Allows cluster admins to configure and manage different kinds of non-native compute resources in flexible and simple ways.** A cluster admin creates a ResourceClass object that specifies a portable ResourceClass name (e.g., `fast-nic-gold`), and list of property matching constraints (e.g., `resourceName in (solarflare.com/fast-nic, intel.com/fast-nic)`, or `type=XtremeScale-8000`, or `bandwidth=100G`, or `zone in (us-west1-b, us-west1-c)`). The property matching constraints follow the generic LabelSelector format, which allows us to cover a wide range of resource specific properties. The cluster admin can then define and manage resource quota with the created ResourceClass object. +- **Allows cluster admins to configure and manage different kinds of non-native compute resources in flexible and simple ways.** A cluster admin creates a ResourceClass object that specifies a portable ResourceClass name (e.g., `fast-nic-gold`), and list of property matching constraints (e.g., `resourceName in (solarflare.com/fast-nic, intel.com/fast-nic)`, or `type=XtremeScale-8000`, or `bandwidth=100). The property matching constraints follow the generic LabelSelector format, which allows us to cover a wide range of resource specific properties. The cluster admin can then define and manage resource quota with the created ResourceClass object. - **Allows vendors to export their resources on Kubernetes more easily.** The device plugin that a vendor needs to implement to make their resource available on Kubernetes lives outside Kubernetes core repository. The device plugin API will be extended to pass device properties from device plugin to Kubelet. 
Kubelet will propagate this information to the scheduler through ComputeResource and the scheduler will match a ComputeResource with certain properties to the best matching ResourceClass, and support resource requests expressed through ResourceClass name. The device plugin only needs to retrieve device properties through some device-specific API or tool, without needing to watch or understand either ComputeResource objects or ResourceClass objects. - **Provides a unified interface to interpret compute resources across various system components such as Quota and Container Resource Spec.** By introducing ResourceClass as a first-class API object, we provide a built-in solution for users to define their special resource constraints through this API, to request such resources through the existing Container Resource Spec, to limit access for such resources through the existing resource Quota component and to ensure their Pods land on the nodes with the matching physical resources through the default Kubernetes scheduling. - **Supports node-level resources as well as cluster-level resources.** Certain types of compute resources are tied to single nodes and are only accessible locally on those nodes. On the other hand, some types of compute resources such as network attached resources, can be dynamically bound to a chosen node that doesn’t have the resource available till the binding finishes. For such resources, the metadata constraints specified through ResourceClass can be consumed by a standalone controller or a scheduler extender during their resource provisioning or scheduler filtering so that the resource can be provisioned properly to meet the specified metadata constraints. @@ -153,6 +158,8 @@ type ResourceClass struct { } type ResourceClassSpec struct { + // raw resource name. E.g.: nvidia.com/gpu + ResourceName string // defines general resource property matching constraints. // e.g.: zone in { us-west1-b, us-west1-c }; type: k80 MetadataRequirements metav1.LabelSelector @@ -168,16 +175,13 @@ kind: ResourceClass metadata: name: nvidia.high.mem spec: + resourceName: "nvidia.com/nvidia-gpu" labelSelector: - matchExpressions: - - key: "Kind" - operator: "In" - values: - - "nvidia-gpu" - key: "memory" operator: "GtEq" values: - - "30G" + - "15G" kind: Pod metadata: @@ -199,12 +203,9 @@ kind: ResourceClass metadata: name: fast.nic spec: + resourceName: "sfc.com/smartNIC" labelSelector: - matchExpressions: - - key: "Kind" - operator: "In" - values: - - "nic" - key: "speed" operator: "GtEq" values: @@ -226,18 +227,21 @@ Above resource class will select all the NICs with speed greater than equal to ### Kubelet Extension On node level, extended resources can be exported automatically by a third-party plugin through the Device Plugin API. We propose to extend the current Device Plugin API that allows device plugins to send to the Kubelet per-device properties during device listing. Exporting device properties at per-device level instead of per-resource level allows a device plugin to manage devices with heterogeneous properties. -After receiving device availability and property information from a device plugin, Kubelet needs to propagate this information to the scheduler so that scheduler can take resource metadata into account when it is making the scheduling decision. We propose to add a new `ComputeResources` array field in NodeStatus API to represent a list of the `ComputeResource` instances where each represent a device resource and the associated resource properties. 
Once a node is configured to support ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields to avoid resource multiple counting. During the initial phase, we plan to start with exporting extended resources through the ComputeResource API but leaves primary resources in its current exporting model. We can extend the ComputeResource model to support primary resources later after getting more experience through the initial phase. Kubelet will update ComputeResources field upon any resource availability or property change for node-level resources. +After receiving device availability and property information from a device plugin, Kubelet needs to propagate this information to the scheduler so that scheduler can take resource metadata into account when it is making the scheduling decision. We propose to add a new `ComputeResourceCapacity` and `ComputeResourceAllocatable`array fields in NodeStatus API. Each will represent a list of the `ComputeResource` instances where each instance in the list represents a device resource and the associated resource properties. Once a node is configured to support ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields to avoid resource multiple counting. During the initial phase, we plan to start with exporting extended resources through the ComputeResource API but leaves primary resources in its current exporting model. We can extend the ComputeResource model to support primary resources later after getting more experience through the initial phase. Kubelet will update ComputeResources field upon any resource availability or property change for node-level resources. We propose to start with the following struct definition: ```golang type NodeStatus struct { … - ComputeResources []ComputeResource + ComputeResourceCapacity []ComputeResource + ComputeResourceAllocatable []ComputeResource … } type ComputeResource struct { - // raw resource name. E.g.: nvidia.com/gpu + // unique and deterministically generated. “nodeName-resourceName-propertyHash” naming convention, where propertyHash is generated by calculating a hash over all resource properties + Name string + // raw resource name. E.g.: nvidia.com/nvidia-gpu ResourceName string // resource metadata received from device plugin. // e.g., gpuType: k80, zone: us-west1-b @@ -247,6 +251,10 @@ type ComputeResource struct { Devices []string } ``` +The ComputeResource name needs to be unique and deterministically generated. We propose to use “nodeName-resourceName-propertyHash” naming convention for node level resources, where propertyHash is generated by calculating a hash over all resource properties. The properties of a physical resource may change, e.g., through a driver or node upgrade. When this happens, Kubelet will need to create a new ComputeResource object and delete the old one. However, some pods may already be scheduled on the node and allocated with the old ComputeResource, and it is hard for the scheduler to do its bookkeeping when such pods are still running on the node while the associated ComputeResource is removed. We have a few options to handle this situation. +- First, we may document that to change or remove a device resource, the node has to be drained. This requirement however will complicate device plugin upgrade process. 
+- Another option is that Kubelet can evict the pods that are allocated with an unexisting ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade. +- To support a less disruptive model, upon resource property change, Kubelet can still export capacity at old ComputeResource name for the devices used by active pods, and exports capacity at new matching ComputeResource name for devices not in use. Only when those pods finish running, that particular node finishes its transition. This approach avoids resource multiple counting and simplifies the scheduler resource accounting. One potential downside is that the transition may take quite long process if there are long running pods using the resource on the nodes. In that case, cluster admins can still drain the node at convenient time to speed up the transition. Note that this approach does add certain code complexity on Kubelet DeviceManager component. Possible fields we may consider to add later include: - `DeviceUnits resource.Quantity`. This field can be used to support fractional @@ -268,30 +276,20 @@ The scheduler needs to watch for NodeStatus ComputeResources field changes and R A natural question is how we should define the matching behavior. Suppose there are two ResourceClass objects. ResourceClass RC1 has metadata matching constraint “property1 = value1”, and ResourceClass RC2 has metadata matching constraint “property2 = value2”. Suppose a ComputeResource has both “property1: value1” and “property2: value2” properties, and so match both ResourceClasses. Now should the scheduler consider this ComputeResource as qualified for both ResourceClasses RC1 and RC2, or only one of them? -We feel the desired answer to this question may vary across different types of resources, properties and use cases. To illustrate this, lets consider the following example: A GPU device plugin in a cluster with different types of GPUs may be configured to export a single property, "Type", at the beginning. To support per GPU type resource quota, cluster admins may define the following ResourceClasses: +We feel the desired answer to this question may vary across different types of resources, properties and use cases. To illustrate this, lets consider the following example: A GPU device plugin in a cluster with different types of GPUs may be configured to advertise under a common ResourceName as "nvidia.com/nvidia-tesla-k80", at the beginning. To support per GPU type resource quota, cluster admins may define the following ResourceClasses: ```yaml kind: ResourceClass metadata: name: nvidia-k80 spec: - labelSelector: - - matchExpressions: - - key: "Type" - operator: "Eq" - values: - - "nvidia-tesla-k80" + resourceName: "nvidia.com/nvidia-tesla-k80" kind: ResourceClass metadata: name: nvidia-p100 spec: - labelSelector: - - matchExpressions: - - key: "Type" - operator: "Eq" - values: - - "nvidia-tesla-p100" + resourceName: "nvidia.com/nvidia-tesla-p100" ``` Later on, suppose the cluster admins add a new GPU node group with a new version of GPU device plugin that exports another resource property "Nvlink" which will be set true for nvidia-tesla-p100 GPUs connected through nvlinks. 
To utilize this new feature, the cluster admins define the following new ResourceClass with nvlink constraints: @@ -301,12 +299,8 @@ kind: ResourceClass metadata: name: nvidia-p100-nvlink spec: + resourceName: "nvidia.com/nvidia-tesla-p100" labelSelector: - - matchExpressions: - - key: "Type" - operator: "Eq" - values: - - "nvidia-tesla-p100" - key: "Nvlink" operator: "Eq" values: @@ -330,7 +324,7 @@ By default, the value of this field is set to zero, but cluster admins can set it to a higher value, which would prevent its matching compute resources from being matched by lower priority ResourceClasses. i.e., when a ComputeResource matches multiple ResourceClasses with different Priority values, the scheduler will choose those with the highest Priority. -Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through raw resource name can continue to be scheduled properly when administrators add ResourceClass in a cluster. To guarantee this, the scheduler may just consider raw resource as a special ResourceClass with empty resource metadata constraints. +Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through raw resource name can continue to be scheduled properly when administrators add ResourceClass in a cluster. To guarantee this, the scheduler may just consider raw resource as a special ResourceClass with empty resource metadata constraints and priority higher than any resource class. Because a ComputeResource can match multiple ResourceClasses, Scheduler and Kubelet need to ensure a consistent view on ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision. This way, when another Pod comes in that requests RC2, the scheduler can determine whether Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2. 
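As an illustration of the matching and priority semantics described above, here is a minimal, hypothetical Go sketch. The `ResourceClass`, `ComputeResource`, and helper functions below are simplified stand-ins (plain property maps and equality-only constraints) rather than the real API types or LabelSelector evaluation.

```golang
// Hypothetical, simplified stand-ins for the proposed API objects.
type ResourceClass struct {
	Name     string
	Priority int
	// Simplified property constraints (e.g. {"Nvlink": "true"}), standing in
	// for the LabelSelector-based metadata requirements described above.
	Constraints map[string]string
}

type ComputeResource struct {
	Name       string
	Properties map[string]string
}

// matches reports whether every constraint of the ResourceClass is satisfied
// by the ComputeResource properties (equality only, for illustration).
func matches(rc ResourceClass, cr ComputeResource) bool {
	for k, v := range rc.Constraints {
		if cr.Properties[k] != v {
			return false
		}
	}
	return true
}

// matchingClasses returns the ResourceClasses that a ComputeResource should be
// counted against: all matching classes that share the highest Priority, so a
// higher-priority class shields its resources from lower-priority classes.
func matchingClasses(classes []ResourceClass, cr ComputeResource) []ResourceClass {
	best := -1
	var result []ResourceClass
	for _, rc := range classes {
		if !matches(rc, cr) {
			continue
		}
		switch {
		case rc.Priority > best:
			best = rc.Priority
			result = []ResourceClass{rc}
		case rc.Priority == best:
			result = append(result, rc)
		}
	}
	return result
}
```

Under this simplification, a raw resource name can be modeled as a ResourceClass with empty constraints and a priority above any admin-defined class, which mirrors the migration behavior described above.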
From 3cd8f87115b7a32dc8c2c5eeea09847657f73c50 Mon Sep 17 00:00:00 2001 From: vikaschoudhary16 Date: Fri, 13 Jul 2018 13:53:12 -0400 Subject: [PATCH 4/5] Some more updates on user stories --- keps/sig-node/00014-resource-api.md | 147 +++++++++++----------------- 1 file changed, 57 insertions(+), 90 deletions(-) diff --git a/keps/sig-node/00014-resource-api.md b/keps/sig-node/00014-resource-api.md index 28eed04e144..33d45db4349 100644 --- a/keps/sig-node/00014-resource-api.md +++ b/keps/sig-node/00014-resource-api.md @@ -28,9 +28,6 @@ Table of Contents * [Abstract](#abstract) * [Background](#background) * [Use Stories](#user-stories) - * [As a cluster operator](#as-a-cluster-operator) - * [As a developer](#as-a-developer) - * [As a vendor](#as-a-vendor) * [Objectives](#objectives) * [Non Objectives](#non-objectives) * [Components](#components) @@ -49,88 +46,42 @@ We are seeing increasing needs to better support non-native compute resources on The device plugin support added in Kubernetes 1.8 makes it easy for vendors to dynamically export their resources through a plugin API without changing Kubernetes core code. Taking a further step, this document proposes a new resource abstraction API, ResourceClass, that can be used to describe, manage and consume such vendor specific and metadata rich resources in simple and portable ways. ## Use Stories -### As a cluster operator: -- Nodes in my cluster has GPU HW from different generations. I want to classify GPU nodes into one of the three categories, silver, gold and platinum depending upon the launch timeline of the GPU family eg: Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.
-**Motivation:** As time progresses in a cluster lifecycle, new advanced, high performance, expensive variants of GPUs gets added to the cluster nodes. At the same time older variants also co-exist. There are workloads which strictly wants latest GPUs and also there are workloads which are fine with older GPUs. But since there is a wide range of types, it will be hard to manage and confusing at the same time to have granularity at each GPU type. Grouping into few broad categories will be convenient to manage.
-**Can this be solved without resource classes:** A unique taint can be used to represent a category like silver. Nodes can be tainted accordingly depending upon the type of GPUs availability. User pods can use tolerations to steer workloads to the appropriate nodes. Now there are two problems. First, access control on tolerations and second, mechanism to quota resources. Though a new feature, [Pod Scheduling Policy][], is under design that can address first problem of access control on tolerations, there is no solution for second problem i.e quota control.
- -[Pod Scheduling Policy]: https://github.com/kubernetes/community/pull/1937 - -**How Resource classes can solve this:** I, operator/admin, creates three resource classes: GPU-Platinum, GPU-Gold, GPU-Silver. Now since resource classes are quota controlled, end-user will be able to request resource classes only if quota is allocated. - -- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.
-**Motivation:** Increased performance because of local access. Local access also helps better use of cache
-**How Resource classes can solve this:** Property/attribute which forms the grouping can be advertised in the device attributes and then a resource can be created to form a grouped super-resource based on that property.
-**Can this be solved without resource classes:** No - -- I want to have quota control on the devices at the granularity of device properties. For example, I want to have a separate quota for ECC enabled GPUs. I want a specific user to not let use more than ‘N’ number of ECC enabled GPUs overall at namespace level.
-**Motivation:** This will make it possible to charge user per specialized hw consumption. Since special HW is costly, as an Operator I want to have this capability.
-**How Resource classes can solve this:** Quota will be supported on resource class objects and by allowing resource request in user pods via resource class, charging policy can be linked with resource consumption.
-**Can this be solved without resource classes:** No - -- In my cluster, I have many different classes (different capabilities shown as different resource attributes) of a device type (ex: NICs). End user’s expectations are met as long as device has a very small subset of these capabilities. I want a mechanism where end user can request devices which satisfies their minimum expectation. -Few nodes are connected to data network over 40 Gig NICs and others are connected over normal 1 Gig NICs. I want end user pods to be able to request -data network connectivity with high network performance while -in default case, data network connectivity is offered via normal 1 Gbps NICs.
-Another example is FPGA NICs with different capabilities. For example, some might have embedded sdn control plane functionality, some might have embedded crypto logic. One workload may want a subset of these FPGA functionalities advertised as resource attributes.
-**Motivation:** If some workloads demand higher network bandwidth, it should be possible to run these workloads on selected nodes.
-**Can this be solved without resource classes:** Taints and tolerations can help in steering pods but the problem in that there is no way today to have access control over use of tolerations and therefore if multiple users are there, it is not possible to have control on allowed tolerations.
-**How Resource classes can solve this:** I can define a ResourceClass for the high-performance NIC with minimum bandwidth requirements, and makes sure only users with proper quota can use such resources. - -- I want to be able to utilize different 'types' of a HW resource (not necessarily from the same vendor) while not losing workload portability when moving from one cluster/cloud to another. There can be one type of Nvidia GPUs on one cluster and another type of Nvidia GPUs on another cluster. This is example of different ‘types’ of a HW resource(GPU). I want to offer GPUs to be consumed under a same portable name, as long as their capabilities are almost same. If pods are consuming these GPUs with a generic resource class name, workload can be migrated from one cluster to another transparently.
- -**Quoting Henry Saputra (From Ebay) for the record:**
->Currently we ask developers to submit resource specifications for GPU using name of the cards to our data center : -> ->"accelerator": { ->"type": "gpu", ->"quantity": "1", ->"labels": { ->"product": "nvidia", ->"family": "tesla", ->"model": "m40" ->} ->} -> ->But when we go to other cloud such as Google or AWS they may not have the same cards. -> ->So I was wondering if we could offer resource such as CUDA cores and memory as resource specifications rather actual name and type of the cards. -> - -**Motivation:** less downtime, optimal use of resources
-**How Resource classes can solve this:** Explained above
-**Can this be solved without resource classes:** No - -### As a developer: -- I want the ability to be able to request devices which have specific capabilities. Eg: GPUs that are Volta or newer.
- **Motivation:** I want minimum guaranteed compute performance
- **Can this be solved without resource classes:**
- - Yes, using node labels and NodeLabelSelectors. - Problem: Same problem of lack of access control on using labelselectors at user level as with the use of tolerations. - - OR, Instead of using resource class, provide flexibility to query resource properties directly in pod container resource requests. - Problem: In a large cluster, computing operators like “greater than”, “less than” at pod creation can be a very slow operation and is not scalable. - - **How Resource classes can solve this:** -The Kubernetes scheduler is the central place to map container resource requests expressed through ResourceClass names to the underlying qualified physical resources, which automatically supports metadata aware resource scheduling. - -- As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details. I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec. When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.
-**Motivation:** Separation of concerns between cluster operators and cluster users.
-**Can this be solved without resource classes:**
-Without the additional abstraction layer, consuming the non-standard, metadata-rich compute resources would be fragmented. More likely, we would see cluster providers implement their own solutions to address their user pains, and it would be hard to provide a consistent user experience for consuming extended resources in the future. - -### As a vendor: -- I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware.
-**Motivation:** enables more compute resources and their advanced features on Kubernetes
+- As a cluster operator, I manage different types of GPUs in my cluster. I want the workloads running in the cluster to be able to request and consume different types of GPUs easily, as long as they have enough quota. I want to assign different resource quota on different type of GPUs, eg: Kepler K20, K80, Pascal P40, P100, Volta V100, as they can have very different performance, price, and available units.
+**Motivation:** Empower enterprise customers to consume and manage non-primary resources easily, similar to how they consume and manage primary resources today.
+**Can this be solved without resource classes:** Without ResourceClass, people would rely on `NodeLabels`, `NodeAffinity`, `Taints`, and `Tolerations` to steer workloads to the appropriate nodes, or build their own [non-upstream solutions](https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576) to allow users to specify their resource specific metadata requirements. Workloads would have different experience on consuming non-primary compute resources on k8s. As time goes and more non-upstream solutions were deployed, user experience becomes fragmented across different environments. Furthermore, `NodeLabels` and `Taints` were designed as node level properties. They can't support multiple types of compute resources on a single node, and don't integrate well with resource quota. Even with the recent [Pod Scheduling Policy proposal](https://github.com/kubernetes/community/pull/1937), cluster admins can either allow or deny pods in a namespace to specify a `NodeAffinity` or `Toleration`, but cannot assign different quota to different namespaces.
+**How Resource classes can solve this:** I, operator/admin, create different ResourceClasses for different types of GPUs. User workloads can request different types of GPUs in their `ContainerSpec` resource requests/limits through the corresponding ResourceClass name, in the same way as they request primary resources. Now since resource classes are quota controlled, end-user will be able to consume the requested GPUs only if they have enough quota.
+**Similar use case for network devices:** A cluster can have different types of high-performance NICs and/or infiniband cards, with different performance and cost. E.g., some nodes may have 40 Gig high-performance NICs and some may have 10 Gig high-performance NICs. Some devices may support RDMA and some may not. Different workloads may desire to use different type of high-network access devices depending on their performance and cost tradeoff.
+**Similar use case for FPGA:** A cluster can have different FPGA cards programmed with different functions or supports different functionalities. For example, some might have embedded SDN control plane functionality and some might have embedded crypto logic. One workload may want a subset of these FPGA functionalities advertised as resource attributes.
+ +- As a cluster operator, nodes in my cluster have GPU HW from different generations. I want to classify GPU nodes into one of the three categories, silver, gold and platinum depending upon the launch timeline of the GPU family eg: Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.
+**Motivation:** As time progresses in a cluster lifecycle, new advanced, high-performance, expensive variants of GPUs get added to the cluster nodes, while older variants continue to co-exist. Some workloads strictly want the latest GPUs, while others are fine with older ones. Because there is such a wide range of types, managing them at the granularity of each GPU type would be hard and confusing. Grouping them into a few broad categories is more convenient to manage.<br/>
+**Can this be solved without resource classes:** A unique taint can be used to represent a category like silver. Nodes can be tainted accordingly depending upon the type of GPUs availability. User pods can use tolerations to steer workloads to the appropriate nodes. Now there are two problems. First, access control on tolerations and second, mechanism to quota resources. Though a new feature, [Pod Scheduling Policy](https://github.com/kubernetes/community/pull/1937), is under design that can address first problem of access control on tolerations, there is no solution for second problem i.e quota control.
+**How Resource classes can solve this:** I, the operator/admin, create three resource classes: GPU-Platinum, GPU-Gold, and GPU-Silver. Since resource classes are quota controlled, end users will be able to request a resource class only if quota has been allocated for it.<br/>
+**Similar use case for network devices:** A cluster may have different classes of a network device (different capabilities exposed as different resource attributes). End users' expectations are met as long as a device has a very small subset of these capabilities, so cluster operators want a mechanism where end users can request devices that satisfy their minimum expectations.<br/>
+**Similar use case for FPGA:** Some FPGA devices might have embedded SDN control plane functionality, some might have embedded crypto logic. One workload may want a subset of these FPGA functionalities advertised as resource attributes.
+ +- As an user, I want to be able to utilize different 'types' of a HW resource (may be from the same vendor) while not losing workload portability when moving from one cluster/cloud to another. There can be one type of High-performance NIC on one cluster and another type of high-performance NIC on another cluster. I want to offer high-performance NICs to be consumed under a same portable name, as long as their capabilities are almost same. If pods are consuming these high-performance NICs with a generic resource class name, workload can be migrated from one cluster to another transparently.
+**Motivation:** Promotes workload portability and reduces downtime.<br/>
+**How Resource classes can solve this:** The user can create different resource classes in different environments to match the underlying hardware configurations, but with the same ResourceClass name. This allows workloads to migrate around different environments without changing their workload specs.
+**Can this be solved without resource classes:** No.
+ +- As a vendor, I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware. I want to provide my users some example best-practice configuration specs, with which their applications can use my resource more efficiently.
+**Motivation:** Enables more compute resources and their advanced features on Kubernetes
**Can this be solved without resource classes:**
Yes, using node labels and NodeLabelSelectors.<br/>
Problem: Lack of access control and lack of the ability to differentiate between hardware properties on the same node. E.g., if on the same node, some GPU devices are connected through nvlink while others are connected through PCI-e, vendors don’t have ways to export such resource properties that can have very different performance impacts.
**How Resource classes can solve this:**
-Vendors can use DevicePlugin API to propagate new hardware features, and provide best-practice ResourceClass spec to consume their new hardware or new hardware features on Kubernetes. Vendors don’t need to worry supporting this new hardware would break existing use cases on old hardware because the Kubernetes scheduler takes the resource metadata into account during pod scheduling, and so only pods that explicitly request this new hardware through the corresponding ResourceClass name will be allocated with such resources. +Vendors can use DevicePlugin API to propagate new hardware features, and provide best-practice ResourceClass spec to consume their new hardware or new hardware features on Kubernetes. Vendors don’t need to worry supporting this new hardware would break existing use cases on old hardware because the Kubernetes scheduler takes the resource metadata into account during pod scheduling, and so only pods that explicitly request this new hardware through the corresponding ResourceClass name will be allocated with such resources.
+ +- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.
+**Motivation:** Provides an infrastructure building block to allow more flexible resource scheduling, through which people can get more optimal use of resources.
+**How Resource classes can solve this:** Property/attribute which forms the grouping can be advertised in the device attributes and then a resource can be created to form a grouped super-resource based on that property.
+**Can this be solved without resource classes:** No ## Objectives Essentially, a ResourceClass object maps a non-native compute resource with a specific set of properties to a portable name. Cluster admins can create different ResourceClass objects with the same generic name on different clusters to match the underlying hardware configuration. Users can then use the portable names to consume the matching compute resources. Through this extra abstraction layer, we are hoping to achieve the following goals: -- **Allows workloads to request compute resources with wide range of properties in simple and standard way.** We propose to introduce a new `ComputeResource` API and a field, `ComputeResources` in the `Node.Status` to store a list of `ComputeResource` objects. Kubelet can use a `ComputeResource` object to encapsulate the resource metadata information associated with its underlying physical compute resources and propagate this information to the scheduler by appending it to the `ComputeResources` list in the `Node.Status`. With the resource metadata information, the Kubernetes scheduler can determine the fitness of a node for a container resource request expressed through ResourceClass name by evaluating whether the node has enough unallocated units of a ComputeResource matching the property constraints specified in the ResourceClass. This allows the Kubernetes scheduler to take resource property into account to schedule pods on the right node whose hardware configuration meets the specific resource metadata constraints. -- **Allows cluster admins to configure and manage different kinds of non-native compute resources in flexible and simple ways.** A cluster admin creates a ResourceClass object that specifies a portable ResourceClass name (e.g., `fast-nic-gold`), and list of property matching constraints (e.g., `resourceName in (solarflare.com/fast-nic, intel.com/fast-nic)`, or `type=XtremeScale-8000`, or `bandwidth=100). The property matching constraints follow the generic LabelSelector format, which allows us to cover a wide range of resource specific properties. The cluster admin can then define and manage resource quota with the created ResourceClass object. +- **Allows workloads to request compute resources with wide range of properties in simple and standard way.** We propose to introduce a new `ComputeResource` API field in `Node.Status` to store a list of `ComputeResource` objects. Kubelet can encapsulate the resource metadata information associated with its underlying physical compute resources and propagate this information to the scheduler by appending it to the `ComputeResource` list in the `Node.Status`. With the resource metadata information, the Kubernetes scheduler can determine the fitness of a node for a container resource request expressed through ResourceClass name by evaluating whether the node has enough unallocated ComputeResource matching the property constraints specified in the ResourceClass. This allows the Kubernetes scheduler to take resource property into account to schedule pods on the right node whose hardware configuration meets the specific resource metadata constraints. 
+- **Allows cluster admins to configure and manage different kinds of non-native compute resources in flexible and simple ways.** A cluster admin creates a ResourceClass object that specifies a portable ResourceClass name (e.g., `fast-nic-gold`), and list of property matching constraints (e.g., `resourceName in (solarflare.com/fast-nic intel.com/fast-nic)`, or `type=XtremeScale-8000`, or `bandwidth=100`). The property matching constraints follow the generic LabelSelector format, which allows us to cover a wide range of resource specific properties. The cluster admin can then define and manage resource quota with the created ResourceClass object. - **Allows vendors to export their resources on Kubernetes more easily.** The device plugin that a vendor needs to implement to make their resource available on Kubernetes lives outside Kubernetes core repository. The device plugin API will be extended to pass device properties from device plugin to Kubelet. Kubelet will propagate this information to the scheduler through ComputeResource and the scheduler will match a ComputeResource with certain properties to the best matching ResourceClass, and support resource requests expressed through ResourceClass name. The device plugin only needs to retrieve device properties through some device-specific API or tool, without needing to watch or understand either ComputeResource objects or ResourceClass objects. - **Provides a unified interface to interpret compute resources across various system components such as Quota and Container Resource Spec.** By introducing ResourceClass as a first-class API object, we provide a built-in solution for users to define their special resource constraints through this API, to request such resources through the existing Container Resource Spec, to limit access for such resources through the existing resource Quota component and to ensure their Pods land on the nodes with the matching physical resources through the default Kubernetes scheduling. - **Supports node-level resources as well as cluster-level resources.** Certain types of compute resources are tied to single nodes and are only accessible locally on those nodes. On the other hand, some types of compute resources such as network attached resources, can be dynamically bound to a chosen node that doesn’t have the resource available till the binding finishes. For such resources, the metadata constraints specified through ResourceClass can be consumed by a standalone controller or a scheduler extender during their resource provisioning or scheduler filtering so that the resource can be provisioned properly to meet the specified metadata constraints. @@ -138,7 +89,7 @@ Essentially, a ResourceClass object maps a non-native compute resource with a sp - **Defines an easy and seamless migration path** for clusters to adopt ResourceClass even if they have existing pods requesting compute resources through raw resource name. In particular, suppose a cluster is running some workloads that have non native compute resources, such as `nvidia.com/gpu`, in their pod resource requirements. Such workloads should still be scheduled properly when the cluster admin creates a ResourceClass that matches the gpu resources in the cluster. Furthermore, we want to support the upgrade scenario that new resource properties can be added for a resource, e.g., through device plugin upgrade and cluster admins may define a new ResourceClass based on the newly exported resource properties without breaking the use of old ResourceClasses. 
## Non Objectives -- Extends the current resource requirement API of the container spec. The current resource requirement API is basically a “name:value” list. A commonly arising question is whether we should extend this API to support resource metadata requirements. We can consider this as a possible extension orthogonal to the ResourceClass proposal. A primary reason we propose to introduce ResourceClass API first is because non-native compute resources usually lack standard resource properties. Although there are benefits to allow users to directly express their resource metadata requirements in their container spec, it may also compromise workload portability if not used carefully. It is also hard to implment resource quota when users directly declare resource metadata requirements in their Container spec. By introducing ResourceClass as an additional resource abstraction layer, users can express their special resource requirements through a high-level portable name, and cluster admins can configure compute resources properly on different environments to meet such requirements. We feel this helps promote portability and separation of concerns, while still maintains API compatibility. +- Extends the current resource requirement API of the container spec. The current resource requirement API is basically a “name:value” list. A commonly arising question is whether we should extend this API to support resource metadata requirements. We decide to not include this API change in this proposal for the following reasons. First, in a large cluster, computing operators like “greater than”, “less than” during pod scheduling can be a very slow operation and is not scalable. It can cause scaling issues on the scheduler side. Second, non-primary compute resources usually lack standard resource properties. Although there are benefits to allow users to directly express their resource metadata requirements in their container spec, it may also compromise workload portability in longer term. Third, resource quota control will become harder. That is because the current quota admission handler is very simple and just watches for Pod updates and does simple resource request counting to see whether its resource requests in a given namespace is beyond any specified limit or not. If we add resource property selector directly in ContainerSpec (Nvidia non-upstream approach), we will need to extend the current resource quota spec and the quota admission handler quite a lot. It wil also be quite tricky to make sure all pod resource requests are properly quota controlled with the multiple matching behavior allowed by the use of resource property selectors. Fourth, we may consider the resource requirement API change as a possible extension orthogonal to the ResourceClass proposal. By introducing ResourceClass as an additional resource abstraction layer, users can express their special resource requirements through a high-level portable name, and cluster admins can configure compute resources properly on different environments to meet such requirements. We feel this helps promote portability and separation of concerns, while still maintains API compatibility. - Unifies with the StorageClass API. Although ResourceClass shares many similar motivations and requirements as the existing StorageClass API, they focus on different kinds of resources. StorageClass is used to represent storage resources that are stateful and contains special storage semantics. 
ResourceClass, on the other hand, focuses on stateless compute resources, whose usage is bound to container lifecycle and can’t be shared across multiple nodes at the same time. For these reasons, we don’t plan to unify the two APIs. - Resource overcommitment, fractional resource requirements, native compute resource (i.e., cpu and memory) with special metadata requirements, and group compute resources. They are out of our current scope. @@ -224,22 +175,30 @@ spec: Above resource class will select all the NICs with speed greater than equal to 40 GBPS. +Possible fields we may consider to add later include: +- `AutoProvisionConfig`. This field can be used to specify resource auto provisioning config in different cloud environments. +- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components. +- `ResourceRequestParameters`. This field can be used to indicate special resource request prameters that device plugins may need to perform special configurations on their devices to be consumed by workload pods requesting this resource. + +Note we intentially leave these fields out of the initial design to limit the scope +of this proposal. + ### Kubelet Extension On node level, extended resources can be exported automatically by a third-party plugin through the Device Plugin API. We propose to extend the current Device Plugin API that allows device plugins to send to the Kubelet per-device properties during device listing. Exporting device properties at per-device level instead of per-resource level allows a device plugin to manage devices with heterogeneous properties. -After receiving device availability and property information from a device plugin, Kubelet needs to propagate this information to the scheduler so that scheduler can take resource metadata into account when it is making the scheduling decision. We propose to add a new `ComputeResourceCapacity` and `ComputeResourceAllocatable`array fields in NodeStatus API. Each will represent a list of the `ComputeResource` instances where each instance in the list represents a device resource and the associated resource properties. Once a node is configured to support ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields to avoid resource multiple counting. During the initial phase, we plan to start with exporting extended resources through the ComputeResource API but leaves primary resources in its current exporting model. We can extend the ComputeResource model to support primary resources later after getting more experience through the initial phase. Kubelet will update ComputeResources field upon any resource availability or property change for node-level resources. +After receiving device availability and property information from a device plugin, Kubelet needs to propagate this information to the scheduler so that scheduler can take resource metadata into account when it is making the scheduling decision. We propose to add a new `ComputeResource` list field in NodeStatus API, where each instance in the list represents a device resource and the associated resource properties. 
Once a node is configured to support ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields to avoid resource multiple counting. During the initial phase, we plan to start with exporting extended resources through the ComputeResource API but leaves primary resources in its current exporting model. We can extend the ComputeResource model to support primary resources later after getting more experience through the initial phase. Kubelet will update ComputeResources field upon any resource availability or property change for node-level resources. We propose to start with the following struct definition: ```golang type NodeStatus struct { … - ComputeResourceCapacity []ComputeResource - ComputeResourceAllocatable []ComputeResource + ComputeResources []ComputeResource … } type ComputeResource struct { - // unique and deterministically generated. “nodeName-resourceName-propertyHash” naming convention, where propertyHash is generated by calculating a hash over all resource properties + // unique and deterministically generated. “nodeName-resourceName-propertyHash” naming convention, + // where propertyHash is generated by calculating a hash over all resource properties Name string // raw resource name. E.g.: nvidia.com/nvidia-gpu ResourceName string @@ -249,10 +208,12 @@ type ComputeResource struct { // list of deviceIds received from device plugin. // e.g., ["nvida0", "nvidia1"] Devices []string + // similar to the above but only contains allocatable devices. + AllocatableDevices []string } ``` The ComputeResource name needs to be unique and deterministically generated. We propose to use “nodeName-resourceName-propertyHash” naming convention for node level resources, where propertyHash is generated by calculating a hash over all resource properties. The properties of a physical resource may change, e.g., through a driver or node upgrade. When this happens, Kubelet will need to create a new ComputeResource object and delete the old one. However, some pods may already be scheduled on the node and allocated with the old ComputeResource, and it is hard for the scheduler to do its bookkeeping when such pods are still running on the node while the associated ComputeResource is removed. We have a few options to handle this situation. -- First, we may document that to change or remove a device resource, the node has to be drained. This requirement however will complicate device plugin upgrade process. +- First, we may document that to change or remove a device resource, the node has to be drained. This requirement however may complicate device plugin upgrade process. - Another option is that Kubelet can evict the pods that are allocated with an unexisting ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade. - To support a less disruptive model, upon resource property change, Kubelet can still export capacity at old ComputeResource name for the devices used by active pods, and exports capacity at new matching ComputeResource name for devices not in use. Only when those pods finish running, that particular node finishes its transition. This approach avoids resource multiple counting and simplifies the scheduler resource accounting. One potential downside is that the transition may take quite long process if there are long running pods using the resource on the nodes. 
In that case, cluster admins can still drain the node at convenient time to speed up the transition. Note that this approach does add certain code complexity on Kubelet DeviceManager component. @@ -328,14 +289,20 @@ Supporting multiple ResourceClass matching also makes it easy to ensure that exi Because a ComputeResource can match multiple ResourceClasses, Scheduler and Kubelet need to ensure a consistent view on ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision. This way, when another Pod comes in that requests RC2, the scheduler can determine whether Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2. -To maintain and propagate ResourceClass to ComputeResource binding information, the scheduler will need to record this information in a newly introduced ContainerSpec field, similar to the existing NodeName field, and Kubelet will need to consume this information. During the initial implementation, we propose to encode the ResourceClass to the underlying compute resource binding information in a new `AllocatedDeviceIDs map[v1.ResourceName][]types.UID` field in ContainerSpec. Adding this field has been discussed as a possible solution to support other use cases, such as third-party resource monitoring and network device plugins. For the purpose to support ResourceClass, we will extend the scheduler NodeInfo cache to store ResourceClass to the matching ComputeResource information on the node. For a given ComputeResource, its capacity will be reflected in NodeInfo.allocatableResource with all matching ResourceClass names. This way, the current node resource fitness evaluation will stay most the same. After a pod is bound to a node, the scheduler will choose the requested number of devices from the matching ComputeResource on the node, and record this information in the mentioned new field. After that, it increases the NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. Note that if the AllocatedDeviceIDs field is pre-specified, scheduler should honor this binding instead of overwriting it, similar to how it handles pre-specified NodeName. +To maintain and propagate ResourceClass to ComputeResource binding information, the scheduler will need to record this information in a newly introduced ContainerSpec field, similar to the existing NodeName field, and Kubelet will need to consume this information. During the initial implementation, we propose to encode the ResourceClass to the underlying compute resource binding information in a new `AllocatedComputeResources` field in ContainerSpec. 
+```golang +AllocatedComputeResources map[string]AllocatedResourceList +type AllocatedResourceList struct { + ComputeResourceName string + Count int32 +} +``` + +For the purpose to support ResourceClass, we will extend the scheduler NodeInfo cache to store ResourceClass to the matching ComputeResource information on the node. For a given ComputeResource, its capacity will be reflected in NodeInfo.allocatableResource with all matching ResourceClass names. This way, the current node resource fitness evaluation will stay most the same. After a pod is bound to a node, the scheduler will choose the matching ComputeResource on the node, and record this information in the mentioned new field. After that, it increases the NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. Note that if the `AllocatedComputeResource` field is pre-specified, scheduler should honor this binding instead of overwriting it, similar to how it handles pre-specified NodeName. -A main reason we propose to have the scheduler make and record device level -scheduling decision is so that the scheduler can maintain accurate resource acounting information. The matching from a ResourceClass to the underlying compute resources may change -from two kinds of updates. First, cluster admins may want to add, delete, or modify a ResourceClass by adding or removing some metadata constraints or changing its priority. -Second, the properties of a physical resource may change, e.g., through a device plugin or node upgrade. -With the device level allocation information recorded in ContainerSpec, the scheduler can maintain and rebuild the NodeInfo.requestedResource cache information, even though ResourceClasses may be modified or ComputeResource properties may have changed during its offline time. +when cluster admins add, delete, or modify a ResourceClass by adding or removing some metadata constraints or changing its priority. +In such cases, as long as scheduler has already assigned the pod to a node with a ComputeResource, it doesn't matter whether the old ResourceClass would be valid or not. As mentioned above, the scheduler would update NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. E.g., if a ComputeResource matches ResourceClass RC-A first, and the scheduler assign that ComputeResource to two pods, it will have "RC-A: 2" in NodeInfo.requestedResource. Then suppose there is a ResourceClass update comes in and that same ComputeResource now matches another ResourceClass RC-B. The scheduler will have both "RC-A: 2" and "RC-B: 2" in its NodeInfo.requestedResource cache, even though the ComputeResource no longer matches RC-A. This makes sure we will not over-allocate compute resources. And by recording this info in ContainerSpec, scheduler can re-build this cache info after restarts, even though the matching relationship may have changed. We do notice that keeping track of ResourceClass to the underlying compute resource binding may bring scaling concern on the scheduler. 
In particular, From 55ecd0aa37acd748c19493b82956dfed5191c0d9 Mon Sep 17 00:00:00 2001 From: vikaschoudhary16 Date: Sat, 18 Aug 2018 01:02:53 -0400 Subject: [PATCH 5/5] Some changes to reflect the recent POR --- keps/sig-node/00014-resource-api.md | 176 ++++++++++++++++++---------- 1 file changed, 116 insertions(+), 60 deletions(-) diff --git a/keps/sig-node/00014-resource-api.md b/keps/sig-node/00014-resource-api.md index 33d45db4349..fe242c3878b 100644 --- a/keps/sig-node/00014-resource-api.md +++ b/keps/sig-node/00014-resource-api.md @@ -51,9 +51,9 @@ The device plugin support added in Kubernetes 1.8 makes it easy for vendors to d **Can this be solved without resource classes:** Without ResourceClass, people would rely on `NodeLabels`, `NodeAffinity`, `Taints`, and `Tolerations` to steer workloads to the appropriate nodes, or build their own [non-upstream solutions](https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576) to allow users to specify their resource specific metadata requirements. Workloads would have different experience on consuming non-primary compute resources on k8s. As time goes and more non-upstream solutions were deployed, user experience becomes fragmented across different environments. Furthermore, `NodeLabels` and `Taints` were designed as node level properties. They can't support multiple types of compute resources on a single node, and don't integrate well with resource quota. Even with the recent [Pod Scheduling Policy proposal](https://github.com/kubernetes/community/pull/1937), cluster admins can either allow or deny pods in a namespace to specify a `NodeAffinity` or `Toleration`, but cannot assign different quota to different namespaces.
**How Resource classes can solve this:** I, operator/admin, create different ResourceClasses for different types of GPUs. User workloads can request different types of GPUs in their `ContainerSpec` resource requests/limits through the corresponding ResourceClass name, in the same way as they request primary resources. Now since resource classes are quota controlled, end-user will be able to consume the requested GPUs only if they have enough quota.
**Similar use case for network devices:** A cluster can have different types of high-performance NICs and/or infiniband cards, with different performance and cost. E.g., some nodes may have 40 Gig high-performance NICs and some may have 10 Gig high-performance NICs. Some devices may support RDMA and some may not. Different workloads may desire to use different type of high-network access devices depending on their performance and cost tradeoff.
-**Similar use case for FPGA:** A cluster can have different FPGA cards programmed with different functions or supports different functionalities. For example, some might have embedded SDN control plane functionality and some might have embedded crypto logic. One workload may want a subset of these FPGA functionalities advertised as resource attributes.
+**Similar use case for FPGA:** A cluster can have different FPGA cards programmed with different functions or support different functionalities. For example, some might have embedded SDN control plane functionality and some might have embedded crypto logic. One workload may want a subset of these FPGA functionalities advertised as resource attributes.
-- As a cluster operator, nodes in my cluster have GPU HW from different generations. I want to classify GPU nodes into one of the three categories, silver, gold and platinum depending upon the launch timeline of the GPU family eg: Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.
+- As a cluster operator, nodes in my cluster have GPU HW from different generations. I want to classify GPU devices into one of the three categories, silver, gold and platinum depending upon the launch timeline of the GPU family eg: Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.
**Motivation:** As time progresses in a cluster lifecycle, new advanced, high-performance, expensive variants of GPUs get added to the cluster nodes, while older variants continue to co-exist. Some workloads strictly want the latest GPUs, while others are fine with older ones. Because there is such a wide range of types, managing them at the granularity of each GPU type would be hard and confusing. Grouping them into a few broad categories is more convenient to manage.<br/>
**Can this be solved without resource classes:** A unique taint can be used to represent a category like silver. Nodes can be tainted accordingly depending upon the type of GPUs availability. User pods can use tolerations to steer workloads to the appropriate nodes. Now there are two problems. First, access control on tolerations and second, mechanism to quota resources. Though a new feature, [Pod Scheduling Policy](https://github.com/kubernetes/community/pull/1937), is under design that can address first problem of access control on tolerations, there is no solution for second problem i.e quota control.
**How Resource classes can solve this:** I, the operator/admin, create three resource classes: GPU-Platinum, GPU-Gold, and GPU-Silver. Since resource classes are quota controlled, end users will be able to request a resource class only if quota has been allocated for it.<br/>
@@ -62,8 +62,8 @@ The device plugin support added in Kubernetes 1.8 makes it easy for vendors to d - As an user, I want to be able to utilize different 'types' of a HW resource (may be from the same vendor) while not losing workload portability when moving from one cluster/cloud to another. There can be one type of High-performance NIC on one cluster and another type of high-performance NIC on another cluster. I want to offer high-performance NICs to be consumed under a same portable name, as long as their capabilities are almost same. If pods are consuming these high-performance NICs with a generic resource class name, workload can be migrated from one cluster to another transparently.
**Motivation:** Promotes workload portability and reduces downtime.<br/>
-**How Resource classes can solve this:** The user can create different resource classes in different environments to match the underlying hardware configurations, but with the same ResourceClass name. This allows workloads to migrate around different environments without changing their workload specs.
**Can this be solved without resource classes:** No.
+**How Resource classes can solve this:** The user can create different resource classes in different environments to match the underlying hardware configurations, but with the same ResourceClass name. This allows workloads to migrate around different environments without changing their workload specs.
- As a vendor, I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware. I want to provide my users some example best-practice configuration specs, with which their applications can use my resource more efficiently.
**Motivation:** Enables more compute resources and their advanced features on Kubernetes
@@ -81,8 +81,8 @@ Vendors can use DevicePlugin API to propagate new hardware features, and provide ## Objectives Essentially, a ResourceClass object maps a non-native compute resource with a specific set of properties to a portable name. Cluster admins can create different ResourceClass objects with the same generic name on different clusters to match the underlying hardware configuration. Users can then use the portable names to consume the matching compute resources. Through this extra abstraction layer, we are hoping to achieve the following goals: - **Allows workloads to request compute resources with wide range of properties in simple and standard way.** We propose to introduce a new `ComputeResource` API field in `Node.Status` to store a list of `ComputeResource` objects. Kubelet can encapsulate the resource metadata information associated with its underlying physical compute resources and propagate this information to the scheduler by appending it to the `ComputeResource` list in the `Node.Status`. With the resource metadata information, the Kubernetes scheduler can determine the fitness of a node for a container resource request expressed through ResourceClass name by evaluating whether the node has enough unallocated ComputeResource matching the property constraints specified in the ResourceClass. This allows the Kubernetes scheduler to take resource property into account to schedule pods on the right node whose hardware configuration meets the specific resource metadata constraints. -- **Allows cluster admins to configure and manage different kinds of non-native compute resources in flexible and simple ways.** A cluster admin creates a ResourceClass object that specifies a portable ResourceClass name (e.g., `fast-nic-gold`), and list of property matching constraints (e.g., `resourceName in (solarflare.com/fast-nic intel.com/fast-nic)`, or `type=XtremeScale-8000`, or `bandwidth=100`). The property matching constraints follow the generic LabelSelector format, which allows us to cover a wide range of resource specific properties. The cluster admin can then define and manage resource quota with the created ResourceClass object. -- **Allows vendors to export their resources on Kubernetes more easily.** The device plugin that a vendor needs to implement to make their resource available on Kubernetes lives outside Kubernetes core repository. The device plugin API will be extended to pass device properties from device plugin to Kubelet. Kubelet will propagate this information to the scheduler through ComputeResource and the scheduler will match a ComputeResource with certain properties to the best matching ResourceClass, and support resource requests expressed through ResourceClass name. The device plugin only needs to retrieve device properties through some device-specific API or tool, without needing to watch or understand either ComputeResource objects or ResourceClass objects. +- **Allows cluster admins to configure and manage different kinds of non-native compute resources in flexible and simple ways.** A cluster admin creates a ResourceClass object that specifies a portable ResourceClass name (e.g., `fast-nic-gold`), and list of property matching constraints (e.g., `type=XtremeScale-8000`, or `bandwidth=100`). The property matching constraints follow the generic LabelSelector format, which allows us to cover a wide range of resource specific properties. The cluster admin can then define and manage resource quota with the created ResourceClass object. 
+- **Allows vendors to export their resources on Kubernetes more easily.** The device plugin that a vendor needs to implement to make their resource available on Kubernetes lives outside the Kubernetes core repository. The device plugin API will be extended to pass device properties from the device plugin to the Kubelet. Kubelet will propagate this information to the scheduler through ComputeResource, and the scheduler will match a ComputeResource with certain properties to the matching ResourceClass and support resource requests expressed through ResourceClass name. The device plugin only needs to retrieve device properties through some device-specific API or tool, without needing to watch or understand either ComputeResource objects or ResourceClass objects.
- **Provides a unified interface to interpret compute resources across various system components such as Quota and Container Resource Spec.** By introducing ResourceClass as a first-class API object, we provide a built-in solution for users to define their special resource constraints through this API, to request such resources through the existing Container Resource Spec, to limit access for such resources through the existing resource Quota component and to ensure their Pods land on the nodes with the matching physical resources through the default Kubernetes scheduling.
- **Supports node-level resources as well as cluster-level resources.** Certain types of compute resources are tied to single nodes and are only accessible locally on those nodes. On the other hand, some types of compute resources, such as network attached resources, can be dynamically bound to a chosen node that doesn’t have the resource available till the binding finishes. For such resources, the metadata constraints specified through ResourceClass can be consumed by a standalone controller or a scheduler extender during their resource provisioning or scheduler filtering so that the resource can be provisioned properly to meet the specified metadata constraints.
- **Supports cluster auto scaling for extended resources.** We have seen challenges on how to make cluster autoscaler work seamlessly with dynamically exported extended resources. In particular, for node level extended resources that are exported by a device plugin, cluster autoscaler needs to know what resources will be exported on a newly created node, how much of such resources will be exported and how long it will take for the resource to be exported on the node. Otherwise, it would keep creating new nodes for the pending pod during this time gap. For cluster level extended resources, their resource provisionings are generally performed dynamically by a separate controller. Cluster autoscaler needs to be taught to filter out the resource requests for such resources for the pending pod so that it can create the right type of node based on node level resource requests. Note that Kubelet and the scheduler have the similar need to ignore such resource requests during their `PodFitsResources` evaluation. As we are introducing the new resource API that can be used to export arbitrary resource metadata along with extended resources, we need to define a general mechanism for cluster autoscaler to learn the upcoming resource property and capacity on a new node and ensure a consistent resource evaluation policy among cluster autoscaler, scheduler and Kubelet.
@@ -111,12 +111,25 @@ type ResourceClass struct {
type ResourceClassSpec struct {
  // raw resource name. E.g.: nvidia.com/gpu
  ResourceName string
-  // defines general resource property matching constraints.
-  // e.g.: zone in { us-west1-b, us-west1-c }; type: k80
-  MetadataRequirements metav1.LabelSelector
-  // used to compare preference of two matching ResourceClasses
-  // The purpose to introduce this field is explained more later
-  Priority int
+  // ResourceSelector selects resources. Multiple selectors are ORed.
+  ResourceSelector []ResourcePropertySelector
+}
+
+type ResourcePropertySelector struct {
+  // A list of resource/device selector requirements
+  MatchExpressions []ResourceSelectorRequirement
+}
+
+// A resource selector requirement is a selector that contains values, a key, and an operator
+// that relates the key and values
+type ResourceSelectorRequirement struct {
+  // The label key that the selector applies to
+  Key string
+  // Example values: "0.1", "intel"
+  Values []string
+  // Similar to NodeSelectorOperator. Valid operators are: "In", "NotIn",
+  // "Exists", "DoesNotExist", "Gt", and "Lt".
+  Operator ResourceSelectorOperator
}
```
@@ -126,11 +139,11 @@ kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
-  resourceName: "nvidia.com/nvidia-gpu"
-  labelSelector:
+  resourceName: "nvidia.com/gpu"
+  resourceSelector:
  - matchExpressions:
    - key: "memory"
-      operator: "GtEq"
+      operator: "Gt"
      values:
      - "15G"
@@ -145,8 +158,7 @@ spec:
    nvidia.high.mem: 2
```
-Above resource class will select all the nvidia-gpus which have memory greater than and equal to 30 GB.
+The above resource class will select all the nvidia-gpus which have memory greater than 15GB.

YAML example 2:
```yaml
@@ -158,7 +170,7 @@ spec:
  labelSelector:
  - matchExpressions:
    - key: "speed"
-      operator: "GtEq"
+      operator: "Gt"
      values:
      - "40GBPS"
@@ -172,8 +184,7 @@ spec:
  limits:
    fast.nic: 1
```
-Above resource class will select all the NICs with speed greater than equal to 40 GBPS.
+The above resource class will select all the NICs with speed greater than 40GBPS.

Possible fields we may consider to add later include:
- `AutoProvisionConfig`. This field can be used to specify resource auto provisioning config in different cloud environments.
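For illustration only, the following sketch shows one way such selector constraints could be evaluated against a ComputeResource's properties. The locally declared types, the string-valued property map, and the helper names (`matches`, `matchesAll`) are assumptions made for this example, and `Gt`/`Lt` handling is omitted for brevity. Selectors are ORed while the requirements inside one selector are ANDed, mirroring the API sketch above.

```golang
package main

import "fmt"

// Minimal stand-ins for the proposed API types; not part of the proposal itself.
type ResourceSelectorOperator string

const (
	In           ResourceSelectorOperator = "In"
	NotIn        ResourceSelectorOperator = "NotIn"
	Exists       ResourceSelectorOperator = "Exists"
	DoesNotExist ResourceSelectorOperator = "DoesNotExist"
)

type ResourceSelectorRequirement struct {
	Key      string
	Values   []string
	Operator ResourceSelectorOperator
}

type ResourcePropertySelector struct {
	MatchExpressions []ResourceSelectorRequirement
}

// matches reports whether a ComputeResource's property map satisfies the
// constraints: selectors are ORed, requirements within one selector are ANDed.
// Gt/Lt are omitted here; a real implementation would parse quantities such as
// "15G" before comparing.
func matches(props map[string]string, selectors []ResourcePropertySelector) bool {
	for _, sel := range selectors {
		if matchesAll(props, sel.MatchExpressions) {
			return true
		}
	}
	return false
}

func matchesAll(props map[string]string, reqs []ResourceSelectorRequirement) bool {
	for _, r := range reqs {
		v, ok := props[r.Key]
		switch r.Operator {
		case Exists:
			if !ok {
				return false
			}
		case DoesNotExist:
			if ok {
				return false
			}
		case In:
			if !ok || !contains(r.Values, v) {
				return false
			}
		case NotIn:
			if ok && contains(r.Values, v) {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func contains(vals []string, v string) bool {
	for _, x := range vals {
		if x == v {
			return true
		}
	}
	return false
}

func main() {
	gpu := map[string]string{"gpuType": "p100", "supportECC": "true"}
	eccOnly := []ResourcePropertySelector{{MatchExpressions: []ResourceSelectorRequirement{
		{Key: "gpuType", Operator: In, Values: []string{"p100"}},
		{Key: "supportECC", Operator: In, Values: []string{"true"}},
	}}}
	fmt.Println(matches(gpu, eccOnly)) // true
}
```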
@@ -186,7 +197,7 @@ of this proposal.
### Kubelet Extension
On node level, extended resources can be exported automatically by a third-party plugin through the Device Plugin API. We propose to extend the current Device Plugin API that allows device plugins to send to the Kubelet per-device properties during device listing. Exporting device properties at per-device level instead of per-resource level allows a device plugin to manage devices with heterogeneous properties.
-After receiving device availability and property information from a device plugin, Kubelet needs to propagate this information to the scheduler so that scheduler can take resource metadata into account when it is making the scheduling decision. We propose to add a new `ComputeResource` list field in NodeStatus API, where each instance in the list represents a device resource and the associated resource properties. Once a node is configured to support ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields to avoid resource multiple counting. During the initial phase, we plan to start with exporting extended resources through the ComputeResource API but leaves primary resources in its current exporting model. We can extend the ComputeResource model to support primary resources later after getting more experience through the initial phase. Kubelet will update ComputeResources field upon any resource availability or property change for node-level resources.
+After receiving device availability and property information from a device plugin, Kubelet needs to propagate this information to the scheduler so that the scheduler can take resource metadata into account when it is making the scheduling decision. We propose to add a new `ComputeResource` list field in the NodeStatus API, where each instance in the list represents a resource with a unique set of resource properties. Once a node is configured to support the ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields to avoid counting resources multiple times. During the initial phase, we plan to start with exporting extended resources through the ComputeResource API but leave primary resources in their current exporting model. We can extend the ComputeResource model to support primary resources later after getting more experience through the initial phase. Kubelet will update the ComputeResources field upon any resource availability or property change for node-level resources.

We propose to start with the following struct definition:

@@ -197,7 +208,7 @@ type NodeStatus struct {
…
}
type ComputeResource struct {
-  // unique and deterministically generated. “nodeName-resourceName-propertyHash” naming convention,
+  // unique and deterministically generated. “resourceName-propertyHash” naming convention,
  // where propertyHash is generated by calculating a hash over all resource properties
  Name string
  // raw resource name. E.g.: nvidia.com/nvidia-gpu
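As an illustration of this naming convention, propertyHash could be derived by hashing the resource properties in sorted key order so that the same property set always yields the same name. The hash function, the encoding and the helper name below are assumptions made for this sketch, not part of the proposal.

```golang
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// computeResourceName derives a deterministic "resourceName-propertyHash" name.
// Properties are serialized in sorted key order so the same property set always
// hashes to the same value; the hash and encoding here are illustrative only.
func computeResourceName(resourceName string, props map[string]string) string {
	keys := make([]string, 0, len(props))
	for k := range props {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k + "=" + props[k] + ";"))
	}
	// Raw resource names contain "/" (e.g. nvidia.com/gpu); strip it for readability.
	base := strings.ReplaceAll(resourceName, "/", "-")
	return fmt.Sprintf("%s-%x", base, h.Sum64())
}

func main() {
	props := map[string]string{"gpuType": "p100", "memory": "16G"}
	fmt.Println(computeResourceName("nvidia.com/gpu", props))
}
```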
@@ -212,16 +223,20 @@ type ComputeResource struct {
  AllocatableDevices []string
}
```
-The ComputeResource name needs to be unique and deterministically generated. We propose to use “nodeName-resourceName-propertyHash” naming convention for node level resources, where propertyHash is generated by calculating a hash over all resource properties. The properties of a physical resource may change, e.g., through a driver or node upgrade. When this happens, Kubelet will need to create a new ComputeResource object and delete the old one. However, some pods may already be scheduled on the node and allocated with the old ComputeResource, and it is hard for the scheduler to do its bookkeeping when such pods are still running on the node while the associated ComputeResource is removed. We have a few options to handle this situation.
-- First, we may document that to change or remove a device resource, the node has to be drained. This requirement however may complicate device plugin upgrade process.
-- Another option is that Kubelet can evict the pods that are allocated with an unexisting ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade.
+The ComputeResource name needs to be unique and deterministically generated. We propose to use the “resourceName-propertyHash” naming convention for node level resources, where propertyHash is generated by calculating a hash over all resource properties. The properties of a physical resource may change, e.g., through a driver or node upgrade. When this happens, Kubelet will need to create a new ComputeResource item and delete the old one. However, some pods may already be scheduled on the node and allocated with the old ComputeResource, and it is hard for the scheduler to do its bookkeeping when such pods are still running on the node while the associated ComputeResource is removed. We have a few options to handle this situation.
+- First, we may document that to change or remove a resource property, the node has to be drained. This requirement, however, may complicate the device plugin upgrade process.
+- Another option is that Kubelet can evict the pods that are allocated with a non-existing ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade.
- To support a less disruptive model, upon resource property change, Kubelet can still export capacity at the old ComputeResource name for the devices used by active pods, and export capacity at the new matching ComputeResource name for devices not in use. Only when those pods finish running does that particular node finish its transition. This approach avoids resource multiple counting and simplifies the scheduler resource accounting. One potential downside is that the transition may take quite a long time if there are long running pods using the resource on the nodes. In that case, cluster admins can still drain the node at a convenient time to speed up the transition. Note that this approach does add certain code complexity to the Kubelet DeviceManager component.
+We propose to start with the first option, i.e., device property change requires a node drain, for its simplicity. We can re-evaluate the other options after the initial phase as we gather more experience on managing resource life cycles.
+
Possible fields we may consider to add later include:
- `DeviceUnits resource.Quantity`. This field can be used to support fractional resource or infinite resource. In a more advanced use case, a device plugin may even advertise a single Device with X DeviceUnits so that it can make its own
-  device allocation decisions, although this usually require the device plugin
+  device allocation decisions, although this usually requires the device plugin
  to implement its own complex logic to track resource life cycle.
- `Owner string`. Can be Kubelet or some cluster-level controller to indicate the ownership and scope of the resource.

@@ -233,76 +248,117 @@ Note we intentionally leave these fields out of the initial design to limit the scope
of this proposal.
### Scheduler Extension
-The scheduler needs to watch for NodeStatus ComputeResources field changes and ResourceClass object updates and caches the binding information between the ResourceClass and the matchingComputeResources so that it can serve container resource request expressed through ResourceClass names.
+The scheduler needs to watch for Node.Status.ComputeResources field changes and ResourceClass object updates, and cache the binding information between each ResourceClass and the matching ComputeResources so that it can serve container resource requests expressed through ResourceClass names.
A natural question is how we should define the matching behavior. Suppose there are two ResourceClass objects. ResourceClass RC1 has metadata matching constraint “property1 = value1”, and ResourceClass RC2 has metadata matching constraint “property2 = value2”. Suppose a ComputeResource has both “property1: value1” and “property2: value2” properties, and so matches both ResourceClasses.
Now should the scheduler consider this ComputeResource as qualified for both ResourceClasses RC1 and RC2, or only one of them?
-We feel the desired answer to this question may vary across different types of resources, properties and use cases. To illustrate this, lets consider the following example: A GPU device plugin in a cluster with different types of GPUs may be configured to advertise under a common ResourceName as "nvidia.com/nvidia-tesla-k80", at the beginning. To support per GPU type resource quota, cluster admins may define the following ResourceClasses:
+To answer this question, let's consider the following example.
+A GPU device plugin versionX exports a single device attribute “gpuType” that can be “k80” or “p100” in a cluster with these two types of GPUs. Cluster admins define two ResourceClasses accordingly.

```yaml
kind: ResourceClass
metadata:
  name: nvidia-k80
spec:
-  resourceName: "nvidia.com/nvidia-tesla-k80"
+  resourceName: "nvidia.com/gpu"
+  resourceSelector:
+  - matchExpressions:
+    - key: "gpuType"
+      operator: "In"
+      values:
+      - "k80"

kind: ResourceClass
metadata:
  name: nvidia-p100
spec:
-  resourceName: "nvidia.com/nvidia-tesla-p100"
+  resourceName: "nvidia.com/gpu"
+  resourceSelector:
+  - matchExpressions:
+    - key: "gpuType"
+      operator: "In"
+      values:
+      - "p100"
```
-Later on, suppose the cluster admins add a new GPU node group with a new version of GPU device plugin that exports another resource property "Nvlink" which will be set true for nvidia-tesla-p100 GPUs connected through nvlinks. To utilize this new feature, the cluster admins define the following new ResourceClass with nvlink constraints:
+Suppose that in the cluster some of the p100 GPUs have ECC enabled and some don’t. Cluster admins want to differentiate these two types of p100 GPUs so that users who need ECC enabled GPUs (e.g., inference services) can request these specific GPUs. The cluster admin rolls out a new version of the device plugin, versionY, that exports a new attribute, supportECC. The device plugin versionY may only run on some new version of nodes in the cluster while some old version of nodes are still running device plugin versionX. Cluster admins then define another ResourceClass:

```yaml
kind: ResourceClass
metadata:
- name: nvidia-p100-nvlink
+ name: nvidia-p100-ecc
spec:
-  resourceName: "nvidia.com/nvidia-tesla-p100"
-  labelSelector:
-  - key: "Nvlink"
-    operator: "Eq"
+  resourceName: "nvidia.com/gpu"
+  resourceSelector:
+  - matchExpressions:
+    - key: "gpuType"
+      operator: "In"
+      values:
+      - "p100"
+    - key: "supportECC"
+      operator: "In"
      values:
      - "true"
```
-Now we face the question that whether the scheduler should allow Pods requesting "nvidia-p100" to land on a node in this new GPU node groups. So far, we have received different feedbacks on this question. In some use cases, users would like to have minimum matching behavior that as long as the underlying hardware matches the minimal requirements specified through ResourceClass contraints, they want to allow Pods to be scheduled on the hardware. On the other hand, some users desire to reserve expensive hardware resources for users who explicitly request them. We feel both use cases are valid requirements. Allowing a ComputeResource to match multiple ResourceClasses as long as it matches their matching constraints
+Cluster admins may consider nvidia-p100-ecc a precious resource and want to reserve it for special workload pods that explicitly request it. They can achieve this by updating the nvidia-p100 ResourceClass as follows:
+
+```yaml
+kind: ResourceClass
+metadata:
+  name: nvidia-p100
+spec:
+  resourceName: "nvidia.com/gpu"
+  resourceSelector:
+  - matchExpressions:
+    - key: "gpuType"
+      operator: "In"
+      values:
+      - "p100"
+    - key: "supportECC"
+      operator: "NotIn"
+      values:
+      - "true"
+```
+
+On the other hand, cluster admins may want to allow pods requesting nvidia-p100 to use ECC p100 GPUs if they are idle, but rely on scheduler preemption to re-assign those devices to higher-priority pods requesting nvidia-p100-ecc. Such use cases require scheduler support for matching a ComputeResource to multiple qualified ResourceClasses.
+We feel this model perhaps yields the least surprising behavior to users and also simplifies the upgrade
-scenario as new resource properties are introduced into the system. Therefore we support this behavior by default. To also provide an easy way for cluster admins to reserve expensive compute resources and control their access with resource quota, we propose to include a Priority field in ResourceClass API. By default, the value of this field is set to zero, but cluster admins can set it to a higher value, which would prevent its matching compute resources from being matched by lower priority ResourceClasses. i.e., when a ComputeResource matches multiple ResourceClasses with different Priority values, the scheduler will choose those with the highest Priority.
-Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through raw resource name can continue to be scheduled properly when administrators add ResourceClass in a cluster. To guarantee this, the scheduler may just consider raw resource as a special ResourceClass with empty resource metadata constraints and priority higher than any resource class.
-
-Because a ComputeResource can match multiple ResourceClasses, Scheduler and Kubelet need to ensure a consistent view on ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision. This way, when another Pod comes in that requests RC2, the scheduler can determine whether Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2.
+scenario as new resource properties are introduced into the system.
+Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through a raw resource name can continue to be scheduled properly when administrators add ResourceClasses in a cluster.
+To guarantee this, the scheduler may just consider a raw resource as a special ResourceClass with empty resource metadata constraints.
+
+Because a ResourceClass can match multiple ComputeResources on a node, the scheduler and Kubelet need to ensure a consistent view of ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision. This way, when another Pod comes in that requests RC2, the scheduler can determine whether the Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2.
To maintain and propagate ResourceClass to ComputeResource binding information, the scheduler will need to record this information in a newly introduced ContainerSpec field, similar to the existing NodeName field, and Kubelet will need to consume this information. During the initial implementation, we propose to encode the ResourceClass to the underlying compute resource binding information in a new `AllocatedComputeResources` field in ContainerSpec.
+
```golang
+// Key is the ResourceClass name.
AllocatedComputeResources map[string]AllocatedResourceList
-type AllocatedResourceList struct {
+
+type AllocatedResourceList []AllocatedResource
+
+type AllocatedResource struct {
  ComputeResourceName string
  Count int32
}
```
-For the purpose to support ResourceClass, we will extend the scheduler NodeInfo cache to store ResourceClass to the matching ComputeResource information on the node. For a given ComputeResource, its capacity will be reflected in NodeInfo.allocatableResource with all matching ResourceClass names. This way, the current node resource fitness evaluation will stay most the same. After a pod is bound to a node, the scheduler will choose the matching ComputeResource on the node, and record this information in the mentioned new field. After that, it increases the NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. Note that if the `AllocatedComputeResource` field is pre-specified, scheduler should honor this binding instead of overwriting it, similar to how it handles pre-specified NodeName.
+To support ResourceClass, we will extend the scheduler NodeInfo cache to store the ResourceClass to matching ComputeResource information on the node. For a given ComputeResource, its capacity will be reflected in NodeInfo.allocatableResource under all matching ResourceClass names. This way, the current node resource fitness evaluation will stay mostly the same.
+
+When the scheduler is assuming a Pod on a chosen node, it selects the qualified ComputeResources for the Pod's ResourceClass requests, updates its local cache to record this assumed binding, and increases the cached NodeInfo.requestedResource for all of the matching ResourceClass names of every chosen ComputeResource.
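As a rough illustration of this accounting, and not the actual scheduler code, the sketch below uses a simplified stand-in for the scheduler's per-node cache; the type and field names are assumptions made for this example.

```golang
package main

import "fmt"

// nodeInfo is a simplified stand-in for the scheduler's per-node cache entry;
// the real NodeInfo type in the scheduler is more involved.
type nodeInfo struct {
	// requestedResource counts requests per ResourceClass name.
	requestedResource map[string]int64
	// matchingClasses maps a ComputeResource name to every ResourceClass it satisfies.
	matchingClasses map[string][]string
}

// assumeBinding records that `count` units of the ComputeResource `crName` were
// chosen for a Pod's ResourceClass request. The request is charged against every
// ResourceClass the ComputeResource matches, so a later request for any of those
// classes sees the capacity as used and the node is never over-allocated.
func (n *nodeInfo) assumeBinding(crName string, count int64) {
	for _, rc := range n.matchingClasses[crName] {
		n.requestedResource[rc] += count
	}
}

func main() {
	n := &nodeInfo{
		requestedResource: map[string]int64{},
		matchingClasses: map[string][]string{
			// CR2 satisfies both RC1 and RC2, CR1 only RC1 (the example above).
			"cr1": {"rc1"},
			"cr2": {"rc1", "rc2"},
		},
	}
	n.assumeBinding("cr2", 1)        // a Pod requesting rc1 was bound to cr2
	fmt.Println(n.requestedResource) // map[rc1:1 rc2:1]
}
```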
+Note that if the `AllocatedComputeResource` field is pre-specified, the scheduler should honor this binding instead of overwriting it, similar to how it handles a pre-specified NodeName.
+
+We will extend the scheduler Binder so that, in addition to binding the selected node to the pod, it also binds the selected ComputeResources to the requested ResourceClasses.
The matching from a ResourceClass to the underlying compute resources may change
-when cluster admins add, delete, or modify a ResourceClass by adding or removing some metadata constraints or changing its priority.
-In such cases, as long as scheduler has already assigned the pod to a node with a ComputeResource, it doesn't matter whether the old ResourceClass would be valid or not. As mentioned above, the scheduler would update NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. E.g., if a ComputeResource matches ResourceClass RC-A first, and the scheduler assign that ComputeResource to two pods, it will have "RC-A: 2" in NodeInfo.requestedResource. Then suppose there is a ResourceClass update comes in and that same ComputeResource now matches another ResourceClass RC-B. The scheduler will have both "RC-A: 2" and "RC-B: 2" in its NodeInfo.requestedResource cache, even though the ComputeResource no longer matches RC-A. This makes sure we will not over-allocate compute resources. And by recording this info in ContainerSpec, scheduler can re-build this cache info after restarts, even though the matching relationship may have changed.
+when cluster admins add, delete, or modify a ResourceClass by adding or removing some metadata constraints.
+In such cases, as long as the scheduler has already assigned the pod to a node with a ComputeResource, it doesn't matter whether the old ResourceClass would be valid or not.
+As mentioned above, the scheduler would update NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. E.g., if a ComputeResource matches ResourceClass RC-A first and the scheduler assigns that ComputeResource to two pods, it will have "RC-A: 2" in NodeInfo.requestedResource. Then suppose a ResourceClass update comes in and that same ComputeResource now matches another ResourceClass RC-B. The scheduler will have both "RC-A: 2" and "RC-B: 2" in its NodeInfo.requestedResource cache, even though the ComputeResource no longer matches RC-A. This makes sure we will not over-allocate compute resources. And by recording this info in ContainerSpec, the scheduler can re-build this cache info after restarts, even though the matching relationship may have changed.

We do notice that keeping track of ResourceClass to the underlying compute resource binding may bring a scaling concern on the scheduler. In particular,