Proposal: Cluster Scoped Resources (PR #1400)
# Cluster Scoped Resources

## Abstract

Cluster scoped resources are consumable resources that do not belong to any specific node but are instead available across multiple nodes in a cluster. These resources are accounted for like other consumable resources and should be usable by the scheduler when deciding whether a pod can actually be scheduled.

## Motivation
> **Review comment:** Software licenses are the most common reason for such features in other systems.

Resources in Kubernetes such as CPU and memory are available at the node level and can be consumed by pods by requesting them. However, some resources do not belong to a specific node but are consumable across all nodes, or a group of nodes, in the cluster. A few such use cases are listed below.
#### Use Cases

1. Software licenses: licenses that can be shared by pods across the entire cluster.

2. IP addresses: nodes in a cluster can be partitioned into multiple network scopes, and each network scope can have a certain number of available IPs that can be assigned to pods running on those nodes. Hence, the nodes belonging to a network scope can collectively run as many pods as the IP capacity in that scope.

3. Rack storage: locally attached shared storage in a rack, consumable by pods on nodes within that rack.

4. Network bandwidth: network bandwidth shared by pods. Depending on the network topology of a cluster, pods on multiple nodes will share network bandwidth with each other. In use cases where pods need guaranteed network throughput, representing bandwidth as a cluster resource is essential for scheduling such pods.
## Goals

The goal is to define mechanisms to expose and consume cluster scoped resources.

## Design
### ClusterResource type

```
// pkg/api/types.go:

// ClusterResourceQuantity represents a quantity of a ClusterResource.
type ClusterResourceQuantity struct {
	Quantity resource.Quantity `json:"quantity"`
	// NodeSelector is a label query over nodes which collectively provide resource Quantity
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

// ClusterResource represents a resource which is available at a cluster level.
type ClusterResource struct {
	TypeMeta   `json:",inline"`
	ObjectMeta `json:"metadata,omitempty"`
	// +optional
	Status ClusterResourceStatus `json:"status,omitempty"`
}

type ClusterResourceStatus struct {
	// Capacity represents the total quantity of ClusterResource
	// +optional
	Capacity []ClusterResourceQuantity `json:"capacity,omitempty"`
	// Allocatable represents the quantity of ClusterResource that is available for scheduling
	// +optional
	Allocatable []ClusterResourceQuantity `json:"allocatable,omitempty"`
}
```

> **Review comment:** How is the discovery/initialization flow?

> **Review comment:** Cluster admin or other controllers will post the

> **Review comment:** If this is a form of quota, it should be named as such - ClusterResourceNodeQuota. It's not actually clear how this api aligns with ResourceQuota, please comment to that effect.
`ClusterResourceStatus` captures the capacity and allocatable quantity for a `ClusterResource` in the form of `ClusterResourceQuantity`. `ClusterResourceQuantity` represents the quantity of a `ClusterResource` which is collectively consumable on the nodes selected by `NodeSelector`.

`NodeSelector` is a label query over the nodes which collectively provide this `ClusterResource`. This field is optional; if not specified, the `ClusterResource` is consumable across all nodes in the cluster.
### Consuming ClusterResources

ClusterResources are consumable by pods just like CPU and memory, by specifying them in the pod's resource requests. The scheduler should take care of the resource accounting for ClusterResources so that no more than the available amount is simultaneously allocated to pods. The prefix used to identify a ClusterResource could be

```
pod.alpha.kubernetes.io/cluster-resource-
```

> **Review comment:** I'm not a fan of special prefixes. I'd like to see if we can avoid overloading resource names.

> **Review comment:** +1, we just moved away from this pattern with extended resources.

> **Review comment:** It can follow fully-qualified resource names similar to extended resources, but we need to see how those will be differentiated.
### Accounting in scheduler

> **Review comment:** We have extended resources and several types of first class resources in the scheduler already. I think it would be possible to come up with a single representation that covers all of these types. For example, I don't see much of a difference between a cluster resource and an extended resource from the scheduler's point of view. An extended resource with an additional "type" can represent a cluster resource.

> **Review comment:** The key difference between these two resources is the "scope".

> **Review comment:** Yes, the "ResourceClass" that @jiayingz is working on is an effort in that direction to provide a comprehensive API to represent various types of resources, including cluster resources.

> **Review comment:** Yes. As @bsalamat mentioned, we are working on a new Resource API proposal that aims to provide a comprehensive API for both node-level resources and cluster-level resources. Here is the current PR:

ClusterResources should be tracked as normal consumable resources and should be considered by the scheduler when determining if a pod can actually be scheduled.

> **Review comment:** Another important aspect of cluster resources which is not covered here is how to bind these resources to a chosen node during/after scheduling. A fairly complex logic is already added to the scheduler to handle provisioning and binding PVs to nodes during scheduling. Similar processes may be needed for other resources, such as TPUs, etc. I think that aspect should be covered by the proposal.

> **Review comment:** To cover that aspect, I prefer the approach mentioned by @davidopp in his previous comment. The external agent/controller which exposes the available capacity of this resource can be made responsible for binding, or for making sure that those resources are ready to use when a pod is going to run on a node. Similarly, when a pod dies, that agent needs to deallocate/unbind the corresponding resource and increment the available quantity so that it can be used for scheduling of new pods.
```
// kubernetes/plugin/pkg/scheduler/schedulercache/cluster_info.go

// ClusterInfo is cluster level aggregated information.
type ClusterInfo struct {
	clusterResources map[string]*ClusterResource
}

// kubernetes/plugin/pkg/scheduler/schedulercache/cache.go

type schedulerCache struct {
	...
	cluster *ClusterInfo
}
```
`ClusterInfo` is added to the scheduler cache to do accounting for ClusterResources consumed by pods. `ClusterInfo` will be exposed to the predicate and priority functions in order to take ClusterResources into consideration while making scheduling decisions.
> **Review comment:** please follow the KEP process outlined by @kubernetes/sig-architecture-feature-requests

> **Review comment:** Is KEP now a requirement or a recommendation? That was not clear from the contributor summit discussions.

> **Review comment:** /cc @jdumars

> **Review comment:** @vishh @timothysc: is this the template that needs to be followed: https://github.com/kubernetes/community/blob/master/keps/0000-kep-template.md