Proposal: Cluster Scoped Resources #1400

@@ -0,0 +1,88 @@
# Cluster Scoped Resources
Member

please follow the KEP process outlined by @kubernetes/sig-architecture-feature-requests

Contributor

Is KEP now a requirement or a recommendation? That was not clear from the contributor summit discussions.

Member

/cc @jdumars


## Abstract
Cluster scoped resources are consumable resources that do not belong to any specific node but are instead available across multiple nodes in a cluster. These resources are accounted for like other consumable resources and should be usable by the scheduler when deciding whether a pod can be scheduled.


## Motivation
Member

Software licenses are the most common reason for such features in other systems.

Resources in Kubernetes such as CPU and memory are available at the node level and are consumed by pods that request them. However, some resources do not belong to a specific node but are consumable across all nodes, or a group of nodes, in the cluster. A few such use cases are listed below.

#### Use Cases
1. Software Licenses that can be shared by pods across the entire cluster

2. IP Addresses:
Nodes in a cluster can be partitioned into multiple network scopes and each network scope can have a certain number of available IPs that can be assigned to pods running on those nodes. Hence, the nodes belonging to a network scope can collectively run as many pods as the IP capacity in that scope.

3. Rack storage:
Locally attached shared storage in a rack, which is consumable by pods on nodes within a rack

4. Network Bandwidth:
Network bandwidth shared by pods. Depending on the network topology of a cluster, pods on multiple nodes share network bandwidth with each other. In use cases where pods need guaranteed network throughput, representing bandwidth as a cluster resource is essential for scheduling such pods.

## Goals
The goal is to define mechanisms to expose and consume cluster scoped resources.

## Design

### ClusterResource type
```
// pkg/api/types.go:

// ClusterResourceQuantity represents quantity of a ClusterResource
type ClusterResourceQuantity struct {
Contributor

How is the discovery/initialization flow?

Author

The cluster admin or other controllers will post the ClusterResource objects that capture the capacity and allocatable quantities of a ClusterResource, which will then be used by the scheduler.

	Quantity resource.Quantity `json:"quantity"`
	// NodeSelector is a label query over nodes which collectively provide resource Quantity
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

// ClusterResource represents a resource which is available at a cluster level
type ClusterResource struct {
Contributor

If this is a form of quota, it should be named as such - ClusterResourceNodeQuota. It’s not actually clear how this api aligns with ResourceQuota, please comment to that effect.

Author

ClusterResource is an API type that represents a cluster scoped resource. However, its integration with ResourceQuota still needs to be added, probably at a later phase such as beta?

	TypeMeta `json:",inline"`
	ObjectMeta `json:"metadata,omitempty"`
	// +optional
	Status ClusterResourceStatus `json:"status,omitempty"`
}

type ClusterResourceStatus struct {
	// Capacity represents the total quantity of ClusterResource
	// +optional
	Capacity []ClusterResourceQuantity `json:"capacity,omitempty"`
	// Allocatable represents the quantity of ClusterResource that is available for scheduling
	// +optional
	Allocatable []ClusterResourceQuantity `json:"allocatable,omitempty"`
}
```
`ClusterResourceStatus` captures the capacity and allocatable quantity of a `ClusterResource` in the form of `ClusterResourceQuantity` entries. A `ClusterResourceQuantity` represents the quantity of a `ClusterResource` that is collectively consumable on the nodes selected by `NodeSelector`.
`NodeSelector` is a label query over the nodes that collectively provide this `ClusterResource`. This field is optional; if it is not specified, the `ClusterResource` is consumable across all nodes in the cluster.
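
To make the `NodeSelector` semantics concrete, here is a minimal sketch of how a node could be tested against a `ClusterResourceQuantity`. It assumes equality-based label matching (as with a pod's `nodeSelector`), which the proposal does not spell out, and it uses a plain `int64` in place of `resource.Quantity` to stay dependency-free. The helper name `appliesToNode` and the `rack` label are hypothetical.

```go
package main

import "fmt"

// ClusterResourceQuantity mirrors the proposal's type. Quantity is an int64
// here (an assumption for this sketch) instead of resource.Quantity.
type ClusterResourceQuantity struct {
	Quantity     int64
	NodeSelector map[string]string
}

// appliesToNode reports whether a node with the given labels is one of the
// nodes that collectively provide q. Every selector entry must match a node
// label exactly; a nil or empty selector matches every node in the cluster.
func appliesToNode(q ClusterResourceQuantity, nodeLabels map[string]string) bool {
	for key, want := range q.NodeSelector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	rackA := ClusterResourceQuantity{Quantity: 100, NodeSelector: map[string]string{"rack": "a"}}
	clusterWide := ClusterResourceQuantity{Quantity: 10} // no selector: all nodes

	node := map[string]string{"rack": "b"}
	fmt.Println(appliesToNode(rackA, node))       // false
	fmt.Println(appliesToNode(clusterWide, node)) // true
}
```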


### Consuming ClusterResources

ClusterResources are consumable by pods just like CPU and memory, by specifying them in the pod's resource requests. The scheduler should account for ClusterResources so that no more than the available amount is simultaneously allocated to pods. The prefix used to identify a ClusterResource could be:
```
pod.alpha.kubernetes.io/cluster-resource-
Contributor

I'm not a fan of special prefixes. I'd like to see if we can avoid overloading resource names.

Contributor

+1, we just moved away from this pattern with extended resources.

Author

It can follow fully-qualified resource names similar to extended resources, but we need to see how those will be differentiated.

```
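
If a prefix like the tentative one in the proposal text were adopted, recognizing a ClusterResource in a pod's requests would reduce to a string check. The sketch below only illustrates that idea: the prefix is the proposal's suggestion (which reviewers questioned), not an established Kubernetes constant, and the helper `clusterResourceName` and the `example-license` name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterResourcePrefix is the tentative prefix suggested in the proposal.
const clusterResourcePrefix = "pod.alpha.kubernetes.io/cluster-resource-"

// clusterResourceName returns the part of resourceName that follows the
// prefix, and true, when resourceName names a ClusterResource. It returns
// false when the prefix is absent or nothing follows it.
func clusterResourceName(resourceName string) (string, bool) {
	if !strings.HasPrefix(resourceName, clusterResourcePrefix) {
		return "", false
	}
	name := strings.TrimPrefix(resourceName, clusterResourcePrefix)
	if name == "" {
		return "", false
	}
	return name, true
}

func main() {
	name, ok := clusterResourceName(clusterResourcePrefix + "example-license")
	fmt.Println(name, ok) // example-license true

	_, ok = clusterResourceName("cpu")
	fmt.Println(ok) // false
}
```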

### Accounting in scheduler
Member

@bsalamat, Jan 30, 2018

We have extended resources and several types of first class resources in the scheduler already. I think it would be possible to come up with a single presentation that covers all of these types. For example, I don't see much of a difference between a cluster resource and extended resource from scheduler's point of view. An extended resource with an additional "type" can represent a cluster resource.

Author

The key difference between these two resources is the "scope".
Currently, extended resources are exposed as part of node status because they are tied to a node, while cluster scoped resources have to be represented outside the scope of a node. But we can surely have a comprehensive API that covers both. From the scheduler's point of view, it will need some additional logic to calculate and cache the available capacity of a cluster scoped resource across a set of nodes.

Member

Yes, the "ResourceClass" that @jiayingz is working on is an effort in that direction to provide a comprehensive API to represent various types of resources, including cluster resources.

Contributor

Yes. As @bsalamat mentioned, we are working on a new Resource API proposal that aims to provide a comprehensive API for both node-level resources and cluster-level resources. Here is the current PR:
#782
It is still WIP and the current plan is to focus on node-level resources during the initial phase. But I think even the initial API should help solve some of the listed problems here. Please take a look and let us know if you see any missing pieces.


ClusterResources should be tracked as normal consumable resources and should be considered by the scheduler when determining whether a pod can be scheduled.
Member

@bsalamat, Jan 30, 2018

Another important aspect of cluster resources that is not covered here is how to bind these resources to a chosen node during or after scheduling. Fairly complex logic has already been added to the scheduler to handle provisioning and binding PVs to nodes during scheduling. Similar processes may be needed for other resources, such as TPUs. I think that aspect should be covered by the proposal.

@vishh @jiayingz

Author

To cover that aspect, I prefer the approach mentioned by @davidopp in his previous comment. The external agent/controller that exposes the available capacity of this resource can be made responsible for binding, or for making sure that those resources are ready to use when a pod is going to run on a node. Similarly, when a pod dies, that agent needs to deallocate/unbind the corresponding resource and increment the available quantity so that it can be used for scheduling new pods.


```
// kubernetes/plugin/pkg/scheduler/schedulercache/cluster_info.go

// ClusterInfo is cluster level aggregated information.
type ClusterInfo struct {

	clusterResources map[string]*ClusterResource
}

// kubernetes/plugin/pkg/scheduler/schedulercache/cache.go

type schedulerCache struct {
	...
	cluster *ClusterInfo
}
```

`ClusterInfo` is added to the scheduler cache to account for the ClusterResources consumed by pods. It will be exposed to the predicate and priority functions so that they can take ClusterResources into consideration when making scheduling decisions.
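
The accounting described above could look roughly like the following sketch. It is not the scheduler's actual code: the `ClusterInfo` here is a simplified stand-in that tracks plain `int64` totals per resource name, ignores `NodeSelector` scoping, and uses hypothetical method names (`Fits`, `AddPod`, `RemovePod`) and a hypothetical resource name (`example.com/license`).

```go
package main

import "fmt"

// ClusterInfo is a simplified stand-in for the proposal's scheduler-cache
// type. It tracks, per resource name, the allocatable quantity and the sum
// of the requests of pods already placed.
type ClusterInfo struct {
	allocatable map[string]int64
	requested   map[string]int64
}

// NewClusterInfo copies the allocatable quantities into a fresh ClusterInfo.
func NewClusterInfo(allocatable map[string]int64) *ClusterInfo {
	c := &ClusterInfo{
		allocatable: make(map[string]int64, len(allocatable)),
		requested:   make(map[string]int64),
	}
	for name, q := range allocatable {
		c.allocatable[name] = q
	}
	return c
}

// Fits is the predicate-style check: every requested quantity must fit in
// what remains. Unknown resources have zero allocatable and never fit.
func (c *ClusterInfo) Fits(requests map[string]int64) bool {
	for name, want := range requests {
		if c.requested[name]+want > c.allocatable[name] {
			return false
		}
	}
	return true
}

// AddPod records a placed pod's requests.
func (c *ClusterInfo) AddPod(requests map[string]int64) {
	for name, q := range requests {
		c.requested[name] += q
	}
}

// RemovePod releases a terminated pod's requests.
func (c *ClusterInfo) RemovePod(requests map[string]int64) {
	for name, q := range requests {
		c.requested[name] -= q
	}
}

func main() {
	info := NewClusterInfo(map[string]int64{"example.com/license": 2})
	pod := map[string]int64{"example.com/license": 1}

	info.AddPod(pod)
	info.AddPod(pod)
	fmt.Println(info.Fits(pod)) // false: both licenses are in use

	info.RemovePod(pod)
	fmt.Println(info.Fits(pod)) // true: one license was released
}
```

A real implementation would also need to scope each quantity to the nodes matched by its `NodeSelector` and guard the maps against concurrent access.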
Contributor

How will clusterinfo be built?

Author

clusterinfo can be built similarly to how we build nodeInfo, since the scheduler will be watching for ClusterResources.