From d3de0da85d9ab13b0a359703a25b55707aff5d54 Mon Sep 17 00:00:00 2001
From: Francesco Romani
Date: Tue, 2 Feb 2021 17:54:56 +0100
Subject: [PATCH] Add GetAllocatableResource to PodResource API

In order to simplify the KEP and make it more understandable, and
to comply with the new process, we extract the unit of work
still ongoing in this KEP from
https://github.com/kubernetes/enhancements/pull/1884

Work in this area was done during the 1.20 and 1.21 cycles in
https://github.com/kubernetes/kubernetes/pull/95734

Rationale, discussion and documentation for all the changes
including the one proposed in this KEP have been described in
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments
and reported here where relevant

Signed-off-by: Francesco Romani
---
 .../README.md | 275 ++++++++++++++++++
 .../kep.yaml | 48 +++
 2 files changed, 323 insertions(+)
 create mode 100644 keps/sig-node/2403-pod-resources-allocatable-resources/README.md
 create mode 100644 keps/sig-node/2403-pod-resources-allocatable-resources/kep.yaml

diff --git a/keps/sig-node/2403-pod-resources-allocatable-resources/README.md b/keps/sig-node/2403-pod-resources-allocatable-resources/README.md
new file mode 100644
index 000000000000..d149c959600f
--- /dev/null
+++ b/keps/sig-node/2403-pod-resources-allocatable-resources/README.md
@@ -0,0 +1,275 @@
+title: Extend kubelet pod resource assignment endpoint to return allocatable resources
+
+## Table of Contents
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Topology aware scheduling](#topology-aware-scheduling)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Proposed API](#proposed-api)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Alpha to Beta Graduation](#alpha-to-beta-graduation)
+    - [Beta to G.A Graduation](#beta-to-ga-graduation)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature enablement and rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Alternatives](#alternatives)
+  - [Add v1alpha1 Kubelet GRPC service, at /var/lib/kubelet/pod-resources/kubelet.sock, which returns a list of CreateContainerRequests used to create containers.](#add-v1alpha1-kubelet-grpc-service-at--which-returns-a-list-of-createcontainerrequests-used-to-create-containers)
+  - [Add a field to Pod Status.](#add-a-field-to-pod-status)
+  - [Use the Kubelet Device Manager Checkpoint file](#use-the-kubelet-device-manager-checkpoint-file)
+  - [Add a field to the Pod Spec:](#add-a-field-to-the-pod-spec)
+
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/2403)
+- [X] (R) KEP approvers have approved the KEP status as `implementable`
+- [X] (R) Design details are appropriately documented
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [X] (R) Graduation criteria is in place
+- [X] (R) Production readiness review completed
+- [X] Production readiness review approved
+- [X] "Implementation History" section is up-to-date for milestone
+- ~~ [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] ~~
+- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+This document presents an addition to the kubelet pod resources endpoint (pod resources API) which allows third party consumers to learn about the
+allocatable compute resources (devices and CPUs) of the node and thus, together with the existing pod resources API endpoint, properly evaluate the node capacity.
+
+## Motivation
+
+### Goals
+
+* Deprecate and remove current device-specific knowledge from the kubelet, such as [accelerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
+* Enable external device monitoring agents to provide metrics relevant to Kubernetes
+
+## Proposal
+
+### User Stories
+
+#### Topology aware scheduling
+
+This interface can be used to track allocated resources, together with information about the NUMA topology of the worker node, in a general way.
+It can also be used to determine the available resources on the worker node. The kubelet is the best source of this information because it manages the concrete resource assignments. The information can then be used in NUMA aware scheduling.
+
+
+### Risks and Mitigations
+
+This API is read-only, which removes a large class of risks. The aspects that we consider below are as follows:
+- What are the risks associated with the API service itself?
+- What are the risks associated with the data itself?
+
+| Risk | Impact | Mitigation |
+| ---- | ------ | ---------- |
+| Too many requests could degrade kubelet performance | High | Implement rate limiting and/or passive caching, follow best practices for gRPC resource management. |
+| Improper access to the data | Low | Server is listening on a root owned unix socket. This can be limited with proper pod security policies. |
+
+
+## Design Details
+
+### Proposed API
+
+We propose to extend the existing pod resources gRPC service of the Kubelet, listening on a unix socket at `/var/lib/kubelet/pod-resources/kubelet.sock`.
+
+The gRPC service will expose an additional endpoint:
+- `GetAllocatableResources`, which returns a single `AllocatableResourcesResponse`, enabling monitoring applications to query the set of allocatable resources available on the node.
+
+The extended interface is shown in proto below:
+```protobuf
+// PodResources is a service provided by the kubelet that provides information about the
+// node resources consumed by pods and containers on the node
+service PodResources {
+    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
+    rpc GetAllocatableResources(AllocatableResourcesRequest) returns (AllocatableResourcesResponse) {}
+    rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {}
+}
+
+message AllocatableResourcesRequest {}
+
+// AllocatableResourcesResponse contains information about all the devices known by the kubelet
+message AllocatableResourcesResponse {
+    repeated ContainerDevices devices = 1;
+    repeated int64 cpu_ids = 2;
+}
+
+// ListPodResourcesRequest is the request made to the PodResources service
+message ListPodResourcesRequest {}
+
+// ListPodResourcesResponse is the response returned by List function
+message ListPodResourcesResponse {
+    repeated PodResources pod_resources = 1;
+}
+
+// WatchPodResourcesRequest is the request made to the Watch PodResourcesLister service
+message WatchPodResourcesRequest {}
+
+enum WatchPodAction {
+    ADDED = 0;
+    DELETED = 1;
+}
+
+// WatchPodResourcesResponse is the response returned by Watch function
+message WatchPodResourcesResponse {
+    WatchPodAction action = 1;
+    string uid = 2;
+    repeated PodResources pod_resources = 3;
+}
+
+// PodResources contains information about the node resources assigned to a pod
+message PodResources {
+    string name = 1;
+    string namespace = 2;
+    repeated ContainerResources containers = 3;
+}
+
+// ContainerResources contains information about the resources assigned to a container
+message ContainerResources {
+    string name = 1;
+    repeated ContainerDevices devices = 2;
+    repeated int64 cpu_ids = 3;
+}
+
+// Topology describes hardware topology of the resource
+message TopologyInfo {
+    repeated NUMANode nodes = 1;
+}
+
+// NUMANode is the representation of a NUMA node
+message NUMANode {
+    int64 ID = 1;
+}
+
+// ContainerDevices contains information about the devices assigned to a container
+message ContainerDevices {
+    string resource_name = 1;
+    repeated string device_ids = 2;
+    TopologyInfo topology = 3;
+}
+```
+
+### Test Plan
+
+The implementation PR adds a suite of E2E tests which cover both the existing `List` endpoint of the podresources API and
+the newly proposed `GetAllocatableResources` endpoint.
+
+### Graduation Criteria
+
+#### Alpha
+- [X] Implement the new service API.
+- [X] [Ensure proper e2e node tests are in place](https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=DevicePluginProbe).
+
+#### Alpha to Beta Graduation
+- [X] Demonstrate that the endpoint can be used to replace in-tree GPU device metrics in production environments (NVIDIA, sig-node April 30, 2019).
+
+#### Beta to G.A Graduation
+- [X] Multiple real world examples ([Multus CNI](https://github.com/intel/multus-cni)).
+- [X] Allowing time for feedback (2 years).
+- [X] [Start Deprecation of Accelerator metrics in kubelet](https://github.com/kubernetes/kubernetes/pull/91930).
+- [X] Risks have been addressed.
+
+### Upgrade / Downgrade Strategy
+
+With gRPC the version is part of the service name.
+Both old and new versions should always be served by the kubelet.
+
+For a cluster admin, upgrading to the newest API version means upgrading Kubernetes to a newer version as well as upgrading the monitoring component.
+
+For a vendor, changes in the API should always be backwards compatible.
+
+### Version Skew Strategy
+
+Kubelet will always be backwards compatible, so going forward existing plugins are not expected to break.
+
+## Production Readiness Review Questionnaire
+### Feature enablement and rollback
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [X] Feature gate (also fill in values in `kep.yaml`).
+    - Feature gate name: `KubeletPodResources`.
+    - Components depending on the feature gate: N/A.
+
+* **Does enabling the feature change any default behavior?** No
+* **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?** Yes, through feature gates.
+* **What happens if we reenable the feature if it was previously rolled back?** The service recovers state from kubelet.
+* **Are there any tests for feature enablement/disablement?** No, however no data is created or deleted.
+
+### Rollout, Upgrade and Rollback Planning
+
+* **How can a rollout fail? Can it impact already running workloads?** Kubelet would fail to start. Errors would be caught in the CI.
+* **What specific metrics should inform a rollback?** Not Applicable, metrics wouldn't be available.
+* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?** Not Applicable.
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?** No.
+
+### Monitoring requirements
+* **How can an operator determine if the feature is in use by workloads?**
+  - Look at the `pod_resources_endpoint_requests_total` metric exposed by the kubelet.
+  - Look at hostPath mounts of privileged containers.
+* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
+  - [X] Metrics
+    - Metric name: `pod_resources_endpoint_requests_total`
+    - Components exposing the metric: kubelet
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** N/A or refer to Kubelet SLIs.
+* **Are there any missing metrics that would be useful to have to improve observability of this feature?** No.
+
+
+### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?** Not applicable.
+
+### Scalability
+
+* **Will enabling / using this feature result in any new API calls?** No.
+* **Will enabling / using this feature result in introducing new API types?** No.
+* **Will enabling / using this feature result in any new calls to cloud provider?** No.
+* **Will enabling / using this feature result in increasing size or count of the existing API objects?** No.
+* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?** No. The feature is outside of any existing paths in the kubelet.
+* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?** In 1.18, DDoSing the API can lead to resource exhaustion. It is planned to be addressed as part of G.A.
+The feature only collects data when requests come in, and the data is then garbage collected. Data collected is proportional to the number of pods on the node.
+
+### Troubleshooting
+
+* **How does this feature react if the API server and/or etcd is unavailable?** No effect.
+* **What are other known failure modes?** No known failure modes.
+* **What steps should be taken if SLOs are not being met to determine the problem?** N/A
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+
+## Implementation History
+
+- 2021-02-02: KEP extracted from [previous iteration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments)
+
+## Alternatives
+
+### Add a new endpoint
+* Pros:
+  * No changes to existing APIs
+* Cons:
+  * Requires the client to consume two APIs
+  * This work nicely fits in the boundaries and purpose of the podresources API
+  * The changes proposed in this KEP are very low-risk and backward compatible
diff --git a/keps/sig-node/2403-pod-resources-allocatable-resources/kep.yaml b/keps/sig-node/2403-pod-resources-allocatable-resources/kep.yaml
new file mode 100644
index 000000000000..1348655734ee
--- /dev/null
+++ b/keps/sig-node/2403-pod-resources-allocatable-resources/kep.yaml
@@ -0,0 +1,48 @@
+title: Extend kubelet pod resource assignment endpoint to return allocatable resources
+kep-number: 2403
+authors:
+  - "@dashpole"
+  - "@vikaschoudhary16"
+  - "@renaudwastaken"
+  - "@fromanirh"
+  - "@alexeyperevalov"
+owning-sig: sig-node
+participating-sigs: []
+status: implementable
+creation-date: "2021-02-02"
+reviewers:
+  - "@derekwaynecarr"
+  - "@renaudwastaken"
+approvers:
+  - "@sig-node-leads"
+prr-approvers: []
+see-also:
+  - "keps/sig-node/606-compute-device-assignment/"
+  - "keps/sig-node/2043-pod-resource-concrete-assigments/"
+replaces: []
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: stable
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.21"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  # alpha: "v1.13"
+  # beta: "v1.15"
+  stable: "v1.21"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: "KubeletPodResources"
+    components:
+      - kubelet
+disable-supported: false
+
+# The following PRR answers are required at beta release
+metrics:
+  - pod_resources_endpoint_requests_total
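
The README in this patch says that `GetAllocatableResources`, used alongside the existing `List` endpoint, lets a consumer evaluate node capacity. The snippet below is a minimal sketch of that bookkeeping and is not part of the patch. The dataclasses are hypothetical stand-ins that mirror the field names of the proto messages (they are not the generated gRPC bindings), `free_resources` is a hypothetical helper, and `TopologyInfo` is left out for brevity. A real client would presumably obtain both responses from the kubelet socket instead of building them by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


# Hypothetical stand-ins mirroring the proto messages in the README
# (TopologyInfo is omitted for brevity).
@dataclass
class ContainerDevices:
    resource_name: str
    device_ids: List[str] = field(default_factory=list)


@dataclass
class ContainerResources:
    name: str
    devices: List[ContainerDevices] = field(default_factory=list)
    cpu_ids: List[int] = field(default_factory=list)


@dataclass
class PodResources:
    name: str
    namespace: str
    containers: List[ContainerResources] = field(default_factory=list)


@dataclass
class AllocatableResourcesResponse:
    devices: List[ContainerDevices] = field(default_factory=list)
    cpu_ids: List[int] = field(default_factory=list)


def free_resources(
    allocatable: AllocatableResourcesResponse,
    pods: List[PodResources],
) -> Tuple[List[int], Dict[str, List[str]]]:
    """Subtract what List reports as assigned from what is allocatable."""
    free_cpus = set(allocatable.cpu_ids)
    free_devices: Dict[str, set] = {}
    for dev in allocatable.devices:
        free_devices.setdefault(dev.resource_name, set()).update(dev.device_ids)

    for pod in pods:
        for container in pod.containers:
            free_cpus -= set(container.cpu_ids)
            for dev in container.devices:
                # Resources unknown to the allocatable set are ignored.
                free_devices.get(dev.resource_name, set()).difference_update(
                    dev.device_ids
                )

    return sorted(free_cpus), {k: sorted(v) for k, v in free_devices.items()}


if __name__ == "__main__":
    # Made-up sample data; "example.com/gpu" is an illustrative resource name.
    allocatable = AllocatableResourcesResponse(
        devices=[ContainerDevices("example.com/gpu", ["gpu-0", "gpu-1"])],
        cpu_ids=[0, 1, 2, 3],
    )
    pods = [
        PodResources(
            name="pod-a",
            namespace="default",
            containers=[
                ContainerResources(
                    name="main",
                    devices=[ContainerDevices("example.com/gpu", ["gpu-0"])],
                    cpu_ids=[1, 2],
                )
            ],
        )
    ]
    print(free_resources(allocatable, pods))
```

Keeping the subtraction on the client side matches the read-only nature of the API described in the README: the kubelet only reports what is allocatable and what is assigned, and the consumer derives what is still free.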