Commit 098315d

Merge pull request #781 from lmdaly/kep-topology-manager
Topology Manager KEP (Moving to new repo)
2 parents b2ed057 + 0837479 commit 098315d

1 file changed, +349 -0 lines changed

@@ -0,0 +1,349 @@
---
kep-number: 35
title: Node Topology Manager
authors:
  - "@ConnorDoyle"
  - "@balajismaniam"
  - "@lmdaly"
owning-sig: sig-node
participating-sigs:
  - sig-node
reviewers:
  - "@vikasc"
  - "@derekwaynecarr"
  - "@jeremyeder"
  - "@RenaudWasTaken"
approvers:
  - "@dawnchen"
  - "@derekwaynecarr"
editor: Louise Daly
creation-date: 2019-01-30
last-updated: 2019-01-30
status: implementable
see-also:
replaces:
superseded-by:
---

# Node Topology Manager

_Authors:_

* @ConnorDoyle - Connor Doyle
* @balajismaniam - Balaji Subramaniam
* @lmdaly - Louise M. Daly

**Contents:**

* [Overview](#overview)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
  * [User Stories](#user-stories)
* [Proposal](#proposal)
  * [Proposed Changes](#proposed-changes)
    * [New Component: Topology Manager](#new-component-topology-manager)
      * [Computing Preferred Affinity](#computing-preferred-affinity)
      * [New Interfaces](#new-interfaces)
    * [Changes to Existing Components](#changes-to-existing-components)
* [Graduation Criteria](#graduation-criteria)
  * [Phase 1: Alpha (target v1.13)](#phase-1-alpha-target-v113)
  * [Phase 2: Beta (later versions)](#phase-2-beta-later-versions)
  * [GA (stable)](#ga-stable)
* [Challenges](#challenges)
* [Limitations](#limitations)
* [Alternatives](#alternatives)
* [References](#references)

# Overview

An increasing number of systems leverage a combination of CPUs and
hardware accelerators to support latency-critical execution and
high-throughput parallel computation. These include workloads in fields
such as telecommunications, scientific computing, machine learning,
financial services and data analytics. Such hybrid systems comprise a
high performance environment.

In order to extract the best performance, optimizations related to CPU
isolation and memory and device locality are required. However, in
Kubernetes, these optimizations are handled by a disjoint set of
components.

This proposal provides a mechanism to coordinate fine-grained hardware
resource assignments for different components in Kubernetes.

# Motivation

Multiple components in the Kubelet make decisions about system
topology-related assignments:

- CPU Manager
  - The CPU Manager makes decisions about the set of CPUs a container is
    allowed to run on. The only implemented policy as of v1.8 is the static
    one, which does not change assignments for the lifetime of a container.
- Device Manager
  - The Device Manager makes concrete device assignments to satisfy
    container resource requirements. Generally devices are attached to one
    peripheral interconnect. If the Device Manager and the CPU Manager are
    misaligned, all communication between the CPU and the device can incur
    an additional hop over the processor interconnect fabric.
- Container Network Interface (CNI)
  - NICs including SR-IOV Virtual Functions have affinity to one socket,
    with measurable performance ramifications.

*Related Issues:*

- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964]
- [Discover nodes with NUMA architecture][nfd-issue-84]
- [Support VF interrupt binding to specified CPU][sriov-issue-10]
- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]

Note that all of these concerns pertain only to multi-socket systems. Correct
behavior requires that the kernel receive accurate topology information from
the underlying hardware (typically via the SLIT table). See sections 5.2.16
and 5.2.17 of the
[ACPI Specification](http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf) for more
information.

## Goals

- Arbitrate preferred socket affinity for containers based on input from
  CPU Manager and Device Manager.
- Provide an internal interface and pattern to integrate additional
  topology-aware Kubelet components.

## Non-Goals

- _Inter-device connectivity:_ Decide device assignments based on direct
  device interconnects. This issue can be separated from socket
  locality. Inter-device topology can be considered entirely within the
  scope of the Device Manager, after which it can emit possible
  socket affinities. The policy to reach that decision can start simple
  and iterate to include support for arbitrary inter-device graphs.
- _HugePages:_ This proposal assumes that pre-allocated HugePages are
  spread among the available memory nodes in the system. We further assume
  the operating system provides best-effort local page allocation for
  containers (as long as sufficient HugePages are free on the local memory
  node).
- _CNI:_ Changing the Container Network Interface is out of scope for
  this proposal. However, this design should be extensible enough to
  accommodate network interface locality if the CNI adds support in the
  future. This limitation is potentially mitigated by the possibility to
  use the device plugin API as a stopgap solution for specialized
  networking requirements.

## User Stories

*Story 1: Fast virtualized network functions*

A user asks for a "fast network" and automatically gets all the required
pieces (hugepages, cpusets, network device) coordinated and co-located on
one socket.

*Story 2: Accelerated neural network training*

A user asks for an accelerator device and some number of exclusive CPUs
and gets the best training performance because the assigned CPUs and
devices are aligned to the same socket.

# Proposal

*Main idea: Two-phase topology coherence protocol*

Topology affinity is tracked at the container level, similar to devices and
CPU affinity. At pod admission time, a new component called the Topology
Manager collects possible configurations from the Device Manager and the
CPU Manager. The Topology Manager then acts as an oracle for local alignment,
consulted by those same components when they make concrete resource
allocations. We expect the consulted components to use the inferred QoS class
of each pod to decide how strongly to prioritize optimal locality.
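
The two phases can be pictured with a small, self-contained Go sketch. Everything in it is hypothetical and simplified for illustration: `hintProvider`, `staticProvider`, `manager`, `admit` and `getAffinity` are stand-in names rather than the proposed Kubelet API, `SocketMask` is assumed to be a 64-bit bitmask, and the merge step is deliberately cruder than the selection procedure described under Computing Preferred Affinity (it intersects the union of each provider's masks).

```go
package main

import "fmt"

// SocketMask is an assumed representation: bit i set means socket i is acceptable.
type SocketMask uint64

// hintProvider is a simplified, hypothetical stand-in for a Kubelet
// component such as the CPU Manager or the Device Manager.
type hintProvider interface {
	// hints returns acceptable socket masks, or false for "don't care".
	hints(container string) ([]SocketMask, bool)
}

// staticProvider always answers with the same masks (illustration only).
type staticProvider struct {
	masks []SocketMask
}

func (p staticProvider) hints(string) ([]SocketMask, bool) {
	return p.masks, len(p.masks) > 0
}

// manager records the affinity chosen at admission (phase one) and
// serves it back to the providers at allocation time (phase two).
type manager struct {
	providers []hintProvider
	affinity  map[string]SocketMask
}

// admit is phase one: collect hints, merge them, remember the result.
// It returns false when the providers have no socket in common.
func (m *manager) admit(container string) bool {
	merged := ^SocketMask(0) // start with "any socket"
	for _, p := range m.providers {
		masks, ok := p.hints(container)
		if !ok {
			continue // provider has no preference
		}
		var union SocketMask
		for _, mask := range masks {
			union |= mask
		}
		merged &= union
	}
	if merged == 0 {
		return false
	}
	m.affinity[container] = merged
	return true
}

// getAffinity is phase two: a provider asks which sockets were agreed on.
func (m *manager) getAffinity(container string) (SocketMask, bool) {
	mask, ok := m.affinity[container]
	return mask, ok
}

func main() {
	m := &manager{
		providers: []hintProvider{
			staticProvider{masks: []SocketMask{0b01, 0b10}}, // e.g. CPU Manager
			staticProvider{masks: []SocketMask{0b10}},       // e.g. Device Manager
		},
		affinity: map[string]SocketMask{},
	}
	if m.admit("trainer") {
		mask, _ := m.getAffinity("trainer")
		fmt.Printf("affinity mask: %02b\n", mask)
	}
}
```

In this sketch both stand-in providers can use socket 1, so the stored affinity is the mask with only bit 1 set; a real provider would read that mask back before making its concrete allocation.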

## Proposed Changes

### New Component: Topology Manager

This proposal is focused on a new component in the Kubelet called the
Topology Manager. The Topology Manager implements the pod admit handler
interface and participates in Kubelet pod admission. When the `Admit()`
function is called, the Topology Manager collects topology hints from other
Kubelet components.

If the hints are not compatible, the Topology Manager may choose to
reject the pod. Behavior in this case depends on a new Kubelet configuration
value to choose the topology policy. The Topology Manager supports two
modes: `strict` and `preferred` (default). In `strict` mode, the pod is
rejected if alignment cannot be satisfied. The Topology Manager could
use `softAdmitHandler` to keep the pod in `Pending` state.
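
As a rough illustration of the two modes, the admission decision could look like the following Go sketch. The `Policy` type, its constants and the `admit` helper are hypothetical names. The sketch also assumes that `preferred` mode admits the pod even when no aligned assignment exists, which the text above implies but does not spell out.

```go
package main

import "fmt"

// Policy names mirror the two modes described above; the type itself
// is a hypothetical illustration, not a proposed Kubelet type.
type Policy string

const (
	PolicyStrict    Policy = "strict"
	PolicyPreferred Policy = "preferred" // default mode
)

// admit reports whether a pod would be admitted, given the policy and
// whether the collected topology hints could be aligned.
func admit(policy Policy, aligned bool) (bool, string) {
	if aligned {
		return true, "hints aligned"
	}
	switch policy {
	case PolicyStrict:
		return false, "topology alignment cannot be satisfied"
	default:
		// Assumption: preferred mode is best effort and still admits.
		return true, "admitted without alignment"
	}
}

func main() {
	for _, p := range []Policy{PolicyStrict, PolicyPreferred} {
		ok, reason := admit(p, false)
		fmt.Printf("%s: admit=%t (%s)\n", p, ok, reason)
	}
}
```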

The Topology Manager component will be disabled behind a feature gate until
graduation from alpha to beta.

#### Computing Preferred Affinity

A topology hint indicates a preference for some well-known local resources.
Initially, the only supported reference resource is a mask of CPU socket IDs.
After collecting hints from all providers, the Topology Manager chooses a
mask that is covered by some entry in every list. Here is a sketch:

1. Apply a partial order on each list: number of bits set in the
   mask, ascending. This biases the result to be more precise if
   possible.
1. Iterate over the combinations that take one mask from each preference
   list and compute the bitwise AND of the masks in each combination.
1. Store the first non-empty result and break out early.
1. If no non-empty result exists, return an error.

The behavior when a match does not exist is configurable, as described
above.
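
The steps above could be sketched in Go roughly as follows. This is a minimal illustration, not the Kubelet implementation: `SocketMask` is assumed to be a 64-bit bitmask (the proposal leaves its representation TBD), `preferredAffinity`, `sortByPrecision` and `errNoAffinity` are hypothetical names, and the iteration is read as visiting every combination of one mask per provider list. Providers that answer "don't care" are assumed to be left out by the caller.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
	"sort"
)

// SocketMask is an assumed representation: bit i set means socket i is acceptable.
type SocketMask uint64

// errNoAffinity is returned when no combination of hints overlaps.
var errNoAffinity = errors.New("no common socket affinity")

// sortByPrecision returns a copy ordered by number of bits set,
// ascending, so narrower masks are tried first (step 1).
func sortByPrecision(masks []SocketMask) []SocketMask {
	sorted := append([]SocketMask(nil), masks...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return bits.OnesCount64(uint64(sorted[i])) < bits.OnesCount64(uint64(sorted[j]))
	})
	return sorted
}

// preferredAffinity picks one mask from each provider's list, ANDs them,
// and returns the first non-empty result (steps 2 to 4).
func preferredAffinity(hints [][]SocketMask) (SocketMask, error) {
	if len(hints) == 0 {
		return 0, errNoAffinity
	}
	lists := make([][]SocketMask, len(hints))
	for i, h := range hints {
		if len(h) == 0 {
			return 0, errNoAffinity
		}
		lists[i] = sortByPrecision(h)
	}
	idx := make([]int, len(lists)) // one cursor per list, advanced like an odometer
	for {
		merged := ^SocketMask(0)
		for i, l := range lists {
			merged &= l[idx[i]]
		}
		if merged != 0 {
			return merged, nil // first non-empty result: stop early
		}
		pos := len(idx) - 1
		for pos >= 0 {
			idx[pos]++
			if idx[pos] < len(lists[pos]) {
				break
			}
			idx[pos] = 0
			pos--
		}
		if pos < 0 {
			return 0, errNoAffinity // every combination was empty
		}
	}
}

func main() {
	cpu := []SocketMask{0b11, 0b01, 0b10} // CPUs available on either socket
	dev := []SocketMask{0b10}             // device attached to socket 1
	mask, err := preferredAffinity([][]SocketMask{cpu, dev})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("preferred mask: %02b\n", mask)
}
```

Note that the first non-empty intersection is not guaranteed to be the narrowest one overall; the ordering in step 1 only biases the search toward precise masks.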

#### New Interfaces

```go
package topologymanager

import (
	"k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// Manager helps to coordinate local resource alignment
// within the Kubelet.
type Manager interface {
	lifecycle.PodAdmitHandler
	Store
	AddHintProvider(HintProvider)
	RemovePod(podName string)
}

// SocketMask is a bitmask-like type denoting a subset of available sockets.
type SocketMask struct{} // TBD

// TopologyHints encodes locality to local resources.
type TopologyHints struct {
	Sockets []SocketMask
}

// Store manages state related to the Topology Manager.
type Store interface {
	// GetAffinity returns the preferred affinity for the supplied
	// pod and container.
	GetAffinity(podName string, containerName string) TopologyHints
}

// HintProvider is implemented by Kubelet components that make
// topology-related resource assignments. The Topology Manager consults each
// hint provider at pod admission time.
type HintProvider interface {
	// GetTopologyHints returns hints if this hint provider has a preference;
	// otherwise it returns `_, false` to indicate "don't care".
	GetTopologyHints(pod v1.Pod, containerName string) (TopologyHints, bool)
}
```

_Listing: Topology Manager and related interfaces (sketch)._
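
Since `SocketMask` is left TBD in the listing, one possible representation is a plain 64-bit bitmask. The Go sketch below is only an assumption made for illustration: the constructor `NewSocketMask` and the methods `And`, `IsEmpty`, `Count` and `Sockets` are hypothetical names, and the type is limited to socket IDs 0 through 63.

```go
package main

import (
	"fmt"
	"math/bits"
)

// SocketMask is an assumed representation: bit i set means socket i is
// part of the subset. The proposal itself leaves this type TBD.
type SocketMask uint64

// NewSocketMask builds a mask from socket IDs; IDs outside 0..63 are ignored.
func NewSocketMask(sockets ...int) SocketMask {
	var m SocketMask
	for _, s := range sockets {
		if s < 0 || s >= 64 {
			continue
		}
		m |= 1 << uint(s)
	}
	return m
}

// And returns the sockets present in both masks.
func (m SocketMask) And(other SocketMask) SocketMask { return m & other }

// IsEmpty reports whether no socket is set.
func (m SocketMask) IsEmpty() bool { return m == 0 }

// Count returns the number of sockets in the mask.
func (m SocketMask) Count() int { return bits.OnesCount64(uint64(m)) }

// Sockets lists the socket IDs in ascending order.
func (m SocketMask) Sockets() []int {
	var ids []int
	for s := 0; s < 64; s++ {
		if m&(1<<uint(s)) != 0 {
			ids = append(ids, s)
		}
	}
	return ids
}

func main() {
	cpu := NewSocketMask(0, 1) // CPUs free on both sockets
	gpu := NewSocketMask(1)    // device attached to socket 1
	fmt.Println(cpu.And(gpu).Sockets())
}
```

A fixed-width integer keeps intersection and bit counting cheap; a design that must support more than 64 sockets would need a different backing type.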

![topology-manager-components](https://user-images.githubusercontent.com/379372/47447523-8efd2b00-d772-11e8-924d-eea5a5e00037.png)

_Figure: Topology Manager components._

![topology-manager-instantiation](https://user-images.githubusercontent.com/379372/47447526-945a7580-d772-11e8-9761-5213d745e852.png)

_Figure: Topology Manager instantiation and inclusion in pod admit lifecycle._

### Changes to Existing Components

1. Kubelet consults Topology Manager for pod admission (discussed above).
1. Add two implementations of Topology Manager interface and a feature gate.
    1. As much Topology Manager functionality as possible is stubbed when the
       feature gate is disabled.
    1. Add a functional Topology Manager that queries hint providers in order
       to compute a preferred socket mask for each container.
1. Add `GetTopologyHints()` method to CPU Manager.
    1. CPU Manager static policy calls `GetAffinity()` method of
       Topology Manager when deciding CPU affinity.
1. Add `GetTopologyHints()` method to Device Manager.
    1. Add Socket ID to Device structure in the device plugin
       interface. Plugins should be able to determine the socket
       when enumerating supported devices. See the protocol diff below.
    1. Device Manager calls `GetAffinity()` method of Topology Manager when
       deciding device allocation.

```diff
diff --git a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
index efbd72c133..f86a1a5512 100644
--- a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
+++ b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
@@ -73,6 +73,10 @@ message ListAndWatchResponse {
     repeated Device devices = 1;
 }
 
+message TopologyInfo {
+    optional int32 socketID = 1 [default = -1];
+}
+
 /* E.g:
  * struct Device {
  *    ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e",
@@ -85,6 +89,8 @@ message Device {
     string ID = 1;
     // Health of the device, can be healthy or unhealthy, see constants.go
     string health = 2;
+    // Topology details of the device (optional.)
+    optional TopologyInfo topology = 3;
 }
```

_Listing: Amended device plugin gRPC protocol._
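
With a socket ID reported for each device, the Device Manager could derive its hints from the device inventory. The following Go sketch shows one way this might work, under several assumptions: `Device` is a local stand-in for the generated protobuf type, `hintsForRequest` is a hypothetical helper, only single-socket hints are produced, and devices that report an unknown socket (`-1`) are ignored.

```go
package main

import (
	"fmt"
	"sort"
)

// Device is a local stand-in for the amended protobuf message;
// SocketID is -1 when the plugin does not know the socket.
type Device struct {
	ID       string
	SocketID int32
}

// SocketMask is an assumed representation: bit i set means socket i.
type SocketMask uint64

// hintsForRequest returns one single-socket mask for every socket that
// holds at least count devices, in ascending socket order.
func hintsForRequest(devices []Device, count int) []SocketMask {
	perSocket := map[int32]int{}
	for _, d := range devices {
		if d.SocketID < 0 || d.SocketID >= 64 {
			continue // unknown or unrepresentable socket
		}
		perSocket[d.SocketID]++
	}
	var hints []SocketMask
	for socket, n := range perSocket {
		if n >= count {
			hints = append(hints, SocketMask(1)<<uint(socket))
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Slice(hints, func(i, j int) bool { return hints[i] < hints[j] })
	return hints
}

func main() {
	devices := []Device{
		{ID: "gpu-0", SocketID: 0},
		{ID: "gpu-1", SocketID: 1},
		{ID: "gpu-2", SocketID: 1},
		{ID: "nic-0", SocketID: -1},
	}
	// Two devices on one socket are only available on socket 1.
	fmt.Printf("%b\n", hintsForRequest(devices, 2))
}
```

A fuller policy might also emit multi-socket masks as a fallback when no single socket can satisfy the request; that choice is outside what the proposal specifies.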

![topology-manager-wiring](https://user-images.githubusercontent.com/379372/47447533-9a505680-d772-11e8-95ca-ef9a8290a46a.png)

_Figure: Topology Manager hint provider registration._

![topology-manager-hints](https://user-images.githubusercontent.com/379372/47447543-a0463780-d772-11e8-8412-8bf4a0571513.png)

_Figure: Topology Manager fetches affinity from hint providers._

# Graduation Criteria

## Phase 1: Alpha (target v1.13)

* Feature gate is disabled by default.
* Alpha-level documentation.
* Unit test coverage.
* Node e2e tests.
* CPU Manager allocation policy takes topology hints into account.
* Device plugin interface includes socket ID.
* Device Manager allocation policy takes topology hints into account.

## Phase 2: Beta (later versions)

* Feature gate is enabled by default.
317+
* Alpha-level documentation.
318+
* Support hugepages alignment.
319+
* User feedback.
320+
321+
## GA (stable)
322+
323+
* *TBD*
324+
325+
# Challenges
326+
327+
* Testing the Topology Manager in a continuous integration environment
328+
depends on cloud infrastructure to expose multi-node topologies
329+
to guest virtual machines.
330+
* Implementing the `GetHints()` interface may prove challenging.

# Limitations

* *TBD*

# Alternatives

* [AutoNUMA][numa-challenges]: This kernel feature affects memory
  allocation and thread scheduling, but does not address device locality.

# References

* *TBD*

[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964
[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84
[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10
[proposal-affinity]: https://github.com/kubernetes/community/pull/171
[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078