Commit 6e2ce8b

Merge pull request kubernetes#1680 from ConnorDoyle/numa-manager
Add Topology Manager proposal.
2 parents e19cf5b + dc496c4 commit 6e2ce8b

node/topology-manager.md

# Node Topology Manager

_Authors:_

* @ConnorDoyle - Connor Doyle <[email protected]>
* @balajismaniam - Balaji Subramaniam <[email protected]>
* @lmdaly - Louise M. Daly <[email protected]>

**Contents:**

* [Overview](#overview)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
  * [User Stories](#user-stories)
* [Proposal](#proposal)
  * [Proposed Changes](#proposed-changes)
    * [New Component: Topology Manager](#new-component-topology-manager)
      * [Computing Preferred Affinity](#computing-preferred-affinity)
      * [New Interfaces](#new-interfaces)
    * [Changes to Existing Components](#changes-to-existing-components)
* [Graduation Criteria](#graduation-criteria)
  * [Phase 1: Alpha (target v1.13)](#phase-1-alpha-target-v113)
  * [Phase 2: Beta (later versions)](#phase-2-beta-later-versions)
  * [GA (stable)](#ga-stable)
* [Challenges](#challenges)
* [Limitations](#limitations)
* [Alternatives](#alternatives)
* [References](#references)

# Overview

An increasing number of systems leverage a combination of CPUs and
hardware accelerators to support latency-critical execution and
high-throughput parallel computation. These include workloads in fields
such as telecommunications, scientific computing, machine learning,
financial services and data analytics. Such hybrid systems comprise a
high performance environment.

In order to extract the best performance, optimizations related to CPU
isolation and memory and device locality are required. However, in
Kubernetes, these optimizations are handled by a disjoint set of
components.

This proposal provides a mechanism to coordinate fine-grained hardware
resource assignments for different components in Kubernetes.

# Motivation

Multiple components in the Kubelet make decisions about system
topology-related assignments:

- CPU manager
  - The CPU manager makes decisions about the set of CPUs a container is
    allowed to run on. The only implemented policy as of v1.8 is the static
    one, which does not change assignments for the lifetime of a container.
- Device manager
  - The device manager makes concrete device assignments to satisfy
    container resource requirements. Generally devices are attached to one
    peripheral interconnect. If the device manager and the CPU manager are
    misaligned, all communication between the CPU and the device can incur
    an additional hop over the processor interconnect fabric.
- Container Network Interface (CNI)
  - NICs including SR-IOV Virtual Functions have affinity to one socket,
    with measurable performance ramifications.

*Related Issues:*

- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964]
- [Discover nodes with NUMA architecture][nfd-issue-84]
- [Support VF interrupt binding to specified CPU][sriov-issue-10]
- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]

Note that all of these concerns pertain only to multi-socket systems. Correct
behavior requires that the kernel receive accurate topology information from
the underlying hardware (typically via the ACPI SLIT). See sections 5.2.16
and 5.2.17 of the
[ACPI Specification](http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf) for more
information.

## Goals

- Arbitrate preferred socket affinity for containers based on input from
  CPU Manager and Device Manager.
- Provide an internal interface and pattern to integrate additional
  topology-aware Kubelet components.

## Non-Goals

- _Inter-device connectivity:_ Decide device assignments based on direct
  device interconnects. This issue can be separated from socket
  locality. Inter-device topology can be considered entirely within the
  scope of the Device Manager, after which it can emit possible
  socket affinities. The policy to reach that decision can start simple
  and iterate to include support for arbitrary inter-device graphs.
- _HugePages:_ This proposal assumes that pre-allocated HugePages are
  spread among the available memory nodes in the system. We further assume
  the operating system provides best-effort local page allocation for
  containers (as long as sufficient HugePages are free on the local memory
  node).
- _CNI:_ Changing the Container Networking Interface is out of scope for
  this proposal. However, this design should be extensible enough to
  accommodate network interface locality if the CNI adds support in the
  future. This limitation is potentially mitigated by the possibility of
  using the device plugin API as a stopgap solution for specialized
  networking requirements.

## User Stories

*Story 1: Fast virtualized network functions*

A user asks for a "fast network" and automatically gets all the various
pieces coordinated (hugepages, cpusets, network device) and co-located on a
socket.

*Story 2: Accelerated neural network training*

A user asks for an accelerator device and some number of exclusive CPUs,
and gets the best training performance because the assigned CPUs and
devices are aligned to the same socket.

# Proposal

*Main idea: Two-phase topology coherence protocol*

Topology affinity is tracked at the container level, similar to devices and
CPU affinity. At pod admission time, a new component called the Topology
Manager collects possible configurations from the Device Manager and the
CPU Manager. The Topology Manager then acts as an oracle for local alignment
for those same components when they make concrete resource allocations. We
expect the consulted components to use the inferred QoS class of each
pod to decide how much priority to give to optimal locality.

## Proposed Changes

### New Component: Topology Manager

This proposal is focused on a new component in the Kubelet called the
Topology Manager. The Topology Manager implements the pod admit handler
interface and participates in Kubelet pod admission. When the `Admit()`
function is called, the Topology Manager collects topology hints from other
Kubelet components.

If the hints are not compatible, the Topology Manager may choose to
reject the pod. Behavior in this case is controlled by a new Kubelet
configuration value that selects the topology policy. The Topology Manager
supports two modes: `strict` and `preferred` (default). In `strict` mode, the
pod is rejected if alignment cannot be satisfied. The Topology Manager could
use `softAdmitHandler` to keep the pod in `Pending` state.

The Topology Manager component will be disabled behind a feature gate until
graduation from alpha to beta.

#### Computing Preferred Affinity

A topology hint indicates a preference for some well-known local resources.
Initially, the only supported reference resource is a mask of CPU socket IDs.
Each hint provider returns a list of acceptable masks. After collecting hints
from all providers, the Topology Manager chooses a mask that is compatible
with every list. Here is a sketch:

1. Apply a partial order on each list: number of bits set in the
   mask, ascending. This biases the result to be more precise if
   possible.
1. Iterate over the combinations that take one mask from each preference
   list and compute bitwise-and over the masks in each combination.
1. Store the first non-empty result and break out early.
1. If no non-empty result exists, return an error.

The behavior when a match does not exist is configurable, as described
above.
172+
173+
#### New Interfaces
174+
175+
```go
176+
package topologymanager
177+
178+
// TopologyManager helps to coordinate local resource alignment
179+
// within the Kubelet.
180+
type Manager interface {
181+
lifecycle.PodAdmitHandler
182+
Store
183+
AddHintProvider(HintProvider)
184+
RemovePod(podName string)
185+
}
186+
187+
// SocketMask is a bitmask-like type denoting a subset of available sockets.
188+
type SocketMask struct{} // TBD
189+
190+
// TopologyHints encodes locality to local resources.
191+
type TopologyHints struct {
192+
Sockets []SocketMask
193+
}
194+
195+
// HintStore manages state related to the Topology Manager.
196+
type Store interface {
197+
// GetAffinity returns the preferred affinity for the supplied
198+
// pod and container.
199+
GetAffinity(podName string, containerName string) TopologyHints
200+
}
201+
202+
// HintProvider is implemented by Kubelet components that make
203+
// topology-related resource assignments. The Topology Manager consults each
204+
// hint provider at pod admission time.
205+
type HintProvider interface {
206+
// Returns hints if this hint provider has a preference; otherwise
207+
// returns `_, false` to indicate "don't care".
208+
GetTopologyHints(pod v1.Pod, containerName string) (TopologyHints, bool)
209+
}
210+
```
211+
212+
_Listing: Topology Manager and related interfaces (sketch)._

![topology-manager-components](https://user-images.githubusercontent.com/379372/47447523-8efd2b00-d772-11e8-924d-eea5a5e00037.png)

_Figure: Topology Manager components._

![topology-manager-instantiation](https://user-images.githubusercontent.com/379372/47447526-945a7580-d772-11e8-9761-5213d745e852.png)

_Figure: Topology Manager instantiation and inclusion in pod admit lifecycle._

### Changes to Existing Components

1. Kubelet consults Topology Manager for pod admission (discussed above).
1. Add two implementations of the Topology Manager interface and a feature gate.
    1. As much Topology Manager functionality as possible is stubbed when the
       feature gate is disabled.
    1. Add a functional Topology Manager that queries hint providers in order
       to compute a preferred socket mask for each container.
1. Add `GetTopologyHints()` method to CPU Manager.
    1. CPU Manager static policy calls `GetAffinity()` method of
       Topology Manager when deciding CPU affinity.
1. Add `GetTopologyHints()` method to Device Manager.
    1. Add Socket ID to Device structure in the device plugin
       interface. Plugins should be able to determine the socket
       when enumerating supported devices. See the protocol diff below.
    1. Device Manager calls `GetAffinity()` method of Topology Manager when
       deciding device allocation.

```diff
diff --git a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
index efbd72c133..f86a1a5512 100644
--- a/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
+++ b/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
@@ -73,6 +73,10 @@ message ListAndWatchResponse {
 	repeated Device devices = 1;
 }
 
+message TopologyInfo {
+	optional int32 socketID = 1 [default = -1];
+}
+
 /* E.g:
 * struct Device {
 *    ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e",
@@ -85,6 +89,8 @@ message Device {
 	string ID = 1;
 	// Health of the device, can be healthy or unhealthy, see constants.go
 	string health = 2;
+	// Topology details of the device (optional.)
+	optional TopologyInfo topology = 3;
 }
```

_Listing: Amended device plugin gRPC protocol._

![topology-manager-wiring](https://user-images.githubusercontent.com/379372/47447533-9a505680-d772-11e8-95ca-ef9a8290a46a.png)

_Figure: Topology Manager hint provider registration._

![topology-manager-hints](https://user-images.githubusercontent.com/379372/47447543-a0463780-d772-11e8-8412-8bf4a0571513.png)

_Figure: Topology Manager fetches affinity from hint providers._

# Graduation Criteria

## Phase 1: Alpha (target v1.13)

* Feature gate is disabled by default.
* Alpha-level documentation.
* Unit test coverage.
* CPU Manager allocation policy takes topology hints into account.
* Device plugin interface includes socket ID.
* Device Manager allocation policy takes topology hints into account.

## Phase 2: Beta (later versions)

* Feature gate is enabled by default.
* Beta-level documentation.
* Node e2e tests.
* Support hugepages alignment.
* User feedback.

## GA (stable)

* *TBD*

298+
# Challenges
299+
300+
* Testing the Topology Manager in a continuous integration environment
301+
depends on cloud infrastructure to expose multi-node topologies
302+
to guest virtual machines.
303+
* Implementing the `GetHints()` interface may prove challenging.
304+
305+
# Limitations
306+
307+
* *TBD*
308+
309+
# Alternatives
310+
311+
* [AutoNUMA][numa-challenges]: This kernel feature affects memory
312+
allocation and thread scheduling, but does not address device locality.
313+
314+
# References
315+
316+
* *TBD*
317+
318+
[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964
319+
[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84
320+
[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10
321+
[proposal-affinity]: https://github.com/kubernetes/community/pull/171
322+
[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078
