Added documentation to support Topology Manager feature in Kubelet.
	* Added new document outlining feature
        * Updated feature-gates.md to include feature gate for feature
        * Update kubelet.md to include kubelet flags for feature
        * Added Topology Manager reference to relevant pages

Co-authored-by: Tim Bannister <[email protected]>
lmdaly and sftim committed Aug 30, 2019
1 parent aa4b72c commit f8d1fba
Showing 7 changed files with 185 additions and 2 deletions.
11 changes: 11 additions & 0 deletions content/en/docs/concepts/architecture/nodes.md
Original file line number Diff line number Diff line change
@@ -291,6 +291,13 @@ includes all containers started by the kubelet, but not containers started direc
If you want to explicitly reserve resources for non-Pod processes, follow this tutorial to
[reserve resources for system daemons](/docs/tasks/administer-cluster/reserve-compute-resources/#system-reserved).

## Node topology

{{< feature-state state="alpha" >}}

If you have enabled the `TopologyManager`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/), then
the kubelet can use topology hints when making resource assignment decisions.

## API Object

@@ -299,3 +306,7 @@ API object can be found at:
[Node API object](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#node-v1-core).

{{% /capture %}}
{{% capture whatsnext %}}
* Read about [node components](https://kubernetes.io/docs/concepts/overview/components/#node-components)
* Read about node-level topology: [Control Topology Management Policies on a node](/docs/tasks/administer-cluster/topology-manager/)
{{% /capture %}}
6 changes: 6 additions & 0 deletions content/en/docs/concepts/configuration/assign-pod-node.md
@@ -397,4 +397,10 @@ The design documents for
[node affinity](https://git.k8s.io/community/contributors/design-proposals/scheduling/nodeaffinity.md)
and for [inter-pod affinity/anti-affinity](https://git.k8s.io/community/contributors/design-proposals/scheduling/podaffinity.md) contain extra background information about these features.

Once a Pod is assigned to a Node, the kubelet runs the Pod and allocates node-local resources.
The [topology manager](/docs/tasks/administer-cluster/topology-manager/) can take part in node-level
resource allocation decisions.

[Topology Manager](/docs/concepts/architecture/nodes/#node-topology) feature state.

{{% /capture %}}
3 changes: 2 additions & 1 deletion content/en/docs/concepts/scheduling/kube-scheduler.md
@@ -182,5 +182,6 @@ kube-scheduler has a default set of scheduling policies.
{{% capture whatsnext %}}
* Read about [scheduler performance tuning](/docs/concepts/scheduling/scheduler-perf-tuning/)
* Read the [reference documentation](/docs/reference/command-line-tools-reference/kube-scheduler/) for kube-scheduler
* Learn about [configuring multiple schedulers](https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/)
* Learn about [configuring multiple schedulers](/docs/tasks/administer-cluster/configure-multiple-schedulers/)
* Learn about [topology management policies](/docs/tasks/administer-cluster/topology-manager/)
{{% /capture %}}
@@ -174,6 +174,7 @@ different Kubernetes components.
| `TokenRequestProjection` | `false` | Alpha | 1.11 | 1.11 |
| `TokenRequestProjection` | `true` | Beta | 1.12 | |
| `TTLAfterFinished` | `false` | Alpha | 1.12 | |
| `TopologyManager` | `false` | Alpha | 1.16 | |
| `VolumePVCDataSource` | `false` | Alpha | 1.15 | |
| `VolumeScheduling` | `false` | Alpha | 1.9 | 1.9 |
| `VolumeScheduling` | `true` | Beta | 1.10 | 1.12 |
@@ -548,7 +548,7 @@ kubelet [flags]
<td colspan="2">--feature-gates mapStringBool</td>
</tr>
<tr>
<td></td><td style="line-height: 130%; word-wrap: break-word;">A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:<br/>APIListChunking=true|false (BETA - default=true)<br/>APIResponseCompression=true|false (ALPHA - default=false)<br/>Accelerators=true|false<br/>AdvancedAuditing=true|false (BETA - default=true)<br/>AllAlpha=true|false (ALPHA - default=false)<br/>AllowExtTrafficLocalEndpoints=true|false<br/>AppArmor=true|false (BETA - default=true)<br/>BlockVolume=true|false (ALPHA - default=false)<br/>CPUManager=true|false (BETA - default=true)<br/>CSIPersistentVolume=true|false (ALPHA - default=false)<br/>CustomPodDNS=true|false (ALPHA - default=false)<br/>CustomResourceValidation=true|false (BETA - default=true)<br/>DebugContainers=true|false <br/>DevicePlugins=true|false (ALPHA - default=false)<br/>DynamicKubeletConfig=true|false (ALPHA - default=false)<br/>EnableEquivalenceClassCache=true|false (ALPHA - default=false)<br/>ExpandPersistentVolumes=true|false (ALPHA - default=false)<br/>ExperimentalCriticalPodAnnotation=true|false (ALPHA - default=false)<br/>ExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false)<br/>HugePages=true|false (ALPHA - default=false)<br/>Initializers=true|false (ALPHA - default=false)<br/>KubeletConfigFile=true|false (ALPHA - default=false)<br/>LocalStorageCapacityIsolation=true|false (ALPHA - default=false)<br/>LocalStorageCapacityIsolationFSQuotaMonitoring=true|false (ALPHA - default=false)<br/>MountContainers=true|false (ALPHA - default=false)<br/>MountPropagation=true|false (ALPHA - default=false)<br/>PVCProtection=true|false (ALPHA - default=false)<br/>PersistentLocalVolumes=true|false (ALPHA - default=false)<br/>PodPriority=true|false (ALPHA - default=false)<br/>ReadOnlyAPIDataVolumes=true|false<br/>ResourceLimitsPriorityFunction=true|false (ALPHA - default=false)<br/>RotateKubeletClientCertificate=true|false (BETA - 
default=true)<br/>RotateKubeletServerCertificate=true|false (ALPHA - default=false)<br/>ServiceNodeExclusion=true|false (ALPHA - default=false)<br/>ServiceProxyAllowExternalIPs=true|false<br/>StreamingProxyRedirects=true|false (BETA - default=true)<br/>SupportIPVSProxyMode=true|false (ALPHA - default=false)<br/>SupportNodePidsLimit=true|false (BETA - default=true)<br/>TaintBasedEvictions=true|false (BETA - default=true)<br/>TaintNodesByCondition=true|false (BETA - default=true)<br/>VolumeScheduling=true|false (ALPHA - default=false)<br/>VolumeSubpath=true|false<br/>
<td></td><td style="line-height: 130%; word-wrap: break-word;">A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:<br/>APIListChunking=true|false (BETA - default=true)<br/>APIResponseCompression=true|false (ALPHA - default=false)<br/>Accelerators=true|false<br/>AdvancedAuditing=true|false (BETA - default=true)<br/>AllAlpha=true|false (ALPHA - default=false)<br/>AllowExtTrafficLocalEndpoints=true|false<br/>AppArmor=true|false (BETA - default=true)<br/>BlockVolume=true|false (ALPHA - default=false)<br/>CPUManager=true|false (BETA - default=true)<br/>CSIPersistentVolume=true|false (ALPHA - default=false)<br/>CustomPodDNS=true|false (ALPHA - default=false)<br/>CustomResourceValidation=true|false (BETA - default=true)<br/>DebugContainers=true|false <br/>DevicePlugins=true|false (ALPHA - default=false)<br/>DynamicKubeletConfig=true|false (ALPHA - default=false)<br/>EnableEquivalenceClassCache=true|false (ALPHA - default=false)<br/>ExpandPersistentVolumes=true|false (ALPHA - default=false)<br/>ExperimentalCriticalPodAnnotation=true|false (ALPHA - default=false)<br/>ExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false)<br/>HugePages=true|false (ALPHA - default=false)<br/>Initializers=true|false (ALPHA - default=false)<br/>KubeletConfigFile=true|false (ALPHA - default=false)<br/>LocalStorageCapacityIsolation=true|false (ALPHA - default=false)<br/>LocalStorageCapacityIsolationFSQuotaMonitoring=true|false (ALPHA - default=false)<br/>MountContainers=true|false (ALPHA - default=false)<br/>MountPropagation=true|false (ALPHA - default=false)<br/>PVCProtection=true|false (ALPHA - default=false)<br/>PersistentLocalVolumes=true|false (ALPHA - default=false)<br/>PodPriority=true|false (ALPHA - default=false)<br/>ReadOnlyAPIDataVolumes=true|false<br/>ResourceLimitsPriorityFunction=true|false (ALPHA - default=false)<br/>RotateKubeletClientCertificate=true|false (BETA - 
default=true)<br/>RotateKubeletServerCertificate=true|false (ALPHA - default=false)<br/>ServiceNodeExclusion=true|false (ALPHA - default=false)<br/>ServiceProxyAllowExternalIPs=true|false<br/>StreamingProxyRedirects=true|false (BETA - default=true)<br/>SupportIPVSProxyMode=true|false (ALPHA - default=false)<br/>SupportNodePidsLimit=true|false (BETA - default=true)<br/>TaintBasedEvictions=true|false (BETA - default=true)<br/>TaintNodesByCondition=true|false (BETA - default=true)<br/>TopologyManager=true|false (ALPHA - default=false)<br/>VolumeScheduling=true|false (ALPHA - default=false)<br/>VolumeSubpath=true|false<br/>
</td>
</tr>

@@ -1132,6 +1132,13 @@ kubelet [flags]
<tr>
<td></td><td style="line-height: 130%; word-wrap: break-word;">File containing x509 private key matching --tls-cert-file.</td>
</tr>

<tr>
<td colspan="2">--topology-manager-policy string</td>
</tr>
<tr>
<td></td><td style="line-height: 130%; word-wrap: break-word;">Topology Manager Policy to use. Possible values: `none`, `best-effort`, `restricted`, `single-numa-node` (Default `none`)</td>
</tr>

<tr>
<td colspan="2">-v, --v Level</td>
154 changes: 154 additions & 0 deletions content/en/docs/tasks/administer-cluster/topology-manager.md
@@ -0,0 +1,154 @@
---
title: Control Topology Management Policies on a node
reviewers:
- ConnorDoyle
- klueska
- lmdaly
- nolancon

content_template: templates/task
---

{{% capture overview %}}

{{< feature-state state="alpha" >}}

An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment.

In order to extract the best performance, optimizations related to CPU isolation and memory and device locality are required. However, in Kubernetes, these optimizations are handled by a disjoint set of components.

_Topology Manager_ is a Kubelet component that provides node-level policies to enable these performance optimizations in a way that is abstracted from the user.

{{% /capture %}}


{{% capture prerequisites %}}

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

{{% /capture %}}

{{% capture steps %}}

## Topology Manager

Prior to the introduction of Topology Manager, the CPU Manager and Device Manager in Kubernetes made resource allocation decisions independently of each other.
This can result in sub-optimal allocations on multi-socket systems, and performance- or latency-sensitive applications suffer as a result.
Sub-optimal here means, for example, that CPUs and devices are allocated from different NUMA Nodes, thus incurring additional latency.

The Topology Manager is a new Kubelet component that acts as a source of truth so that other Kubelet components can make topology-aligned resource allocation decisions.

The Topology Manager provides an interface for components, called *Hint Providers*, to send and receive topology information. The default *best-effort* algorithm takes
all possible NUMA Node combinations for an incoming container and chooses the best fit (being the narrowest NUMA node with resources available).

The Topology Manager receives topology information from the *Hint Providers* as a bitmask denoting the available NUMA Nodes, together with a preferred allocation indication. The Topology Manager policies perform a set of operations on the hints provided and converge on the hint that the policy determines gives the optimal result. If a non-optimal hint is stored, the preferred field for the hint is set to false. In the current policies, the optimal hint is the narrowest preferred mask.
The selected hint is stored as part of the Topology Manager. Depending on the policy configured, the pod can be accepted or rejected from the node based on the selected hint.
The *Hint Providers* then use the stored hint when making resource allocation decisions.

### Configuration

The Topology Manager currently:

* Works on nodes with the `static` CPU Manager Policy enabled. See [control CPU Management Policies](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/)
* Works on Pods in the `Guaranteed` {{< glossary_tooltip text="QoS class" term_id="qos-class" >}}.

If these conditions are met, Topology Manager will align CPU and device requests.

Topology Manager supports four allocation policies. You can set a policy via a Kubelet flag, `--topology-manager-policy`.
The supported policies are:

* `none` (default)
* `best-effort`
* `restricted`
* `single-numa-node`

### none policy {#policy-none}

This is the default policy and does not perform any topology alignment.

### best-effort policy {#policy-best-effort}

For each container in a Guaranteed Pod, the kubelet, with the `best-effort` topology
management policy, calls each Hint Provider to discover its resource availability.
Using this information, the Topology Manager stores the
optimal NUMA Node affinity for that container. If the affinity is not preferred,
Topology Manager stores it anyway and admits the pod to the node.

The *Hint Providers* can then use this information when making the
resource allocation decision.

### restricted policy {#policy-restricted}

For each container in a Guaranteed Pod, the kubelet, with the `restricted` topology
management policy, calls each Hint Provider to discover its resource availability.
Using this information, the Topology Manager stores the
optimal NUMA Node affinity for that container. If the affinity is not preferred,
Topology Manager rejects the pod from the node. This results in a pod in the `Pending` state.

The *Hint Providers* can then use this information when making the
resource allocation decision.

### single-numa-node policy {#policy-single-numa-node}

For each container in a Guaranteed Pod, the kubelet, with the `single-numa-node` topology
management policy, calls each Hint Provider to discover its resource availability.
Using this information, the Topology Manager determines whether a single NUMA Node affinity is possible.
If it is, Topology Manager stores this affinity, and the *Hint Providers* can then use this information when making the
resource allocation decision.
If, however, this is not possible, the Topology Manager rejects the pod from the node. This results in a pod in the `Pending` state.
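
The admission behaviour of the four policies can be summarised in a short Python sketch. The `admit_pod` function and its parameters are hypothetical, and the checks are a simplified reading of the policy descriptions above rather than the kubelet's actual code. In particular, treating "a single NUMA Node affinity is possible" as "the selected mask has exactly one bit set" is an assumption.

```python
def admit_pod(policy: str, mask: int, preferred: bool) -> bool:
    """Return True if a pod with the selected hint would be admitted.

    `mask` is the NUMA node bitmask of the selected hint and `preferred`
    is its preferred flag; both names are hypothetical.
    """
    if policy in ("none", "best-effort"):
        # none performs no alignment; best-effort admits the pod
        # even when the stored hint is not preferred.
        return True
    if policy == "restricted":
        # restricted rejects the pod when the hint is not preferred.
        return preferred
    if policy == "single-numa-node":
        # Assumption: exactly one bit set means the affinity fits
        # on a single NUMA node.
        return bin(mask).count("1") == 1
    raise ValueError(f"unknown topology manager policy: {policy}")
```

Under these assumptions, a non-preferred hint spanning two NUMA nodes is admitted by `best-effort` but rejected by both `restricted` and `single-numa-node`.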


### Example Usage

Consider the containers in the following pod specs:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
```
This pod runs in the `BestEffort` QoS class because no resource `requests` or
`limits` are specified.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
```

This pod runs in the `Burstable` QoS class because requests are less than limits.

If the selected policy is anything other than `none`, Topology Manager would not consider either of these Pod
specifications.


```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
```

This pod runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.

Topology Manager would consider this Pod. The Topology Manager consults the CPU Manager `static` policy, which returns the topology of available CPUs.
Topology Manager also consults Device Manager to discover the topology of available devices for `example.com/device`.

Topology Manager will use this information to store the best topology for this container. In the case of this Pod, CPU Manager and Device Manager will use this stored information at the resource allocation stage.
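
The QoS classification used in the three examples above can be sketched in Python. The `qos_class` function is a hypothetical helper and a deliberate simplification: it looks only at `cpu` and `memory`, compares quantities as literal strings (so `"1"` and `"1000m"` would not match), and ignores init containers.

```python
from typing import Dict, List

# One entry per container: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod from its containers' resource stanzas (simplified)."""
    # BestEffort: no container sets any request or limit.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        # A request that is not set defaults to the matching limit.
        requests = {**limits, **c.get("requests", {})}
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources,
            # with the request equal to that limit.
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"
```

Applied to the three pod specs above, the sketch returns `BestEffort`, `Burstable`, and `Guaranteed` respectively. Only the last of these would be considered by Topology Manager.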

{{% /capture %}}
@@ -261,9 +261,12 @@ kubectl delete namespace qos-example
* [Configure a Pod Quota for a Namespace](/docs/tasks/administer-cluster/quota-pod-namespace/)

* [Configure Quotas for API Objects](/docs/tasks/administer-cluster/quota-api-object/)

* [Control Topology Management policies on a node](/docs/tasks/administer-cluster/topology-manager/)
{{% /capture %}}





