Node Topology Manager #693

lmdaly · 2019-01-17T10:16:32Z

lmdaly · 2019-01-17T10:17:21Z

/sig node
/kind feature
cc @ConnorDoyle @balajismaniam @nolancon

vishh · 2019-02-04T19:28:40Z

I can help inform this design based on learning from Borg. So count me in as a reviewer/approver.

jeremyeder · 2019-02-11T15:50:43Z

Is there any public documentation on how this feature works in borg?

vishh · 2019-02-11T16:20:32Z

Not about NUMA AFAIK.


spiffxp · 2019-02-24T02:47:03Z

FYI @claurence

This tracking issue and KEP (#781) did not make it in time for the v1.14 enhancements freeze nor the extended deadline. I appreciate that you opened these before the deadlines, but they didn't seem to get sufficient review or sign off. This will need to go through the exception process.

Until we decide whether this is worth the exception, I'm inclined to put a hold on all PRs associated with this enhancement.

ref: kubernetes/kubernetes#72828

jiayingz · 2019-02-25T17:54:12Z

/cc @jiayingz @dchen1107

claurence · 2019-02-27T18:08:15Z

@lmdaly I see y'all have 1.14 listed in the description as the alpha milestone. Since there wasn't a merged implementable KEP, this issue is not being tracked for 1.14. If you intend for it to be included in that release, please submit an exception request.

lmdaly · 2019-03-05T09:15:55Z

@claurence the KEP is now merged. (The KEP had previously been merged in the community repo; this was just to move it to the new enhancements repo as per the new guidelines.) Do we still need to submit an exception request to get this issue tracked for 1.14?

resouer · 2019-03-11T19:24:11Z

After reading the design and WIP PRs thoroughly, I have concerns that the current implementation is not as generic as the original topology design we proposed in #781. It currently reads more like node-level NUMA topology.

I left some comments for further discussion here: kubernetes/kubernetes#74345 (comment)

k82cn · 2019-03-20T00:46:42Z

the current implementation is not generic

I share the same concern :) How about other topologies, e.g. links between devices (NVLink for GPUs)?

lmdaly · 2019-03-25T13:25:11Z

@resouer @k82cn The initial proposal deals only with aligning the decisions made by cpu manager and device manager to ensure proximity of devices with the cpu the container runs on. Satisfying the inter-device affinity was a non-goal of the proposal.

If, however, the current implementation is blocking the addition of inter-device affinity in the future, then I am happy to change the implementation once I understand how it is doing so.

klueska · 2019-04-05T13:09:50Z

I think the main issue I see with the current implementation and the ability to support inter-device affinity is the following:

To support inter-device affinity you normally need to first figure out which devices you would like to allocate to a container before deciding what socket affinity you would like the container to have.

For example, with Nvidia GPUs, for optimal connectivity, you first need to find and allocate the set of GPUs with the most connected NVLINKs before determining what socket affinity that set has.

From what I can tell in the current proposal, the assumption is that these operations happen in reverse order, i.e. the socket affinity is decided before doing the allocation of devices.
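The "devices first, socket second" ordering described above can be sketched roughly as follows. This is only an illustration, not code from the proposal: `gpuPair`, `bestPair`, `socketOf`, the link counts, and the GPU-to-socket map are all hypothetical.

```go
package main

import "fmt"

// gpuPair is a hypothetical candidate allocation of two GPUs.
type gpuPair struct {
	A, B    int // GPU indices
	NVLinks int // number of NVLINKs directly connecting A and B
}

// bestPair returns the candidate with the most NVLINKs.
// On a tie, the earlier candidate is kept.
func bestPair(pairs []gpuPair) (gpuPair, bool) {
	if len(pairs) == 0 {
		return gpuPair{}, false
	}
	best := pairs[0]
	for _, p := range pairs[1:] {
		if p.NVLinks > best.NVLinks {
			best = p
		}
	}
	return best, true
}

// socketOf derives the socket affinity of an already chosen pair.
// It reports false if a GPU is unknown or the two GPUs sit on different sockets.
func socketOf(p gpuPair, gpuSocket map[int]int) (int, bool) {
	sa, okA := gpuSocket[p.A]
	sb, okB := gpuSocket[p.B]
	if !okA || !okB || sa != sb {
		return 0, false
	}
	return sa, true
}

func main() {
	// Assumed topology: GPUs 0-3 on socket 0, GPUs 4-7 on socket 1.
	gpuSocket := map[int]int{0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
	pairs := []gpuPair{{0, 1, 1}, {2, 3, 2}, {6, 7, 2}}

	// Step 1: pick the devices. Step 2: only then learn the socket.
	if p, ok := bestPair(pairs); ok {
		if s, ok := socketOf(p, gpuSocket); ok {
			fmt.Printf("GPUs (%d, %d) -> socket %d\n", p.A, p.B, s)
		}
	}
}
```

The point of the sketch is that the socket falls out of the device choice, which is the reverse of deciding the socket affinity up front.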

ConnorDoyle · 2019-04-05T14:24:11Z

That’s not necessarily true @klueska. If the topology hints were extended to encode point-to-point device topology, the Device Manager could consider that when reporting socket affinity. In other words, cross device topology wouldn’t need to leak out of the scope of the device manager. Does that seem feasible?

klueska · 2019-04-05T16:11:52Z

Maybe I'm confused about the flow somehow. This is how I understand it:

1. At initialization, device plugins (not the devicemanager) register themselves with the topologymanager so it can issue callbacks on them at a later time.
2. When a pod is submitted, the kubelet calls the lifecycle.PodAdmitHandler on the topologymanager.
3. The lifecycle.PodAdmitHandler calls GetTopologyHints on each registered device plugin.
4. It then merges these hints to produce a consolidated TopologyHint associated with the pod.
5. If it decides to admit the pod, it returns successfully from lifecycle.PodAdmitHandler, storing the consolidated TopologyHint for the pod in a local state store.
6. At some point in the future, the cpumanager and the devicemanager call GetAffinity(pod) on the topologymanager to retrieve the TopologyHint associated with the pod.
7. The cpumanager uses this TopologyHint to allocate a CPU.
8. The devicemanager uses this TopologyHint to allocate a set of devices.
9. Initialization of the pod continues...

If this is correct, I guess I'm struggling with what happens between the point in time when the device plugin reports its TopologyHints and the time when the devicemanager does the actual allocation.

If these hints are meant to encode "preferences" for allocation, then I think what you are saying is to have a structure more like:

type TopologyHints struct {
    hints []struct {
        SocketID int
        DeviceIDs []int
    }
}

Where we pass not only a list of socket affinity preferences, but also how those socket affinity preferences pair with allocatable GPU preferences.

If this is the direction you are thinking, then I think we could make it work, but we would need to somehow coordinate between the cpumanager and the devicemanager to make sure they "accepted" the same hint when making their allocations.

Is there something in place that allows this already that I missed?

eloyekunle · 2019-04-06T10:18:29Z

@klueska

I think what happens, with some minor corrections to your flow, is:

At initialization, device plugins register themselves with the devicemanager so it can issue callbacks on it at a later time.
The lifecycle.PodAdmitHandler calls GetTopologyHints on each topology-aware component in the Kubelet, currently devicemanager and cpumanager.

In this case, what will be represented as topology-aware in the Kubelet are the cpumanager and the devicemanager. The topology manager is only intended to coordinate allocations between topology-aware components.
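A rough sketch of that coordination, under heavy simplification: hints are reduced to a socket bitmask, merging is a bitwise AND, and a pod is rejected when no common socket remains. All three choices are assumptions made for illustration; `SocketMask`, `fixedProvider`, and `Manager` here are hypothetical stand-ins, not the kubelet types.

```go
package main

import "fmt"

// SocketMask is a simplified hint: bit i set means socket i is acceptable.
type SocketMask uint64

// HintProvider is what a topology-aware component (cpumanager,
// devicemanager) would implement in this sketch.
type HintProvider interface {
	GetTopologyHints(pod string) SocketMask
}

// fixedProvider always reports the same mask; it stands in for a real component.
type fixedProvider struct{ mask SocketMask }

func (f fixedProvider) GetTopologyHints(pod string) SocketMask { return f.mask }

// Manager gathers hints at admission time and stores the merged result.
type Manager struct {
	providers []HintProvider
	affinity  map[string]SocketMask
}

func NewManager(providers ...HintProvider) *Manager {
	return &Manager{providers: providers, affinity: make(map[string]SocketMask)}
}

// Admit asks every provider for hints, merges them with a bitwise AND,
// and records the result. It returns false if no socket suits everyone.
func (m *Manager) Admit(pod string) bool {
	merged := ^SocketMask(0) // start with "any socket"
	for _, p := range m.providers {
		merged &= p.GetTopologyHints(pod)
	}
	if merged == 0 {
		return false
	}
	m.affinity[pod] = merged
	return true
}

// GetAffinity is what the providers call later, when they actually allocate.
func (m *Manager) GetAffinity(pod string) (SocketMask, bool) {
	mask, ok := m.affinity[pod]
	return mask, ok
}

func main() {
	cpu := fixedProvider{mask: 0b11} // CPUs available on sockets 0 and 1
	dev := fixedProvider{mask: 0b10} // device only reachable from socket 1
	m := NewManager(cpu, dev)
	if m.Admit("pod-a") {
		mask, _ := m.GetAffinity("pod-a")
		fmt.Printf("pod-a affinity mask: %02b\n", mask)
	}
}
```

Because both components read the same stored result, neither can "pick first" and strand the other on the wrong socket.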

For this:

but we would need to somehow coordinate between the cpumanager and the devicemanager to make sure they "accepted" the same hint when making their allocations.

This is what the topologymanager itself was introduced to achieve. From one of the earlier drafts,

These components should coordinate in order to avoid cross NUMA assignments. The problems related to this coordination are tricky; cross domain requests such as “An exclusive core on the same NUMA node as the assigned NIC” involves both CNI and the CPU manager. If the CPU manager picks first, it may select a core on a NUMA node without an available NIC and vice-versa.

klueska · 2019-04-09T14:41:45Z

I see.

So the devicemanager and cpumanager both implement GetTopologyHints() as well as call GetAffinity(), avoiding direct interaction from the topologymanager with any underlying device plugins. Looking more closely at the code, I see that the devicemanager simply delegates control to the plugins to help fill in TopologyHints, which makes more sense in the end anyway.

Circling back to the original question / issue I raised though....

From Nvidia's perspective, I think we can make everything work with this proposed flow, assuming more information is added to the TopologyHints struct (and consequently the device plugin interface) to report point-to-point link information in the future.

However, I think starting with a SocketMask as the primary data structure for advertising socket affinity may limit our ability to expand TopologyHints with point-to-point information in the future without breaking the existing interface. The primary reason being that (at least in the case of Nvidia GPUs) the preferred socket depends on which GPUs are actually going to be allocated in the end.

For example, consider the figure below, when attempting to allocate 2 GPUs to a pod with optimal connectivity:

The GPU combinations of (2, 3) and (6, 7) both have 2 NVLINKs and reside on the same PCIe bus. They should therefore be considered equal candidates when attempting to allocate 2 GPUs to a pod. Depending on which combination is chosen, however, a different socket will obviously be preferred as (2, 3) is connected to socket 0 and (6, 7) is connected to socket 1.

This information will somehow need to be encoded in the TopologyHints struct so that the devicemanager can perform one of these desired allocations in the end (i.e. whichever one the topologymanager consolidates the hints down to). Likewise, the dependency between the preferred device allocations and the preferred socket will need to be encoded in TopologyHints so that the cpumanager can allocate CPUs from the correct socket.

A potential solution specific to Nvidia GPUs for this example would look something like:

type TopologyHint struct {
    SocketID int
    DeviceIDs []int
}

type TopologyHints []TopologyHint

devicemanagerhints := &TopologyHints{
    {SocketID: 0, DeviceIDs: []int{2, 3}},
    {SocketID: 1, DeviceIDs: []int{6, 7}},
}

cpumanagerhints := &TopologyHints{
    {SocketID: 1},
}

Where the topologymanager would consolidate these hints to return {SocketID: 1, DeviceIDs: []int{6, 7}} as the preferred hint when the devicemanager and cpumanager later call GetAffinity().
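One way that consolidation step could look, as a minimal sketch: the matching rule used here (take the first device hint whose socket also appears in the cpumanager hints) is an assumption for illustration, and `consolidate` is a hypothetical helper, not an existing function.

```go
package main

import "fmt"

type TopologyHint struct {
	SocketID  int
	DeviceIDs []int
}

type TopologyHints []TopologyHint

// consolidate returns the first device hint whose socket is also
// acceptable to the cpumanager. It reports false if nothing matches.
func consolidate(deviceHints, cpuHints TopologyHints) (TopologyHint, bool) {
	cpuSockets := make(map[int]bool, len(cpuHints))
	for _, h := range cpuHints {
		cpuSockets[h.SocketID] = true
	}
	for _, h := range deviceHints {
		if cpuSockets[h.SocketID] {
			return h, true
		}
	}
	return TopologyHint{}, false
}

func main() {
	devicemanagerhints := TopologyHints{
		{SocketID: 0, DeviceIDs: []int{2, 3}},
		{SocketID: 1, DeviceIDs: []int{6, 7}},
	}
	cpumanagerhints := TopologyHints{
		{SocketID: 1},
	}

	if hint, ok := consolidate(devicemanagerhints, cpumanagerhints); ok {
		fmt.Printf("%+v\n", hint) // {SocketID:1 DeviceIDs:[6 7]}
	}
}
```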

While this may or may not provide a generic enough solution for all accelerators, replacing SocketMask in the TopologyHints struct with something structured more like the following would allow us to expand each individual hint with more fields in the future:

Note that GetTopologyHints() still returns TopologyHints, while GetAffinity() has been modified to return a single TopologyHint rather than TopologyHints.

type TopologyHint struct {
    SocketID int
}

type TopologyHints []TopologyHint

&TopologyHints{
    {SocketID: 0},
    {SocketID: 1},
}

type HintProvider interface {
	GetTopologyHints(pod v1.Pod, container v1.Container) TopologyHints
}

type Store interface {
	GetAffinity(podUID string, containerName string) TopologyHint
}
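For illustration only, a Store with this shape could be backed by a plain map keyed on pod UID and container name. `memoryStore`, `containerKey`, `setAffinity`, and the use of SocketID -1 to mean "no recorded preference" are all hypothetical choices for this sketch.

```go
package main

import "fmt"

type TopologyHint struct {
	SocketID int
}

type Store interface {
	GetAffinity(podUID string, containerName string) TopologyHint
}

// containerKey identifies one container within one pod.
type containerKey struct {
	podUID        string
	containerName string
}

// memoryStore keeps the consolidated hint for each container in memory.
type memoryStore struct {
	hints map[containerKey]TopologyHint
}

// Compile-time check that memoryStore satisfies Store.
var _ Store = (*memoryStore)(nil)

func newMemoryStore() *memoryStore {
	return &memoryStore{hints: make(map[containerKey]TopologyHint)}
}

// setAffinity records the single hint chosen for a container at admission.
func (s *memoryStore) setAffinity(podUID, containerName string, h TopologyHint) {
	s.hints[containerKey{podUID, containerName}] = h
}

// GetAffinity returns the recorded hint, or SocketID -1 if none exists.
func (s *memoryStore) GetAffinity(podUID string, containerName string) TopologyHint {
	if h, ok := s.hints[containerKey{podUID, containerName}]; ok {
		return h
	}
	return TopologyHint{SocketID: -1}
}

func main() {
	store := newMemoryStore()
	store.setAffinity("uid-123", "trainer", TopologyHint{SocketID: 1})

	fmt.Println(store.GetAffinity("uid-123", "trainer").SocketID)
	fmt.Println(store.GetAffinity("uid-123", "sidecar").SocketID)
}
```

Returning one hint per container means the cpumanager and devicemanager cannot disagree about which hint was accepted.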

Thoughts?

Atharva-Shinde · 2023-02-08T18:29:01Z

Hey again @swatisehgal @klueska
Please try to get the KEP PR #3745 (addressing the changes mentioned above), merged before tomorrow's Enhancement Freeze :)
The status of this enhancement is still marked as at risk

swatisehgal · 2023-02-08T18:52:59Z

Thanks for the reminder, I am waiting for another round of PRR review.

swatisehgal · 2023-02-08T18:59:10Z

I have also pinged SIG node approvers on slack. Let's see if we can manage to get this in this release.

johnbelamaric · 2023-02-08T19:49:52Z

@swatisehgal ok, just commented - please take your reply and put it in the KEP and we're good to go for PRR

SergeyKanzhelev · 2023-02-08T23:58:16Z

@Atharva-Shinde this KEP should satisfy all requirements now. Ready to be marked as Tracked

Atharva-Shinde · 2023-02-09T08:34:06Z

Awesome!
With all the KEP requirements in place and merged into k/enhancements, this enhancement is all good for the upcoming enhancements freeze. 🚀

The status of this enhancement is marked as tracked. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

swatisehgal · 2023-02-15T15:49:48Z

/reopen

Some PRs are still pending and need to be merged for this work to be marked as complete.

k8s-ci-robot · 2023-02-15T15:49:53Z

@swatisehgal: Reopened this issue.


mickeyboxell · 2023-03-07T15:39:05Z

Hi @klueska @khenidak @lmdaly 👋, I’m reaching out from the 1.27 Release Docs team. This enhancement is marked as ‘Needs Docs’ for the 1.27 release.

Please follow the steps detailed in the documentation to open a PR against dev-1.27 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by March 16. For more information, please take a look at Documenting for a release to familiarize yourself with the documentation requirements for the release.

Please feel free to reach out with any questions. Thanks!

klueska · 2023-03-07T15:46:27Z

@swatisehgal ^^

Atharva-Shinde · 2023-03-11T19:14:09Z

Hey again @swatisehgal @klueska 👋 Enhancements team here,
Just checking in as we approach 1.27 code freeze at 17:00 PDT on Tuesday 14th March 2023.

Here's where this enhancement currently stands:

All PRs to the Kubernetes repo that are related to your enhancement are linked in the above issue description (for tracking purposes).
All PR/s are fully merged by the code freeze deadline.

Also please let me know if there are other PRs in k/k we should be tracking for this KEP.
As always, we are here to help if any questions come up. Thanks!

swatisehgal · 2023-03-16T12:21:59Z

@mickeyboxell I have created a docs PR and linked it in the comment tracking the GA graduation work here.

marosset · 2023-03-20T18:02:03Z

/stage stable

SergeyKanzhelev · 2023-05-05T23:01:47Z

NEXT: remove the feature gate in 1.29 and mark it as "implemented" in kep.yaml.

salehsedghpour · 2024-01-06T17:20:36Z

/remove-label lead-opted-in

KunWuLuan · 2024-01-30T07:58:44Z

Hi, I am new to using the kubelet Topology Manager with the single-numa-node policy to manage my workloads. I have a question: why is this option a node-level setting? Would there be any problem if it were a pod-level setting?

k8s-ci-robot added the needs-sig label Jan 17, 2019

k8s-ci-robot added the sig/node label and removed the needs-sig label Jan 17, 2019

lmdaly mentioned this issue Jan 17, 2019

Topology Manager Tracking Issue kubernetes/kubernetes#72828

Closed

k8s-ci-robot added the kind/feature label Jan 17, 2019

spiffxp added the tracked/no label Feb 24, 2019

Atharva-Shinde moved this from At Risk to Tracked in 1.27 Enhancements Tracking Feb 9, 2023

k8s-ci-robot closed this as completed in kubernetes/kubernetes#115590 Feb 15, 2023

k8s-ci-robot reopened this Feb 15, 2023

swatisehgal mentioned this issue Feb 27, 2023

node: topologymgr: Graduate Kubelet Topology Manager to GA kubernetes/kubernetes#116093

Merged

klueska moved this from Done to Looking at now in @klueska's k8s review queue Mar 3, 2023

klueska moved this from Looking at now to Ongoing Enhancements in @klueska's k8s review queue Mar 3, 2023

swatisehgal mentioned this issue Mar 16, 2023

node: topologymgr: docs: Kubelet Topology Manager graduation to GA kubernetes/website#40044

Merged

marosset removed the tracked/no label Mar 20, 2023

k8s-ci-robot added the stage/stable label and removed the stage/beta label Mar 20, 2023

SergeyKanzhelev mentioned this issue May 5, 2023

Topology Manager is GA now #3988

Merged

k8s-ci-robot closed this as completed in #3988 May 12, 2023

github-project-automation bot moved this from Ongoing Enhancements to Done in @klueska's k8s review queue May 12, 2023

k8s-ci-robot removed the lead-opted-in label Jan 6, 2024
