Add Topology Manager proposal. #1680
Conversation
Thanks for the draft. A few comments, all nits except one suggestion, inline.
1. Iterate over the permutations of preference lists and compute
   bitwise-and over the masks in each permutation.
1. Store the first non-empty result and break out early.
1. If no non-empty result exists, return an error.
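The merge steps above can be sketched as follows. This is a minimal illustration, not the actual implementation; `NUMAMask` and `mergeHints` are assumed names.

```go
package main

// NUMAMask is a bitmask of NUMA node IDs: bit i set means node i is acceptable.
type NUMAMask uint64

// mergeHints walks the combinations of one mask per provider preference
// list (odometer-style), computes the bitwise AND of each combination,
// and returns the first non-empty result. It reports failure when no
// combination yields a non-empty mask.
func mergeHints(prefs [][]NUMAMask) (NUMAMask, bool) {
	if len(prefs) == 0 {
		return 0, false
	}
	for _, p := range prefs {
		if len(p) == 0 {
			return 0, false // a provider with no hints cannot be satisfied
		}
	}
	idx := make([]int, len(prefs))
	for {
		// Bitwise-AND over the masks in the current combination.
		result := ^NUMAMask(0)
		for i, p := range prefs {
			result &= p[idx[i]]
		}
		if result != 0 {
			return result, true // first non-empty result wins; break out early
		}
		// Advance to the next combination.
		i := len(idx) - 1
		for ; i >= 0; i-- {
			idx[i]++
			if idx[i] < len(prefs[i]) {
				break
			}
			idx[i] = 0
		}
		if i < 0 {
			return 0, false // exhausted all combinations: no non-empty result
		}
	}
}
```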
Should an error be returned in the case of non-strict NUMA assignment?
In my opinion, non-strict assignment is still a non-erroneous operation unless the user indicates that strict assignment is required.
Where's the note that users can select strict vs preferred? I didn't see it.
I don't think it's called out specifically - but I think a Kubelet flag for strict/preferred NUMA alignment could be useful. Some users may want to fail guaranteed pods if they cannot get NUMA affinity across resources, whereas some may want the pod to run regardless - with a preference for NUMA affinity where possible.
Maybe a new pod annotation like `kubernetes.io/RequireStrictNUMAAlignment`? The NUMA manager rejects a pod only when the pod has the annotation and the NUMA manager cannot find a strict NUMA assignment for it.
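A sketch of how such an opt-in check could look at admission time; the annotation key is the one proposed here, and the helper name and shape are hypothetical.

```go
package main

// strictNUMAAnnotation is the proposed (hypothetical) opt-in annotation key.
const strictNUMAAnnotation = "kubernetes.io/RequireStrictNUMAAlignment"

// requireStrictNUMAAlignment reports whether a pod's annotations opt it
// in to strict NUMA alignment; pods without the annotation would keep
// best-effort behavior and never be rejected for alignment reasons.
func requireStrictNUMAAlignment(annotations map[string]string) bool {
	return annotations[strictNUMAAnnotation] == "true"
}
```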
Added a note on opting in to strict mode.
spread among the available NUMA nodes in the system. We further assume
the operating system provides best-effort local page allocation for
containers (as long as sufficient HugePages are free on the local NUMA
node).
A future addition could be to extend cAdvisor to advertise huge pages available per NUMA node, enabling the NUMA Manager to take the huge pages available on each node into account when calculating NUMA node affinity.
Hm, good point! Want to add that as something we can do for beta? Just send a PR against my branch.
I think I would prefer to start simple on hugepages behavior. For roadmap, I don’t think anything special would be needed for alpha or even beta.
Added hugepages alignment to the beta graduation criteria. If it's the case that no action is required (per performance test results), then it can be marked done via proper documentation of that fact.
Considering that my colleagues already tested their DPDK-based workload on vanilla(ish) Kubernetes with CPUs and SR-IOV VFs manually aligned, and observed that their unaligned hugepages caused additional packet processing delays in the 100ms range, I think we don't need to worry about extra documentation :)
Where is this test?
In our lab, with our 4G "Cloudified" BTS application. I can probably get the exact test setup and measurements, but I probably cannot reproduce the test within the community for obvious (SW licence :) ) reasons.
But I'm fairly sure the same can be easily reproduced with testPMD too (I didn't put effort into it though, so it is just an assumption from my side).
easily when enumerating supported devices. For example, Linux
exposes the node ID in sysfs for PCI devices:
`/sys/devices/pci*/*/numa_node`.
1. Device Manager calls `GetAffinity()` method of NUMA manager when
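For illustration, reading and parsing that sysfs entry could look like the sketch below. The function names are assumptions; on Linux, a value of -1 in `numa_node` means the kernel does not know the device's locality.

```go
package main

import (
	"os"
	"strconv"
	"strings"
)

// parseNUMANode parses the contents of a numa_node sysfs file, e.g. "1\n".
// Linux reports -1 when the device's NUMA locality is unknown.
func parseNUMANode(contents string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(contents))
}

// numaNodeOf reads the NUMA node ID for a PCI device from sysfs; the
// concrete path would come from device enumeration in a real implementation.
func numaNodeOf(sysfsPath string) (int, error) {
	b, err := os.ReadFile(sysfsPath)
	if err != nil {
		return -1, err
	}
	return parseNUMANode(string(b))
}
```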
Not sure if this is out of scope: for the Device Manager, if available NUMA hints are returned here, it means we have enough devices for the container and the preparatory work has already been completed.
Thus, we may need to consider the relationship between what we do here and `pluginResourceUpdateFunc` in `predicateAdmitHandler` (which could be duplicated).
Great point. Here are two options:
- Make sure this admit handler always runs first
- Eliminate the side effects of the device plugin’s admit handler and add some kind of post-admit hook to do the device allocation after all of the admit handlers decide to really admit the pod.
This is a nice write-up. I want more time to think on it, especially for scenarios where G pods can’t get an optimal local fit.
alignment for Guaranteed QoS pods. Alternatively, the NUMA manager could
simply provide best-effort NUMA alignment for all pods.

The NUMA Manager component will be disabled behind a feature gate until
Is the NUMA manager only relevant in its first iteration if static cpu policy is enabled?
Practically speaking, yes.
- static cpu policy
- G pod
- integer cpu request container
- consuming a device
- on a multi socket machine
We wanted to start with this narrow use case, but at the same time define a lightweight internal API for other NUMA concerns.
It might seem narrow. But this opens up use of Kubernetes to a chunk of use-cases that are already in production on other platforms.
// NUMA-related resource assignments. The NUMA manager consults each
// hint provider at pod admission time.
type HintProvider interface {
    GetNUMAHints(pod v1.Pod, containerName string) []NUMAMask
When `GetNUMAHints()` returns an empty slice of `NUMAMask`, calculating the intersection of the sets returned from all `HintProvider`s is impossible. The function should be able to indicate whether the returned value expresses indifference (when the device plugin does not manage the requested resources) or an inability to offer a hint matching the user's requirements.
Perhaps: `GetNUMAHints(pod v1.Pod, containerName string) ([]NUMAMask, bool)`
So, you'd return `nil, false` to indicate "don't care"? That would work.
We were just having a discussion about it with @ppalucki and we believe that there are the following options:
- return `[]NUMAMask{}, false` when the `HintProvider` is indifferent to the pod spec and `[]NUMAMask{}, true` when the `HintProvider` cannot find a matching configuration
- return `[]NUMAMask{0b11}` when the `HintProvider` is indifferent to the pod spec and `[]NUMAMask{0b00}` when the `HintProvider` cannot find a matching configuration
Making a good choice here is about developer experience, I think - we should make the interface as straightforward to implement as possible.
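The first option in the list above could be expressed as the sketch below. To keep it self-contained, I substitute plain strings for `v1.Pod`; all names here are illustrative, not the proposal's final API.

```go
package main

// NUMAMask is a bitmask of NUMA node IDs.
type NUMAMask uint64

// HintProvider returns hints plus an explicit flag: ok == false means
// the provider is indifferent (it does not manage any resource the pod
// requested); an empty slice with ok == true means the provider cannot
// find a matching configuration.
type HintProvider interface {
	GetNUMAHints(podUID, containerName string) (hints []NUMAMask, ok bool)
}

// indifferentProvider illustrates the "don't care" return value.
type indifferentProvider struct{}

func (indifferentProvider) GetNUMAHints(podUID, containerName string) ([]NUMAMask, bool) {
	return nil, false
}
```

The appeal of this shape is that a provider author cannot accidentally conflate "no opinion" with "no feasible placement", which the plain `[]NUMAMask` return makes easy to get wrong.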
- [Support VF interrupt binding to specified CPU][sriov-issue-10]
- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]

Note that all of these concerns pertain only to multi-socket systems.
Correct behavior requires that the kernel receive accurate topology information from the underlying hardware (typically via the SLIT table). See sections 5.2.16 and 5.2.17 of the ACPI Specification http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf for more information.
1. CPU Manager static policy calls `GetAffinity()` method of NUMA
   manager when deciding CPU affinity.
1. Add `GetNUMAHints()` method to Device Manager.
1. Add NUMA Node ID to Device structure in the device plugin
What does NUMA Manager do if there is only one NUMA node?
Two options for this could be:
- NUMA Manager is only triggered on a multi-socket system. Perform check at NUMA Manager creation
- HintProviders will all return 1 (or maybe a -1) to NUMA Manager
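The first option (perform the check at manager creation) could look like this sketch. All names are illustrative, and the node count would come from cAdvisor in practice.

```go
package main

// Manager is a minimal stand-in for the NUMA manager's interface.
type Manager interface {
	GetAffinity(podUID, containerName string) uint64
}

// noopManager is returned on single-NUMA machines: every container is
// trivially "aligned" to node 0, so no admission-time work is needed.
type noopManager struct{}

func (noopManager) GetAffinity(podUID, containerName string) uint64 { return 0b1 }

// numaManager would hold the real bookkeeping (elided in this sketch).
type numaManager struct{}

func (numaManager) GetAffinity(podUID, containerName string) uint64 { return 0 }

// NewManager performs the multi-socket check once, at creation time.
func NewManager(numNUMANodes int) Manager {
	if numNUMANodes <= 1 {
		return noopManager{}
	}
	return numaManager{}
}
```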
With the current implementation of the Device Manager, the NUMA manager would need to be called after the Device Manager in the PodAdmitHandler list (devices for a pod are selected during admission). I'm wondering what's the right model:
Neither option looks good from a code point of view, though from an architecture POV I do prefer 2). Any ideas or suggestions on this?
@RenaudWasTaken I think this design requires the NUMA admit handler to run before the device manager and cpu manager (or any other NUMA hint providers). About the second option, are you saying we should do the Allocate() logic in the GetNUMAHints() method? When GetNUMAHints() is called we don't know the final decision about NUMA placement yet, so I don't think we can do device selection logic there.
@RenaudWasTaken what is your thinking behind the NUMA Manager being called after the Device Manager? @XiaoningDing agreed that we may have to call the NUMA admit handler before any hint providers.
# Alternatives

* [AutoNUMA][numa-challenges]: This kernel feature affects memory
I'd like to see the performance of autonuma with CPU affinity before adding support for explicit NUMA node pinning.
If a pod can be restricted to a single CPU core and a set of devices accessible from that CPU socket, I really wonder if autonuma would just work.
Hello @vishh !
I looked a bit into autonuma as I wasn't aware of it, from what I understand of it:
When the time comes for the scheduler to make a decision, it passes over the per-process statistics to determine whether the target process would be better off if it were moved to another node. If the process seems to be accessing most of its pages remotely, and it is better suited to the remote node than the processes already running there, it will be migrated over.
It seems to me that what it is trying to solve is maximizing locality of CPU node and Memory, which might be a good thing for certain use cases of this design but wouldn't solve the problem for devices as they are "sending" memory from the NUMA node to the device.
Please correct me if I misunderstood something :)
As soon as a process is launched with sched_setaffinity, autonuma no longer pays attention to those pids. IOW, cpumanager disables autonuma for the pids in that cgroup.
If a pod can be restricted to a single CPU core and a set of devices accessible from that CPU socket I really wonder if autonuma would just work.
We don't yet have a way to know which NUMA node devices live on, or to align that with CPU affinity. I agree with the above, but the precondition is what this proposal tries to address. We're still not suggesting to configure `cpuset.mems`.
Yep, that sounds like a better model than what I was imagining; somehow I got to thinking that we needed to do device assignment before.
the operating system provides best-effort local page allocation for
containers (as long as sufficient HugePages are free on the local NUMA
node).
- _CNI:_ Changing the Container Networking Interface is out of scope for
if I may chime-in from Nokia, representing the often-quoted "NFVI use-cases" :)
could you elaborate a little bit more on the topic of interworking between CNI specification, CNI plugins and the newly introduced NUMA manager component? Possibly not relying on the Device Manager component.
To put the question into context: in a performance-sensitive VNF, one of the main requirements is 1) to have isolated (what this means exactly is a topic for another discussion), exclusive CPU(s) allocated to a Pod, and 2) that said CPU(s) belong to the same NUMA node as the network devices used by the Pod.
However, network devices can be managed and consumed in a number of different ways, both implicit and explicit. PFs can be pushed directly into the Pod's netns, VFs can be created on top of PFs and then pushed into the Pod's netns, or a PF can be further divided via other means such as a DPDK-capable OVS or VPP managing virtio-user and vhost-user ports. (VF and PF control planes can also be bound to DPDK kernel drivers, in which case sharing is done via their PCI address; just to complicate the situation even more.)
Most of these management tasks are handled by CNI plugins, namely SRIOV (either huscat's or Intel's DPDK-capable fork), the vhostuser plugin, or any other proprietary CNI implementation. The way the Device Manager proposal shares devices (through their Unix socket) does not necessarily make it compatible with NFVI requirements, which in turn raises the question of how the current NUMA manager proposal can be enhanced to satisfy the above requirements if the Device Manager does not, or cannot, play a role in these use-cases.
So, trying to distill this train of thought into two short questions, or even additional requirements:
- Dynamic, exclusive CPU allocation needs to be aligned with the configured CNI plugin; can the current NUMA manager proposal handle this UC in the future, and if yes, how?
- If not, have you considered the possibility of CPU allocation based on a static NUMA request, e.g. "give me X exclusive CPUs from NUMA node A, and Y exclusive CPUs from NUMA node B"? (This mode would enable operators to at least manually align CPUs and CNI-managed NICs in a homogeneous cluster until NUMA-manager <-> CNI communication is implemented.)
@@ -0,0 +1,293 @@
# NUMA Manager
okay, so as a result of our offline discussion I was asked to post the streamlined NFVI requirements to the proposal
Following RQs should be added to the proposal for it to be able to serve NFVI needs (mainly mobile networks radio VNFs which need these high-performance optimizations)
RQ1: NUMA manager shall gather hints from (selected) CNI plugins before/during admission
Use-case: a must-have for DPDK user-space networking Pods using e.g. the SRIOV or vhostuser CNIs to set up their networks. Otherwise the application will experience a serious performance drop.
Note, that reaching certain performance thresholds is a mandatory, functional requirement for such NFVI applications
Comment: a possible, Device Plugin / Device Manager based "workaround" solution was discussed regarding the SRIOV CNI plugin. This generic proposal should at least outline how this interaction is imagined to be implemented in Kubernetes, showing an exact example (it can be the SRIOV CNI).
Implementation shall consider this requirement as mandatory right from the beginning (in case community considers running NFVI workloads as high-priority), otherwise NFVI radio applications can't use Kubernetes in production (or cluster admin needs to implement non-Kubernetes based workarounds for this situation)
RQ2: Hugepages shall be allocated from the same NUMA node as NICs and exclusive CPUs
Use-case: also a must-have for DPDK user-space networking Pods. Such applications experience serious performance drops today due to this missing feature (the scenario is real-life tested).
Note, that reaching certain performance thresholds is a mandatory, functional requirement for such NFVI applications
Comment: some discussions happened in the comment section regarding this feature, but it needs to be included in the proposal as a functional requirement. The implementation plan realizing this requirement should be crisply outlined, and follow-up design plans formulated (if needed, for example about the creation of a MemoryManager component).
Implementation shall consider this requirement as mandatory right from the beginning (in case community considers running NFVI workloads as high-priority), otherwise NFVI radio applications can't use Kubernetes in production (or cluster admin needs to implement non-Kubernetes based workarounds for this situation)
+1: other offline identified NFVI requirements regarding CPU manager (e.g. CPU pooling etc.), and networking (e.g. true multi-interfaces support etc.) will be posted / discussed separately
Added hugepages alignment to graduation criteria for beta.
I'm currently looking at this from a Device Plugin Author perspective. To implement What do you think?
@RenaudWasTaken perhaps the text needs clarification, but the expectation is that each Device descriptor from the plugin contains a socket ID. The Device Manager chooses the device to allocate based on this information, together with hints from other hint providers. Only after that does the allocate call occur. Does that work for you, or are you "faking" the device IDs in order to delay the choice of the actual device to bind in your plugin?
So I'll try to illustrate my thinking with a theoretical example: imagine a two-NUMA-node machine with one GPU on each node.

Story #1: The node receives two pods requesting one GPU each. For the first request, I expect the device plugin to return both sockets as a possible option. For the second request I expect the device plugin to return only one socket.

Story #2: The node receives one pod requesting one GPU; that pod runs and terminates normally, then the node receives another pod requesting one GPU. For the first request, I expect the device plugin to return both sockets as a possible option. For the second This is problematic today because of the following issues:
Do these examples make sense to you @ConnorDoyle? Or am I misunderstanding the design doc?
Hi @RenaudWasTaken -- based on the question, it looks like there might be a misunderstanding. The device plugin itself does not return hints. The Device Manager (internal to the Kubelet) implements the hint provider interface, and it supplies info based on the set of unallocated devices and their declared affinity. So, the topology negotiation occurs before the plugin receives any allocate callback. I can update the text to make it clearer, and also outline more explicitly how the device plugin API would be affected. In short, the

From there, your scenario number 2 is supported because although the device plugin does not yet know that the pod is terminal and the device is available for a subsequent allocation, the Kubelet does know this, and specifically the Device Manager can accurately respond to the "get hints" callback from the Topology Manager during pod admission.
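The GPU stories above reduce to the Device Manager deriving hints from its set of *unallocated* devices. A rough sketch of that calculation, with assumed names and shapes:

```go
package main

// NUMAMask is a bitmask of NUMA node IDs.
type NUMAMask uint64

// hintsFromFreeDevices returns one mask per NUMA node that still has
// enough unallocated devices of the requested resource. freeDeviceNodes
// holds the NUMA node ID of each currently free device; a node ID of -1
// (unknown locality) contributes no hint.
func hintsFromFreeDevices(freeDeviceNodes []int, needed int) []NUMAMask {
	perNode := map[int]int{}
	for _, n := range freeDeviceNodes {
		perNode[n]++
	}
	var hints []NUMAMask
	for node, free := range perNode {
		if node >= 0 && free >= needed {
			hints = append(hints, NUMAMask(1)<<uint(node))
		}
	}
	return hints
}
```

Because the Kubelet tracks allocations itself, a device freed by a terminated pod re-enters `freeDeviceNodes` without the plugin needing to know, which is what makes scenario #2 work.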
- Added device plugin protocol diff.
/cc @derekwaynecarr
Per sig-node discussion, I think this document is implementable and we can do any other adjustments as part of that process. /approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorDoyle, derekwaynecarr

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
* Unit test coverage.
* CPU Manager allocation policy takes topology hints into account.
* Device plugin interface includes socket ID.
* Device Manager allocation policy takes topology hints into account.
We need some way to introspect decisions, maybe via structured logs?
@ConnorDoyle @nolancon @lmdaly Also, it would be great to have a list of the common system designs that were considered, to inform the algorithm being implemented for the topology manager, which isn't described in much detail in the proposal.
If the hints are not compatible, the Topology Manager may choose to
reject the pod. Behavior in this case depends on a new Kubelet configuration
value to choose the topology policy. The Topology Manager supports two
modes: `strict` and `preferred` (default). In `strict` mode, the pod is
Rather than a Kubelet configuration value, I think it would make more sense to have the pod determine whether it wants to either have "strict" NUMA-alignment or fail to schedule, or whether it prefers to prioritize being able to run at all over ensuring optimal performance.
After all, it's the workload that determines how performance-sensitive it is.
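However the knob is surfaced (Kubelet configuration or per-pod), the admission decision itself reduces to something like this sketch; the names are illustrative.

```go
package main

// TopologyPolicy selects the behavior when no compatible hint exists.
type TopologyPolicy string

const (
	PolicyStrict    TopologyPolicy = "strict"
	PolicyPreferred TopologyPolicy = "preferred"
)

// admitPod returns false only when the merged hint mask is empty (no
// NUMA node satisfies every hint provider) and the policy is strict;
// under the preferred policy the pod runs with best-effort alignment.
func admitPod(policy TopologyPolicy, mergedMask uint64) bool {
	if mergedMask != 0 {
		return true
	}
	return policy == PolicyPreferred
}
```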
Add Topology Manager proposal.
/sig node
As noted in the text, this proposal only deals with aligning device and exclusive CPU assignment.