
Add Topology Manager proposal. #1680

Merged (8 commits) on Jan 8, 2019

Conversation

@ConnorDoyle (Contributor) commented Jan 25, 2018

/sig node

As noted in the text, this proposal only deals with aligning device and exclusive CPU assignment.

@k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 25, 2018
@vikaschoudhary16 (Contributor) left a comment

Thanks for the draft. A few comments, all nits except one suggestion, inline.

1. Iterate over the permutations of preference lists and compute
bitwise-and over the masks in each permutation.
1. Store the first non-empty result and break out early.
1. If no non-empty result exists, return an error.
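
For illustration, a minimal Go sketch of the merge step quoted above, assuming hints are plain bitmasks over NUMA node IDs; the `NUMAMask` representation and the provider plumbing are simplified here and are not taken from the proposal text:

```go
package main

import (
	"errors"
	"fmt"
)

// NUMAMask is a bitmask of NUMA node IDs: bit i set means node i is acceptable.
type NUMAMask uint64

// mergeHints walks the cross product of per-provider preference lists,
// bitwise-ANDs one mask from each list, keeps the first non-empty result,
// and returns an error when no combination yields a non-empty intersection.
func mergeHints(providerHints [][]NUMAMask) (NUMAMask, error) {
	var walk func(depth int, acc NUMAMask) (NUMAMask, bool)
	walk = func(depth int, acc NUMAMask) (NUMAMask, bool) {
		if depth == len(providerHints) {
			return acc, acc != 0
		}
		for _, m := range providerHints[depth] {
			if merged, ok := walk(depth+1, acc&m); ok {
				return merged, true // break out early on the first non-empty result
			}
		}
		return 0, false
	}
	if merged, ok := walk(0, ^NUMAMask(0)); ok {
		return merged, nil
	}
	return 0, errors.New("no NUMA affinity satisfies all hint providers")
}

func main() {
	hints := [][]NUMAMask{
		{0b01},       // e.g. CPU Manager prefers node 0
		{0b01, 0b10}, // e.g. Device Manager can serve from node 0 or node 1
	}
	merged, err := mergeHints(hints)
	fmt.Printf("%02b %v\n", merged, err) // 01 <nil> -> node 0 satisfies both providers
}
```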

Should an error be returned in the case of non-strict numa assignment?


In my opinion, non-strict assignment is still a non-erroneous operation unless the user indicates that strict assignment is required.

Contributor

Where's the note that users can select strict vs preferred? I didn't see it.


I don't think it's called out specifically - but I think a Kubelet flag for strict/preferred NUMA alignment could be useful. Some users may want to fail guaranteed pods if they cannot get NUMA affinity across resources, whereas some may want the pod to run regardless - with a preference for NUMA affinity where possible.


Maybe a new pod annotation like kubernetes.io/RequireStrictNUMAAlignment ? The NUMA manager rejects a pod only when the pod has the annotation and NUMA manager cannot find a strict NUMA assignment for it.
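
A tiny Go sketch of how such an opt-in check could look; the annotation key is the one suggested in this comment and is hypothetical, not part of the proposal:

```go
package numamanager // hypothetical package, for illustration only

import v1 "k8s.io/api/core/v1"

// requiresStrictNUMAAlignment reports whether the pod opted in to strict
// alignment via the (hypothetical) annotation suggested above. The NUMA
// manager would reject the pod only when this returns true and no strict
// NUMA assignment can be found.
func requiresStrictNUMAAlignment(pod *v1.Pod) bool {
	val, ok := pod.Annotations["kubernetes.io/RequireStrictNUMAAlignment"]
	return ok && val == "true"
}
```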

Contributor Author

Added a note on opting in to strict mode.

spread among the available NUMA nodes in the system. We further assume
the operating system provides best-effort local page allocation for
containers (as long as sufficient HugePages are free on the local NUMA
node).

A future addition could be to extend cAdvisor to advertise huge pages available per NUMA node, enabling the NUMA Manager to take the huge pages available on each node into account when calculating NUMA node affinity.

Contributor Author

Hm, good point! Want to add that as something we can do for beta? Just send a PR against my branch.

Member

I think I would prefer to start simple on hugepages behavior. For roadmap, I don’t think anything special would be needed for alpha or even beta.

Contributor Author

Added hugepages alignment to the beta graduation criteria. If it's the case that no action is required (per performance test results), then it can be marked done via proper documentation of that fact.

@Levovar commented Sep 18, 2018

Considering that my colleagues already tested their DPDK-based workload on vanilla(ish) Kubernetes with CPUs and SR-IOV VFs manually aligned, and then observed that their unaligned hugepages caused additional packet-processing delays in the 100ms range, I think we don't need to worry about extra documentation :)

Contributor

Where is this test?

@Levovar commented Sep 18, 2018

In our lab, with our 4G "Cloudified" BTS application. The exact test setup and measurements I can probably get; the test itself I probably cannot reproduce within the community for obvious (SW licence :) ) reasons.

But I'm fairly sure the same can easily be reproduced with testPMD too (I didn't put effort into it though, so it is just an assumption from my side)

easily when enumerating supported devices. For example, Linux
exposes the node ID in sysfs for PCI devices:
`/sys/devices/pci*/*/numa_node`.
1. Device Manager calls `GetAffinity()` method of NUMA manager when


Not sure if this is out of scope:

For the Device Manager, if available NUMA hints are returned here, it means we have enough devices for the container and the preparatory work has already been completed.

Thus, we may need to consider the relationship between what we do here and pluginResourceUpdateFunc in predicateAdmitHandler (which could be duplicated).

Contributor Author

Great point. Here are two options:

  1. Make sure this admit handler always runs first
  2. Eliminate the side effects of the device plugin’s admit handler and add some kind of post-admit hook to do the device allocation after all of the admit handlers decide to really admit the pod.

@derekwaynecarr (Member) left a comment

This is a nice write-up. I want more time to think on it, especially for scenarios where G pods can’t get an optimal local fit.

alignment for Guaranteed QoS pods. Alternatively, the NUMA manager could
simply provide best-effort NUMA alignment for all pods.

The NUMA Manager component will be disabled behind a feature gate until
Member

Is the NUMA manager only relevant in its first iteration if static cpu policy is enabled?

Contributor Author

Practically speaking, yes.

  • static cpu policy
  • G pod
  • integer cpu request container
  • consuming a device
  • on a multi socket machine

We wanted to start with this narrow use case, but at the same time define a lightweight internal API for other NUMA concerns.

Contributor

It might seem narrow. But this opens up use of Kubernetes to a chunk of use-cases that are already in production on other platforms.

@derekwaynecarr self-assigned this Jan 30, 2018
// NUMA-related resource assignments. The NUMA manager consults each
// hint provider at pod admission time.
type HintProvider interface {
GetNUMAHints(pod v1.Pod, containerName string) []NUMAMask

When GetNUMAHints() returns an empty slice of NUMAMask, calculating the intersection of the sets returned from all HintProviders is impossible. The function should be able to indicate whether the returned value means indifference (the device plugin does not manage the requested resources) or an inability to offer a hint matching the user's requirements.


Perhaps:

GetNUMAHints(pod v1.Pod, containerName string) ([]NUMAMask, bool)

Contributor Author

So, you'd return nil, false to indicate "don't care"? That would work.


We were just having a discussion about it with @ppalucki and we believe there are the following options:

  • return []NUMAMask{}, false when HintProvider is indifferent to pod spec and []NUMAMask{}, true when HintProvider cannot find matching configuration
  • return []NUMAMask{0b11} when HintProvider is indifferent to pod spec and []NUMAMask{0b00} when HintProvider cannot find matching configuration

Making a good choice here is about developer experience, I think - we should make the interface as straightforward to implement as possible.
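
To make the first option concrete, here is a rough Go sketch of a signature with an explicit boolean; names and semantics follow the comments above and are assumptions, not the final interface:

```go
package numamanager // hypothetical package, for illustration only

import v1 "k8s.io/api/core/v1"

// NUMAMask is a bitmask of acceptable NUMA node IDs.
type NUMAMask uint64

// HintProvider with an explicit "provides hints" flag.
type HintProvider interface {
	// GetNUMAHints returns (nil, false) when the provider is indifferent to
	// this container (it manages none of the requested resources), and
	// (empty slice, true) when it manages the resources but cannot satisfy
	// the request from any single NUMA node.
	GetNUMAHints(pod v1.Pod, containerName string) ([]NUMAMask, bool)
}
```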

- [Support VF interrupt binding to specified CPU][sriov-issue-10]
- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]

Note that all of these concerns pertain only to multi-socket systems.
Contributor

Correct behavior requires that the kernel receive accurate topology information from the underlying hardware (typically via the SLIT table). See sections 5.2.16 and 5.2.17 of the ACPI Specification http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf for more information.


1. CPU Manager static policy calls `GetAffinity()` method of NUMA
manager when deciding CPU affinity.
1. Add `GetNUMAHints()` method to Device Manager.
1. Add NUMA Node ID to Device structure in the device plugin
Contributor

What does NUMA Manager do if there is only one NUMA node?


Two options for this could be:

  1. NUMA Manager is only triggered on a multi-socket system. Perform check at NUMA Manager creation
  2. HintProviders will all return 1 (or maybe a -1) to NUMA Manager
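
A sketch of the first option, assuming the kubelet can learn the NUMA node count at startup (e.g. from machine info); the names here are illustrative, not part of the proposal:

```go
package numamanager // hypothetical package, for illustration only

// NUMAManager is disabled entirely on single-NUMA-node machines, so hint
// providers are never consulted there (option 1 above).
type NUMAManager struct {
	enabled bool
	nodes   int
}

// NewNUMAManager performs the check once, at manager creation time.
func NewNUMAManager(numaNodeCount int) *NUMAManager {
	return &NUMAManager{
		enabled: numaNodeCount > 1, // only active on multi-socket systems
		nodes:   numaNodeCount,
	}
}
```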

@RenaudWasTaken

With the current implementation of the Device Manager, the NUMA manager would need to be called after the Device Manager in the PodAdmitHandler list (devices for a pod are selected during admission).

I'm wondering what's the right model:

  1. Have the NUMA Admit Handler always be the last Admit Handler
  2. Have the device manager logic (device selection, gRPC calls, ...) be executed in GetNUMAHints

Neither option looks good from a code point of view, though from an architecture POV I do prefer 2). Any ideas or suggestions on this?

@XiaoningDing

@RenaudWasTaken I think this design requires NUMA admit handler to run before device manager and cpu manager (or any other NUMA hint providers).

About the second option, are you saying we should do the Allocate() logic in the GetNUMAHints() method? When GetNUMAHints() is called we don't know the final decision about NUMA placement yet, so I don't think we can do device selection logic there.

@lmdaly commented Feb 21, 2018

@RenaudWasTaken what is your thinking behind NUMA Manager being called after Device Manager?
as @XiaoningDing said - the GetNUMAHints() call should only return a list of devices and their sockets for a requesting container - no device allocation decisions are made at this time.

@XiaoningDing agreed that we may have to call the NUMA admit handler before any hint providers.


# Alternatives

* [AutoNUMA][numa-challenges]: This kernel feature affects memory
Contributor

I'd like to see the performance of autonuma with CPU affinity before adding support for explicit NUMA node pinning.
If a pod can be restricted to a single CPU core and a set of devices accessible from that CPU socket, I really wonder if autonuma would just work.


Hello @vishh !

I looked a bit into autonuma as I wasn't aware of it, from what I understand of it:

When the time comes for the scheduler to make a decision, it passes over the per-process statistics to determine whether the target process would be better off if it were moved to another node. If the process seems to be accessing most of its pages remotely, and it is better suited to the remote node than the processes already running there, it will be migrated over.

It seems to me that what it is trying to solve is maximizing locality of CPU node and memory, which might be a good thing for certain use cases of this design, but it wouldn't solve the problem for devices, as they are "sending" memory from the NUMA node to the device.

Please correct me if I misunderstood something :)

Contributor

As soon as a process is launched with sched_setaffinity, autonuma no longer pays attention to those pids. IOW, the CPU manager disables autonuma for the pids in that cgroup.

Contributor Author

> If a pod can be restricted to a single CPU core and a set of devices accessible from that CPU socket I really wonder if autonuma would just work.

We don't yet have a way to know which NUMA node devices live on, or to align that with CPU affinity. I agree with the above, but the precondition is what this proposal tries to address. We're still not suggesting to configure cpuset.mems.

@RenaudWasTaken

> @RenaudWasTaken what is your thinking behind NUMA Manager being called after Device Manager?
> as @XiaoningDing said - the GetNUMAHints() call should only return a list of devices and their sockets for a requesting container - no device allocation decisions are made at this time.

Yep, that sounds like a better model than what I was imagining; somehow I got into thinking that we needed to do device assignment before.

the operating system provides best-effort local page allocation for
containers (as long as sufficient HugePages are free on the local NUMA
node).
- _CNI:_ Changing the Container Networking Interface is out of scope for
@Levovar commented Mar 26, 2018

If I may chime in from Nokia, representing the often-quoted "NFVI use-cases" :)
Could you elaborate a little bit more on the interworking between the CNI specification, CNI plugins, and the newly introduced NUMA manager component, possibly without relying on the Device Manager component?

To put the question into context: in a performance-sensitive VNF one of the main requirements is to 1) have isolated (what this means exactly is a topic for another discussion), exclusive CPU(s) allocated to a Pod, and 2) have said CPU(s) belong to the same NUMA node as the network devices used by the Pod.
However, network devices can be managed and consumed in a number of different ways, both implicit and explicit. The PFs can be directly pushed to the Pod's netns, VFs can be created on top of PFs and then pushed into the Pod's netns, or the PF can be further divided via other means like a DPDK-capable OVS or VPP managing virtiousers and vhostusers. (VF and PF control planes can also be bound to DPDK kernel drivers, in which case sharing is done via their PCI address; just to complicate the situation even more.)

Most of these management tasks are handled by CNI plugins, namely SRIOV (either huscat's or Intel's DPDK-capable fork), the vhostuser plugin, or any other proprietary CNI implementation. The way the Device Manager proposal shares devices (through their Unix socket) does not necessarily make it compatible with NFVI requirements, which in turn raises the question of how the current NUMA manager proposal can be enhanced to satisfy the above requirements if the Device Manager does not, or cannot, play a role in these use-cases.

So, trying to distill this train of thought into two short questions, or even additional requirements:

  • Dynamic, exclusive CPU allocation needs to be aligned with the configured CNI plugin; can the current NUMA manager proposal handle this UC in the future, and if yes, how?
  • If not, have you considered the possibility of CPU allocation based on a static NUMA request, e.g. "give me X exclusive CPUs from NUMA node A, and Y exclusive CPUs from NUMA node B"? (This mode would enable operators to at least manually align CPUs and CNI-managed NICs in a homogeneous cluster until NUMA-manager <-> CNI communication is implemented)

@@ -0,0 +1,293 @@
# NUMA Manager
@Levovar commented May 2, 2018

Okay, so as a result of our offline discussion I was asked to post the streamlined NFVI requirements to the proposal.
The following RQs should be added to the proposal for it to be able to serve NFVI needs (mainly mobile-network radio VNFs, which need these high-performance optimizations):

RQ1: NUMA manager shall gather hints from (selected) CNI plugins before/during admission.
Use-case: a must-have for DPDK user-space networking Pods, using e.g. the SRIOV, vhostuser, etc. CNIs to set up their networks. Otherwise the application will experience a serious performance drop.
Note that reaching certain performance thresholds is a mandatory, functional requirement for such NFVI applications.
Comment: a possible Device Plugin / Device Manager based "workaround" solution was discussed regarding the SRIOV CNI plugin. This generic proposal should at least outline how this interaction is imagined to be implemented in Kubernetes, showing an exact example (it can be the SRIOV CNI).
Implementation shall consider this requirement as mandatory right from the beginning (in case the community considers running NFVI workloads as high-priority); otherwise NFVI radio applications can't use Kubernetes in production (or the cluster admin needs to implement non-Kubernetes based workarounds for this situation).

RQ2: Hugepages shall be allocated from the same NUMA node as NICs and exclusive CPUs.
Use-case: also a must-have for DPDK user-space networking Pods. Such applications experience a serious performance drop today due to this missing feature (the scenario is real-life tested).
Note that reaching certain performance thresholds is a mandatory, functional requirement for such NFVI applications.
Comment: some discussions happened in the comment section regarding this feature, but it needs to be included in the proposal as a functional requirement. The implementation plan realizing this requirement should be crisply outlined, and follow-up design plans formulated (if needed, for example about the creation of a MemoryManager component).
Implementation shall consider this requirement as mandatory right from the beginning (in case the community considers running NFVI workloads as high-priority); otherwise NFVI radio applications can't use Kubernetes in production (or the cluster admin needs to implement non-Kubernetes based workarounds for this situation).

+1: other offline-identified NFVI requirements regarding the CPU manager (e.g. CPU pooling) and networking (e.g. true multi-interface support) will be posted / discussed separately.

Contributor Author

Added hugepages alignment to graduation criteria for beta.

@RenaudWasTaken

I'm currently looking at this from a Device Plugin Author perspective.

To implement GetTopologyHints() I would expect to be aware of the current state to make a decision about which sockets should be used for this pod.
This is currently an issue because of the "lazy GPU reclaim" strategy: I will not be aware of the latest state until I actually get the Allocate call that this pod will trigger.

What do you think?

@ConnorDoyle
Contributor Author

@RenaudWasTaken perhaps the text needs clarification, but the expectation is that each Device descriptor from the plugin contains a socket ID. The device manager chooses the device to allocate based on this information, together with hints from other hint providers. Only after that does the allocate call occur. Does that work for you, or are you "faking" the device IDs in order to delay the choice of actual device to bind in your plugin?

@RenaudWasTaken commented Oct 6, 2018

So I'll try to illustrate my thinking with a theoretical example, imagine a two NUMA node machine with one GPU on each node.

Story #1: The node receives two pods requesting one GPU each.

For the first request, I expect the device plugin to return both sockets as a possible option. For the second request I expect the device plugin to return only one socket.

Story #2: The node receives one pod requesting one GPU; that pod runs and terminates normally, and the node then receives another pod requesting one GPU.

For the first request, I expect the device plugin to return both sockets as a possible option. For the second request I also expect the device plugin to return both sockets as a possible option.

This is problematic today because of the following issues:

  • The plugin is not aware of the socket chosen, this one is easy to fix, we can send the information
  • The plugin is not aware of when a pod terminates; this is done "lazily": when a new pod comes in, the device manager looks at the active pods and reclaims GPUs used by pods which aren't active anymore.

Do these examples make sense to you @ConnorDoyle? Or am I misunderstanding the design doc?

@ConnorDoyle
Contributor Author

Hi @RenaudWasTaken -- Based on the question, it looks like there might be a misunderstanding. The device plugin itself does not return hints. The Device Manager (internal to the Kubelet) implements the hint provider interface. It supplies info based on the set of unallocated devices and their declared affinity. So, the topology negotiation occurs before the plugin receives any allocate callback. I can update the text to make it clearer, and also outline more explicitly how the device plugin API would be affected. In short, the Device proto gets a field like TopologyInfo (mod naming) that initially contains just one member: the socket ID.

From there, your scenario number 2 is supported because although the device plugin does not yet know that the pod is terminal and the device is available for a subsequent allocation, the Kubelet does know this and specifically the DeviceManager can accurately respond to the "get hints" callback from the TopologyManager during pod admission.
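
For illustration, a rough Go sketch of what the extended Device structure could look like; the TopologyInfo name comes from the comment above, and the exact field layout is an assumption, not the final device plugin API:

```go
package deviceplugin // hypothetical package, for illustration only

// TopologyInfo carries the topology data a plugin declares per device.
// Initially it would hold just the socket / NUMA node ID, which Linux
// exposes for PCI devices under /sys/devices/pci*/*/numa_node.
type TopologyInfo struct {
	SocketID int64
}

// Device mirrors the device plugin API's Device message, extended with the
// proposed topology field so the Device Manager can compute NUMA hints from
// the set of unallocated devices.
type Device struct {
	ID       string
	Health   string
	Topology *TopologyInfo
}
```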

- Added device plugin protocol diff.
@ConnorDoyle
Contributor Author

/cc @derekwaynecarr

@derekwaynecarr
Member

Per sig-node discussion, I think this document is implementable and we can do any other adjustments as part of that process.

/approve
/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Jan 8, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorDoyle, derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* Unit test coverage.
* CPU Manager allocation policy takes topology hints into account.
* Device plugin interface includes socket ID.
* Device Manager allocation policy takes topology hints into account.
Contributor

We need some way to introspect decisions, maybe via structured logs?

@vishh (Contributor) commented Feb 4, 2019

@ConnorDoyle @nolancon @lmdaly
It would be beneficial not to have to require choosing CPU cores while choosing a device. Instead, if we only had to choose a CPU socket by default (and therefore a memory bank) when allocating external devices, that would ease fragmentation. Scheduling across CPU cores x memory banks x device availability sounds like a high-risk recipe that would lead to fragmentation unless the workloads are shaped to fit the underlying machine.

Also, it would be great to have a list of the common system designs that were considered to inform the algorithm being implemented for the topology manager, which isn't described in much detail in the proposal.

If the hints are not compatible, the Topology Manager may choose to
reject the pod. Behavior in this case depends on a new Kubelet configuration
value to choose the topology policy. The Topology Manager supports two
modes: `strict` and `preferred` (default). In `strict` mode, the pod is
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a Kubelet configuration value, I think it would make more sense to have the pod determine whether it wants to either have "strict" NUMA-alignment or fail to schedule, or whether it prefers to prioritize being able to run at all over ensuring optimal performance.

After all, it's the workload that determines how performance-sensitive it is.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/design: Categorizes issue or PR as related to design.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.