KEP-3698: Multi-Network #3700
Conversation
mskrocki commented Dec 21, 2022 (edited)
- One-line PR description: Multi-Network
- Issue link: Multi-Network #3698
a network
* Identify what “Object” is the Primary network
* Optional parameters: MAC address, IP address, speed, MTU, interface name etc.
14. Every Pod connected to specific network (represented by the “Object”) must
Is this a necessary limitation for all "Objects"? Perhaps a network is meant only for north-south traffic. I wonder if this could be left to be defined by the implementation, similarly to the next point.
I think that would be too open. Today's K8s network has exactly this requirement; this way we standardize across all the networks.
Do you have a use case for such a situation?
the “Object”)
15. Pods attached to a network are connected to each other in a manner defined
by the “Object” implementation
16. Basic network Interface information for each attachment will be exposed to
Is there a comparable downward API today? If not, I wonder if this could be deferred to a later phase. I'm concerned about implications of hot-plug or dynamic IPAM on this. It would also make this first phase a little leaner.
I don't think there is one. Fair point on maybe moving it to a later phase. What is your concern with regard to hot-plug or dynamic IPAM?
16. Basic network Interface information for each attachment will be exposed to
runtime Pod (via e.g. environment variables, downward API etc.)

#### Phase II (scheduler, kubelet and API probing)
Would QoS and resource handling fit here? Since this phase is concerned with the scheduler, maybe it should cover resource management too - if network were in the same category as CPU or memory, that would suggest using the same scheduling/pooling mechanism.
I think QoS and "network-resourcing" as a whole would be a new requirement. I would consider this a future extension.
Shouldn't this be "implementation specific"? aka something your CNI plugin(s) deal with?
Thanks for this proposal @mskrocki.
during cluster creation that is available to the Pod when no additional
networking configuration is provided.
* **Primary network** - This is the network inside the Pod whose interface is
used for the default gateway.
In the meetings I've been to, people keep getting confused about these two. It certainly doesn't help that the definition of "primary network" pretty much has to use the word "default".
You could call the first one the "Cluster-wide network"? Or "Traditional Kubernetes network"?
The additional networks can be "cluster-wide" and/or "traditional Kubernetes networks". Naming is always tricky; do you have other recommendations?
I personally think you cannot avoid the "default" term in both scenarios due to the historical semantics of networking. The "default network" could be "default instance" or another term based on the naming of the object.
The "primary network" could be referred to as "primary attachment", where the logic is to specify to which object the Pod's "primary" interface must be attached.
18. Kubelet network-based probing is optional for the “Object” connections to
Pod
Optional for who?
I think what you mean is "if a Pod is not attached to the default cluster-wide network, then it is not defined whether network-based probes to that pod will work or not".
refactored.
19. “Object” connections to Pod are optionally able to connect to Kubernetes
API - the Pods connections via non-default Pod network does not require access
to Kubernetes API
What are '"Object" connections to Pod'? I'm imagining an object in etcd calling connect() somehow...
I guess you mean "Pods are not guaranteed to be able to reach the Kubernetes API via networks other than the cluster-default network" ?
refactored.
20. Kubernetes API can optionally reach to “Object” connections to
Pod - Kubernetes API access to the Pod via non-default Pod network is not
required
"The Kubernetes apiserver is not guaranteed to be able to reach pods via networks other than the cluster-default network"
refactored.
20. Kubernetes API can optionally reach to “Object” connections to
Pod - Kubernetes API access to the Pod via non-default Pod network is not
required
Phase II doesn't make sense as a separate phase. Each of the requirements 17-20 says "you don't have to do X". If Phase II is something above and beyond Phase I, then does that imply that in Phase I, you do have to do all those things? Eg, in Phase I, all networks must be available on all nodes, network probes have to work on all networks, etc, and then in Phase II we make it all optional?
I think Phase II needs to just be squashed into Phase I.
Unless everything in Phase II is kind of the opposite of what it says, and you really mean "in Phase I, network probes definitely can't be used on non-default networks but then in Phase II we will define an optional way to make it work". In which case, that needs to be more clear.
It is the latter, where we would add support for such capabilities. I have reworded the requirements to better reflect this.
21. “Object” connections to Pod are optionally able to provide Service,
NetworkPolicies functionality
"Connections to Pods over secondary networks" or whatever. '"Object" connections to Pod' is just weird.
refactored.
12. The Pod reference to a Networks is optional and when NOT specified, Pod
connects to “Default” “Object” (network the cluster has been created with)
Based on the discussion, #3700 (review), a pod may NOT connect to the "Default" network and instead connect to the "namespace-default" Network.
This may change current behavior, and it violates the first requirement, "1. This effort shall not change the behavior of Today existing clusters", because an existing cluster connects all pods to the cluster network (I believe that is the "Default" network in your KEP).
So could you please refine them to be consistent? (Change requirement 12 and the "namespace-default" network, or just remove requirement 1.)
How the override of the namespace-default network will work will be defined in the future. This requirement is about the default behavior, to explicitly satisfy requirement 1.
This all seems a bit too abstract. KEPs traditionally propose either some Kubernetes functionality or an API object. It reads like the charter of a working group rather than a KEP. Is there a way this can be closer to a specific proposal? I'd like to see the API types :-).
When the dust settles we need to be able to have a k8s pod do the equivalent of what docker/podman can do today. (Apologies for those on the calls who already heard all this.) For example, using podman, since it supports the same CNI as k8s, I can attach a container to several CNI networks at once, each defined by its own .conflist.
This simple example puts 4 interfaces in the container. What I want to do with them doesn't need to be a concern of the apiserver. In each case I pick plugins, and the .conflist configures them to solve my problem. What I use and how they are configured are implementation specific. As discussed, we will need a way to pass plugin-specific info (static IP, static MAC etc.) to each network/.conflist. Most/all use cases I've seen in the doc can be demonstrated using docker/podman.
for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way).
Usually, when you dig into this, what you discover is that you have a workload that requires some sort of special treatment from the physical network and a fast path to it.
The problem is that this requirement is entirely mis-focused. Literally nobody wants an SRIOV VF. What they want is a fast path to the physical network doing a specific set of things for them.
You can tell this because:
- If you offer someone a VF that leads to the physical network mishandling the packets pushed into it, they will be very unhappy.
- If you can't offer them an SRIOV VF because your hardware only supports SIOV (the next technology coming after SRIOV), they will be very unhappy (even if you could have given them a vfio for the SIOV that is connected to the proper physical network treatment).
- If you offer them vfio for the VF when they needed a kernel interface, they will be unhappy.
All of this becomes even more crucial when you are scheduling the Pod. If I schedule the Pod to a Node that just has SRIOV VFs available, but its NICs lack the required capabilities, or cannot provide a fast path to the required sort of treatment, this all breaks down.
There are important sets of requirements in here... but I would recommend thinking them through outside of the 'interface' 'network' paradigm being applied here to get to the real root of the need.
I'd suggest rethinking the requirement in terms of:
- What Network Service does the user want from the physical network?
- How do they want to consume it? Via what mechanism? (kernel interface, rdma, vfio, etc)
- What capabilities do they need from the fast path connection to that Network Service?
#### Story #3
I have implemented my Kubernetes cluster networking using a virtual switch. In
this implementation I am capable of creating isolated Networks. I need a means
to express which Network my workloads connect to.
@danwinship Agreed. This requirement feels like it would be covered by a properly written 'traffic segregation' requirement.
As a Virtual Machine-based compute platform provider running on top of
Kubernetes and KubeVirt, I require multi-tenancy. The isolation has to be
achieved on Layer-2 for security reasons.
Why does this have to be achieved at L2 for security reasons? That's an oddly specific requirement. What security reasons?
This is particularly worrisome as K8s carefully avoids having any L2 concepts in it. It would be unfortunate to re-introduce them for such sparse requirements.
This use case is specific on purpose (to address some of the above comments about use cases being too vague). This is NOT about adding L2/L3 notions into k8s; it's about handling such a use case. Check out this section's top description: any technology or product mentioned here does NOT mean it will be supported, and this is reflected in the requirements below, none of which talks about supporting a specific layer or product.
As a platform operator I need to connect my on-premise networks to my workload
Pods. I need to have the ability to represent these networks in my Kubernetes
cluster in such a way that I can easily use them in my workloads.
Could we get more clarity on this case? There are lots of ways of achieving this already using BGP in many existing CNI plugins. Is this perhaps another traffic segregation case?
Challenge: most existing CNI plugins do it via specific CRDs which do not necessarily coexist well. The purpose is to offer a solution where K8s could support such implementations without specific CRDs.
Doing so, the challenge of supporting workloads and avoiding the "platform certification" mess might get/should be simpler.
“profiles” for Namespace, where I can not only change default Network,
but define a set of Networks assigned to given Namespace that Pods created in
that NS are automatically attached to.
OK... so there's a desire here to isolate workloads (traffic segregation), this time by namespace. It sounds like in this case there's a desire to have a segregated primary network per namespace, but it's unclear to me from the requirement as written why this would imply a 'set of Networks'.
It feels like this may be smashing two requirements together:
- A desire to have per-namespace traffic segmentation for primary networks. If so, this opens interesting questions about how that interacts with the rest of K8s networking.
- A desire to add a set of networks by namespace to a Pod... but it's unclear why in this requirement.
2. We need to introduce an “Object” that represents the existing
infrastructure’s networking
Question: Why? Why do we need to introduce an Object here? This feels more like the 'solution' side of things than the 'requirement' side of things.
As engineers, we all have a serious tendency to decide on a solution and then decide that solution is a requirement. We all do it (I more than most). This screams that here. Let's get the real requirements underpinning this proposed solution.
4. The “Object” shall not define any implementation specific parameters in that
object
I am also confused.
5. “Object” shall provide option to define:
* IPAM mode: external/internal
* List of route prefixes - optional and not forced on the implementations
In addition to @danwinship's points here (which I support)... how are we handling or preventing IPAM that conflicts with the Cluster/Service CIDRs, and routes that hijack Cluster/Service CIDRs?
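For readers trying to picture requirement 5, here is a minimal Go sketch of how an IPAM-mode field and an optional route-prefix list might look on the proposed object. The type and field names are illustrative assumptions, not the KEP's API, and the CIDR-conflict question raised above would still need an answer somewhere in validation.

```go
// Illustrative only: a hypothetical shape for requirement 5 (IPAM mode plus an
// optional route-prefix list) on the proposed network object.
package sketch

// IPAMMode selects who owns address assignment for the network.
type IPAMMode string

const (
	// IPAMModeInternal means the cluster (CNI/IPAM plugin) assigns addresses.
	IPAMModeInternal IPAMMode = "Internal"
	// IPAMModeExternal means addresses are managed outside the cluster
	// (e.g. by the physical network or an external IPAM system).
	IPAMModeExternal IPAMMode = "External"
)

// NetworkSpec shows only the two fields requirement 5 talks about.
type NetworkSpec struct {
	// IPAMMode: external or internal.
	IPAMMode IPAMMode `json:"ipamMode"`

	// Routes is an optional list of route prefixes (CIDRs) that an
	// implementation may program into attached Pods; implementations are
	// free to ignore it. Checking these against the cluster Pod/Service
	// CIDRs (the conflict raised in this thread) is left open here.
	Routes []string `json:"routes,omitempty"`
}
```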
5. “Object” shall provide option to define:
* IPAM mode: external/internal
* List of route prefixes - optional and not forced on the implementations
One other point to make: by putting this 'Object' in the K8s API Server, it is intrinsically moved out of the world of the people actually administering the network and into the world of the people administering the cluster. Generally speaking, in my experience this will work some of the time, break horribly some of the time, and work with a lot of pain for all involved some of the time, depending on the interactions between the K8s administrators and the folks administering the physical network involved.
Having the flexibility to move this point of control around is going to be crucial, and that argues against requiring it to live in the K8s API server.
/cc
for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way).
I agree with Dan that this is a particularly compelling user story. As for SRIOV VF I get that this was used as an example and so focusing in on it specifically might be distracting, but I also think it might be worth taking a couple seconds to add a couple more examples just to moot the point.
2. We need an API object representing network, which a Pod, a Service or any
other feature that wishes to support Multi-Network capability could reference
Still I am not quite clear why we need this object, as @edwarnicke adds in his comment.
Could you please elaborate to clarify:
- What does the 'object' achieve for Kubernetes users?
- Why is the 'object' required for users?
- Without the 'object', what is not achievable?
Based on the current Kubernetes community situation, multus, network service mesh and other network components outside of Kubernetes core do achieve multi-network without core network objects.
Today's implementations create CRDs that have to be referenced via annotations or other indirect means. Annotations are not meant to define the configuration of an object; additionally, there is no validation of the user input.
On top of that, every implementation has its own CRDs and way of configuration, so there is no one standard.
> Today's implementations create CRDs that have to be referenced via annotations or other indirect means.

Some network implementations use CRDs, but not all. For example, flannel does not use CRDs. In addition, each implementation has several CRDs to specify a network, but that depends on the implementation. Some implementations use a single CRD for 'network', while others use CRDs for 'network IPAM' and 'network transport (i.e. SDN configuration)'.
Your design assumes that one network object is required for everyone, for each implementation, but actually it is not, I guess. So please clarify and give a reason why we need the object. Currently the reason seems to be ambiguous.
The main reason is to have a handle/representation of the "network" a Pod connects/attaches to. This does NOT exist today. This is mentioned in the Summary and Motivation; maybe it is not sufficient. I will add more there.
### Goals
Define user stories and requirements for the Multi-Network effort in Kubernetes.
At today's multi-network sync (May 10th, 2023 - notes available here) there was a conversation involving several members of the community regarding the desire for this KEP to better define standards for additional parameters of a network interface for a Pod (perhaps particularly to help support DRA and QoS configurations), which may be implementation-specific, or perhaps generic enough for all implementations, as an explicit goal of this KEP.
Personally I agree with the high level goal, but I think we should avoid trying to put new things in core if possible for a variety of long term maintenance reasons. There are other takes on this which I invite conversation to 🧵 here about.
Ultimately, I would like to see that conversation continued and resolved before we merge this KEP and the goals here adjusted accordingly.
Thanks @shaneutt. I need to read this KEP with thought, but ultimately I think that DRA (if any) would be in line with the requirements. The QoS resources KEP (#3004) is something that could possibly be used to cover some current usage scenarios involving e.g. annotations. But it's good to understand what it is and how it's different.
cc @pohly
My thoughts are that there are two places where we might have arbitrary config.
- At the network level. For example, parameters specifying what VLAN ID this network has, or what IP address ranges it is allowed to use.
- At the network interface level. For example, parameters specifying the QoS for a particular interface, or the allowed bandwidth, or the required failure domain to allow redundancy.
These are all going to be implementation-dependent, and we shouldn't put them into core Kubernetes. The easiest way to do this is to have a CR with whatever implementation-specific horrors are required in it, and allow the standard K8s objects to have a reference to that CR (with it being entirely up to the implementation what to expect in the CR and what to do with it).
There is already a reference to a CR in the PodNetwork resource. My takeaway from the discussion was that the PodNetworkAttachment resource probably needs exactly the same capability.
I'm open to some common fields making their way into the PodNetworkAttachment (say MAC address or even IP as well as interface name or similar), but I can see that turning into something pretty ugly pretty fast if we aren't careful, so shoving as much as possible into a CR seems safer. I'm also open to this being "something we add in a later phase, but agree we expect to do later". Finally, I think the alternative is some kind of annotation model, where we optionally list interfaces and references to their config in annotations in the pod, which is plausible; a bit ugly but also less intrusive to core K8s.
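To make the shape of that suggestion concrete, here is a hedged Go sketch of the kind of reference being discussed, loosely modeled on Gateway API's ParametersReference. The field and type names are assumptions rather than the KEP's actual types.

```go
// Illustrative only: a rough sketch of a "reference to an implementation-
// specific CR" carried by both the network-level and attachment-level objects.
package sketch

// ParametersRef points at an arbitrary custom resource that holds
// implementation-specific configuration. Core Kubernetes never interprets
// the referenced object; only the provider does.
type ParametersRef struct {
	Group     string `json:"group"`
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

// PodNetworkAttachmentSpec sketches the idea from this thread: the
// attachment-level object could carry the same kind of reference as the
// network-level object, so per-interface knobs (QoS, bandwidth, failure
// domain, ...) stay out of core types.
type PodNetworkAttachmentSpec struct {
	// PodNetworkName names the PodNetwork this attachment refines.
	PodNetworkName string `json:"podNetworkName"`

	// ParametersRef optionally points at an implementation-specific CR.
	ParametersRef *ParametersRef `json:"parametersRef,omitempty"`
}
```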
As this is a requirements KEP, we should try to avoid blocking it on other KEPs unless we think it is fundamentally a non-starter for the community. The cost paid for work on refining requirements is community time rather than long-term tech debt.
* **Cluster Default PodNetwork** - This is the initial cluster-wide PodNetwork
provided during cluster creation that is available to the Pod when no
additional networking configuration is provided in Pod spec.
* **Primary PodNetwork** - This is the PodNetwork inside the Pod whose interface
We should be specific about the cardinality relationship between PodNetwork and interfaces. Is it always 1-1? If not, we will need to be very careful whenever one is mentioned vs another.
I assume we will probably end up with either a scope field in the API or different APIs (Kinds) for each of these use cases? Wondering if the following nomenclature pattern is more intuitive:
- Cluster Default PodNetwork
- Namespace Default PodNetwork (for future expansion; note "Default" is not the name of the namespace, this is the default network associated with a namespace)
- Default PodNetwork (instead of primary network; this is the one assigned to the pod, either by policy or by the user picking one)
@bowei The cardinality relationship is going to be defined in Phase I. I don't want to mention that in this phase.
@pvenu-Google using "Default PodNetwork" when we already have "Cluster Default PodNetwork" will not be distinguishable; most folks will keep skipping the "Cluster" word, which will lead to confusion. And we want to keep the "Default" PodNetwork for what it is now, to align with the "default" namespace.
Yeah, I think the big picture here is correct.
interfaces
* applications leveraging performance-oriented interfaces (e.g. `AF_XDP`,
`memif`, `SR-IOV`)
* applications requiring support for protocols not yet supported by Kubernetes
I'm not sure what this means... If you only care about supporting the protocol on the primary network interface then isn't it just a matter of selecting a primary network plugin that supports that protocol? (I'm not sure what level of "protocol" you're referring to, but SCTP and multicast are examples here: k8s conformance doesn't require either of those, and many plugins don't allow them, but some do.)
I do mean networking protocols, e.g. multicast as you mentioned. Agree that you could change the primary CNI and support any protocol, but what if I don't want to touch my default Network configuration but still add support for the new protocol? With MN I can just create a new PodNetwork that uses the additional/new CNI, and only for those Pods that need it.
* Introduce API object to kubernetes describing networks Pod can attach to.
* Evolve current Kubernetes networking model to support multiple networks,
defining the new model in a backwards compatible way.
I think there's increasing consensus that we ought to have a better network model, and objects that describe networks, even in the single-network case (eg "KNI").
I'm not sure what the right way to talk about this in the KEP is, since KNI is even more pre-alpha than Multi-Network. But I think it's important to be thinking of this as not just about multiple networks, but also about having a better (and more k8s-native) model of pod networking in general, which in particular supports multiple networks.
### Non-Goals
Define the CNI implementation.
Not exactly sure what you mean here (even after finishing reading the KEP).
I meant to say here that this KEP does not specify how to implement the CNI, but just provides an API abstraction/handler.
for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way). This interface might not
Not clear if you mean "k8s will implicitly use a devicePlugin or something to handle scheduling" or "the user will need to explicitly think about devicePlugins to get scheduling to work right".
(The latter is how it works in Multus; when using SR-IOV, it's not enough to just request the SR-IOV NetworkAttachmentDefinition, you have to also request an SR-IOV resource. But this is annoying and makes SR-IOV NADs work differently from other kinds of NADs (eg, ipvlan-based). Of course, Multus can't make scheduling happen automatically because the scheduler doesn't know anything about Multus, but a k8s-native API ought to be able to do better here.)
This exact functionality is described in DRA https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit?disco=AAAA9O0je4I . I don't think we should duplicate functionality; we need to understand what is missing in DRA (which is being revisited these days) and suggest the necessary changes if we can reuse that mechanism for this.
// Examples: eth1 or net1
//
// +optional
InterfaceName string `json:"interfaceName,omitempty"`
Why does it make sense to specify this here rather than as a ParametersRef? Particularly given that it's not even meaningful for some network types (e.g., the bandwidth user story).
This is to ensure the name is unique across the list of Networks specified. Otherwise there is no control, especially if I attach 2 PodNetworks from 2 different vendors.
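A small Go sketch of the uniqueness rule described in that reply, using a hypothetical NetworkRef stand-in for one entry of the Pod's networks list; only the "no duplicate interfaceName" check is the point.

```go
// Illustrative only: reject a Pod spec in which two attachments (possibly
// from two different vendors) ask for the same interface name.
package sketch

import "fmt"

// NetworkRef is a stand-in for one entry in a Pod's networks list.
type NetworkRef struct {
	PodNetworkName string
	InterfaceName  string // e.g. "eth1" or "net1"
}

// validateInterfaceNames enforces uniqueness of explicitly requested
// interface names across all attachments of a single Pod.
func validateInterfaceNames(refs []NetworkRef) error {
	seen := map[string]string{}
	for _, r := range refs {
		if r.InterfaceName == "" {
			continue // implementation picks a name
		}
		if other, dup := seen[r.InterfaceName]; dup {
			return fmt.Errorf("interfaceName %q requested by both %q and %q",
				r.InterfaceName, other, r.PodNetworkName)
		}
		seen[r.InterfaceName] = r.PodNetworkName
	}
	return nil
}
```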
// for it.
//
// +optional
IsDefaultGW4 bool `json:"isDefaultGW4,omitempty"`
Likewise, why does it make sense to specify this here rather than as a ParametersRef? It seems unlikely that for a given PodNetwork, some pods would want to have it as their default route and others would not. (Though of course, if that were the case, they could just use different PodNetworkAttachments.)
In particular, having separate IPv4 and IPv6 default routes is a really niche use case.
(As with IPAM, I feel like the eventual solution here is a standard parameter type.)
Same as with interfaceName: how do I coordinate 2 PodNetworks that both indicate IsDefault?
Agree on the v4/v6 split; let me remove that and, if need be, we can make it more complex in the future.
We have considered using classes (similar to GatewayClass etc.), but we do not
expect any extra action for a PodNetwork to take place for specific implementation.
PodNetwork already points to a custom resource, which implementation can hook on
for any specific extra configuration. At this point of time we consider using
classes as overkill, and in future if such need arises, we can reuse the provider
field for that purpose.
So, I think your model of how this works is that there's a single component in the cluster responsible for all PodNetworks.
But another model would be to allow multiple PodNetwork-managing components in the cluster, each supporting different functionality, and managing only the PodNetworks of their own PodNetworkClass. For example, there might be separate PodNetworkClasses implementing each of the User Stories (though there could also be "heavyweight" multi-use-case PodNetworkClasses like the implied universal PodNetworkClass currently envisioned by the KEP).
I think the PodNetworkClass model would lead to more reused code and more common functionality across cluster types. Given the model of the current KEP, it's possible (indeed, likely) that, say, GKE and OpenShift could both provide ways of implementing all of the User Stories, but those ways would be entirely incompatible, involving different Provider names and ParametersRefs of completely different CRD types. Whereas with the PodNetworkClass model, GKE and OpenShift might each have some PodNetworkClasses of their own, but could also make use of "generic" ones (possibly ones written by users themselves).
Multiple implementations in a cluster are supposed to be handled by the provider field, assuming the implementations support that field, and this makes interoperability between implementations dependent on each other.
With a class this is enforced if we make that field mandatory. Will bring this up in the next sync meeting and talk it through.
#### Story #8
As Cluster operator I wish to manage my cluster’s bandwidth usage on a per-pod
basis, simultaneously preserving all other Pod networking configuration the same
across those Pods.
The more I think about this, the less it seems to fit with this KEP.
Or at least, it doesn't fit as a PodNetwork. This seems more like a PodNetworkAttachment that points to the default
PodNetwork but also includes a ParametersRef pointing to some well-known parameter type containing bandwidth information...
(And maybe that's actually what you meant? I guess it's not clear from the story itself...)
@danwinship you have references to PodNetworkAttachment in different places, and if by that you mean the representation on a Pod of a network interface https://docs.kernel.org/networking/netdevices.html, then this is the part I agree 100% we need to solve first.
And this is what DRA is actually solving (#3063); it seems that the API is still being revisited and, based on my current reading, it can accommodate perfectly the use cases of PodNetworkAttachment.
Once we have these capabilities we can use them to build the other high-level abstractions required in this KEP.
rule of: single Pod references a given PodNetwork only 1 time

#### Auto-population
When networks field is not set, and hostNetwork is not set, we will
what happens with pods with hostNetwork: true? to which network do they belong?
None, MN does not apply to HN pods.
So how does this work then? If I have 3 pods, podA in networkA, podB in networkB and podC in host network, which Pods are able to communicate with podC?
Because today there is only 1 network, the root ns and Pod ns are treated as one. We need to still assume that, and we call this the "default" Network; pods connected to that Network have to be backward compatible. Any other network's behavior is up to the implementation. For some, isolation is a factor; for some it is not. We cannot enforce that.
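A tiny Go sketch of the auto-population rule as described in this thread (no networks listed and hostNetwork unset means the cluster-default PodNetwork is attached). Pod.Spec.Networks does not exist today; the field shape and the "default" name are assumptions.

```go
// Illustrative only: the defaulting behavior sketched in the Auto-population
// section, over a stand-in Pod spec.
package sketch

type PodSpecSketch struct {
	HostNetwork bool
	Networks    []string // names of PodNetworks the Pod asked for
}

// defaultNetworks attaches the cluster-default PodNetwork only when the user
// specified nothing and the Pod is not a host-network Pod (multi-network is
// described above as not applying to hostNetwork pods).
func defaultNetworks(spec *PodSpecSketch) {
	if spec.HostNetwork || len(spec.Networks) > 0 {
		return
	}
	spec.Networks = []string{"default"}
}
```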
The above status is expected to be populated by kubelet, but this can only happen
after CRI provides support for the new Pod API. Because of that, initially
kubelet will behave as it does today, without updating the additional fields.
Until CRI catches up, the PodNetwork providers will be able to update that field
We do not merge split functionality; CRI API changes have to be done at the same time.
CRI changes have to come with container runtime changes. Considering all the functions of CRI take a Pod as an argument, the required changes are there from my point of view.
> Considering all the functions of CRI take a Pod as an argument, the required changes are there from my point of view.

Making changes to the Pod API does not mean runtimes will adopt these changes.
And that is fine; I have an agent-based CNI that does not care about them.
Considering all above changes are in the direct scope of CRI, this KEP will not
propose complete changes for them, and a separate KEP will be created to cover it.
Below are some suggestions on what the changes could look like.
If those changes are required for this KEP they must be in this KEP, or are you suggesting we merge some changes first and the others later?
Will the changes that are merged work without the other changes? We could end up merging something that is not functional.
Yes, I would see that we introduce this API as a first step. Then container runtimes catch up and adjust to the new fields Pod has. Otherwise the change just grows too large. Lastly, I can solve this today with an agent-based CNI that does not rely on CRI, until they catch up.
> Yes, I would see that we introduce this API as a first step. Then container runtimes catch up and adjust to the new fields Pod has.

I think you need to provide more details on this. How long do you estimate it will take container runtimes to catch up on these features?
IDK; considering the KNI proposal is on the table as well, maybe then it will be independent. My point is, I do not want to create a dependency on container runtimes.
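To give the deferred CRI discussion some shape, here is a purely hypothetical Go sketch of the kind of field a follow-up KEP might add to the sandbox configuration. Neither PodNetworkConfig nor a Networks field exists in cri-api today, and the real API is defined in protobuf rather than plain Go.

```go
// Illustrative only: a stand-in for a possible CRI-side extension point,
// written as plain Go instead of the real generated protobuf types.
package sketch

// PodNetworkConfig is a hypothetical per-attachment blob the kubelet could
// pass to the runtime when creating the sandbox.
type PodNetworkConfig struct {
	Name          string // PodNetwork name
	InterfaceName string // requested interface name, may be empty
}

// PodSandboxConfigSketch shows only the hypothetical extension point; the
// real PodSandboxConfig carries many more fields.
type PodSandboxConfigSketch struct {
	Networks []*PodNetworkConfig
}
```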
### Endpointslice controller changes
Considering changes to Pod.PodStatus.PodIPs list, we must ensure that the
controller is using the correct IPs when creating the endpoints. We will ensure
that only IPs for the "default" PodNetwork will be used.
This requires a much deeper explanation as this is the core functionality of Services and, by extension, Ingress and the Gateway API. Services are namespaced, so it seems that there can be Services that have Pods in different networks.
- How is the EndpointSlice controller going to handle these Services?
- If there can be overlapping IPs, and some of the Pods in the service belong to different networks with overlapping IPs, how will this be handled?
- Where is the ClusterIP going to be present, in all networks or only in one? Which one?
- For headless services, how will DNS be used in each network? How will the resolv.conf of the Pods look, especially on multi-homed pods?
- Webhooks use Services and have to be reachable from the apiserver; if overlapping IPs are allowed, how can you guarantee that the apiserver is able to get to all the pods in all networks? Must it be modified to be network-aware?
- Same for the kubernetes.default service: it has to be reachable from all networks, so how will the endpoint reconciler in the apiserver choose the endpoints for all networks?
This KEP does not introduce support for Services. Based on @danwinship's feedback, we cannot reuse the PodIPs field in status anyway, and have to add a new field, which will not affect this controller; I have removed this section.
that only IPs for the "default" PodNetwork will be used.

### Kubelet changes
We will introduce an additional check for kubelet readiness for networking.
You also need to handle resolv.conf and all the DNS configuration in the kubelet.
Is this the resolv.conf of the node or the one that Pods get? I assume the Node's is handled by the platform, and DNS for Pods is not in scope for this Phase.
Lastly, based on @danwinship's feedback above, I realized that having the scheduler guard Pods from being sent to a Node before the required PodNetwork is present covers this functionality; I removed this section.
### Scheduler changes
These are the changes we will do in Pod scheduler:
* Provide active Pod spec validation
Can you expand on what those active pod spec validations are?
Described above; will provide a link in the text.
These are the changes we will do in Pod scheduler:
* Provide active Pod spec validation

When one of the multi-network validations fails, the scheduler will follow the current
This will require sig-scheduling approval and also more detail, especially about the possible implications on performance and autoscaling.
cc: @alculquicondor
Where does this validation happen?
Are you adding any new scheduling plugins?
TBH I have not investigated this deeper. This is just a description of what I would like to achieve on the scheduler side. If I can just do this with a plugin, then yes. Will follow up.
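A simplified Go sketch of the scheduler-side check being discussed, written as a Filter-style predicate over stand-in types. A real implementation would presumably be a scheduler framework plugin; nothing here is the KEP's actual design.

```go
// Illustrative only: keep a Pod off Nodes that cannot realize all of its
// requested PodNetworks.
package sketch

type nodeInfo struct {
	// PodNetworks available (programmed/ready) on this node.
	PodNetworks map[string]bool
}

type podInfo struct {
	// PodNetworks the pod references, after defaulting.
	PodNetworks []string
}

// filterNode returns false when the node is missing one of the PodNetworks the
// pod needs, so the pod is not sent to a node where its attachments cannot be
// set up (this also covers the kubelet readiness concern mentioned earlier).
func filterNode(pod podInfo, node nodeInfo) bool {
	for _, n := range pod.PodNetworks {
		if !node.PodNetworks[n] {
			return false
		}
	}
	return true
}
```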
#### Pod Creation
The Pod creation is handled by the SyncPod function, which calls the RunPodSandbox
([code](https://github.com/kubernetes/cri-api/blob/release-1.28/pkg/apis/runtime/v1/api.proto#L40))
CRI API. The parameters for that function are defined by PodSandboxConfig ([code](https://github.com/kubernetes/cri-api/blob/release-1.28/pkg/apis/runtime/v1/api.pb.go#L1343)).

You need approval from sig-node @SergeyKanzhelev, and from the container runtimes, at least containerd and CRI-O.
My summary:
For example, one user story I heard for using Pods with an additional external interface was to implement BGP against an upstream router for fast failover. When I asked why they would not use probes, they said it was because probes were slow, but they would prefer to use probes if those could operate in sub-seconds, so that can be solved with #3066. Another request that we never solved is "Support port ranges or whole IPs in services" #23864, to be able to assign IPs to Pods; today people have to add an external interface to the Pod for that.
Namespaces are not the only unit of logical partitioning in Kubernetes. For example, Pods are also partitioned by Node (
at the Service level, not at the pod level
N/A

## Alternatives
Some references to projects that implement a network-oriented architecture in kubernetes and/or with containers; we should also understand the lessons learned from them and, if these projects provide the same functionality, why they are not being used today. The alternatives section should consider them:
https://github.com/openshift/kuryr-kubernetes
https://docs.docker.com/network/
https://kubevirt.io/user-guide/virtual_machines/interfaces_and_networks/
https://events19.linuxfoundation.cn/wp-content/uploads/2017/11/Multiple-Networks-and-Isolation-in-Kubernetes-_Michael-Xie.pdf
Those are specific implementations; I'm not sure how these are alternatives? Per @danwinship's feedback I moved the DRA discussion here.
If those specific implementations already solve this problem and are available and working, especially the kubevirt one, we need to understand why we discard them. ... DRA is a piece of what you are proposing; you are proposing a full multi-network architecture to support kubernetes as an IaaS ... kubevirt solves these problems and already works; we need to understand why kubevirt is not a valid option and why we need to make it in kubernetes.
I think you are missing the main point of this KEP: we do not propose a solution for an MN implementation but a common API all these implementations you listed could use. All of these have their own CRDs and ways to reference them in the Pod; this KEP is trying to standardize the reference aspect ONLY.
* For specific PodNetwork a Pod can either reference that PodNetwork or
PodNetworkAttachment (referencing that PodNetwork), but not both at the same time
* PodNetwork can be referenced by multiple Pods
* PodNetworkAttachment can be referenced by multiple Pods
In multi-tenancy situations, per-Pod configuration is usually different, so the PodNetworkAttachment data may be 1:1 with the number of Pods. This will cause a lot of overhead.
If PodNetworkAttachment is a single-Pod-scoped configuration, there is no need for it to be referenced by multiple Pods. If the network definitions are different, multiple PodNetworks are used to plan the network plane.
This is to provide flexibility to the implementations. One can support 1:1, the other 1:n.
We cannot regress in scalability and performance. @l1b0k's question and comment is spot on; we need to understand the impact of this overhead. We cannot sacrifice performance for flexibility without having a real understanding of how much penalty this flexibility will cause, and what the benefit of this flexibility is ... opening a door for multiple implementations to behave differently is not a good answer if it means users will have different behaviors in different places for the same cluster and APIs.
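For illustration, the referencing rules quoted at the top of this thread can be expressed as a small Go validation helper; the AttachmentRef type is a stand-in, not a proposed API.

```go
// Illustrative only: "either the PodNetwork or a PodNetworkAttachment for it,
// not both; a given PodNetwork only once per Pod", as a validation check.
package sketch

import "fmt"

// AttachmentRef is a stand-in for one entry in a Pod's networks list: either a
// direct PodNetwork reference or a PodNetworkAttachment that resolves to one.
type AttachmentRef struct {
	PodNetwork string // resolved PodNetwork name
	ViaAttach  bool   // true if referenced through a PodNetworkAttachment
}

// validateRefs enforces at-most-one reference per PodNetwork for a single Pod,
// regardless of whether it came directly or via an attachment.
func validateRefs(refs []AttachmentRef) error {
	seen := map[string]bool{}
	for _, r := range refs {
		if seen[r.PodNetwork] {
			return fmt.Errorf("PodNetwork %q referenced more than once (directly and/or via PodNetworkAttachment)", r.PodNetwork)
		}
		seen[r.PodNetwork] = true
	}
	return nil
}
```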
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle frozen
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.