CFP: 34841 BGP Route Learning #58
# CFP-34841: BGP Route Learning

**SIG:** SIG-BGP, SIG-Datapath

**Begin Design Discussion:** 2024-09-11

**Cilium Release:** X.XX

**Authors:** David Swafford <dswafford@coreweave.com>

## Summary
Cilium's Border Gateway Protocol (BGP) implementation is one-directional today: its intended use is advertising pod and service IP addresses/prefixes to the network. The network, however, is unable to influence Cilium's routing decisions via BGP or by any other means. Cilium at present expects a simple network design at the node level, where all routing needs can be satisfied using a default route. The upstream router is assumed to be capable of making all routing decisions and to have connectivity to all necessary routing domains.
In advanced network designs, the administrator may wish to establish connectivity between Cilium and several isolated routing domains. For example, one routing domain may provide connectivity to all other Kubernetes nodes, another may provide connectivity only to the Internet, while a third may provide connectivity only to storage resources. On the network upstream of Cilium, these domains may be isolated using various techniques, such as Virtual Routing and Forwarding (VRF) instances, or even physical isolation with dedicated network interface cards (NICs).

When configured for [Native-Routing](https://docs.cilium.io/en/stable/network/concepts/routing/#native-routing) mode, it appears possible to support the advanced network design described above. However, doing so in large environments would require complex management and automation of Linux routing tables.

Proposed here is a change to support route learning via BGP.
## Motivation

### User Stories



*As a cluster administrator, I wish to deploy Cilium on a node which connects to two or more routing domains. The deployment is considered a success when Cilium is capable of learning non-overlapping routes using BGP from peers across each routing domain, and when Cilium is able to route outbound traffic to the correct routing domain.*
Related PR: https://github.com/cilium/cilium/pull/33035/
## Goals

* When Cilium is configured to use the Border Gateway Protocol (BGP) with one or more peers, routes advertised by those peers to Cilium are accepted, evaluated through BGP's Best Path Selection algorithm, and the resulting best path(s) installed into the node's dataplane. The net result is that Cilium makes Internet Protocol (IP) routing decisions for Pod/Service-sourced traffic based on routes learned via BGP.
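As background for this goal, the first tie-breakers of BGP best-path selection can be sketched as follows. The `path` type and `betterPath` function below are illustrative only and are not Cilium or GoBGP types; real selection continues through many more steps (origin, MED, eBGP vs. iBGP, router ID, and so on):

```go
package main

import "fmt"

// path is a minimal stand-in for a BGP path; the fields are
// illustrative, not GoBGP's actual types.
type path struct {
	Prefix    string
	NextHop   string
	LocalPref uint32
	ASPathLen int
}

// betterPath returns the preferred of two paths using the first two
// steps of BGP best-path selection: higher LOCAL_PREF wins, then
// shorter AS_PATH. Ties fall through to the first argument here;
// real implementations apply further tie-breakers.
func betterPath(a, b path) path {
	if a.LocalPref != b.LocalPref {
		if a.LocalPref > b.LocalPref {
			return a
		}
		return b
	}
	if a.ASPathLen <= b.ASPathLen {
		return a
	}
	return b
}

func main() {
	viaStorage := path{Prefix: "10.20.0.0/16", NextHop: "192.0.2.1", LocalPref: 100, ASPathLen: 2}
	viaInternet := path{Prefix: "10.20.0.0/16", NextHop: "198.51.100.1", LocalPref: 100, ASPathLen: 4}
	best := betterPath(viaStorage, viaInternet)
	fmt.Println(best.NextHop) // shorter AS path wins: 192.0.2.1
}
```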
## Non-Goals

* Making Cilium's BGP implementation Virtual Routing and Forwarding (VRF) instance aware.

## Proposal

### Overview

### Route Programming
Cilium supports two modes of routing -- [Encapsulation](https://docs.cilium.io/en/stable/network/concepts/routing/#encapsulation) and [Native](https://docs.cilium.io/en/stable/network/concepts/routing/#native-routing).

#### Native Routing Mode
When operating in Native Routing mode, the implementation appears straightforward. A mechanism is required to synchronize routes learned via BGP, specifically from Cilium's in-memory instance of GoBGP, into the Linux routing table. For context, GoBGP may be deployed as a standalone daemon or instantiated directly through its Go package; Cilium instantiates it directly.
GoBGP on its own is unable to install routes into the Linux routing table directly. GoBGP's documentation [suggests](https://github.com/osrg/gobgp/blob/master/docs/sources/zebra.md) running an additional BGP daemon such as Quagga or FRR and establishing communication between the two.
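For reference, GoBGP's zebra integration is enabled through its configuration file. The fragment below approximates the shape shown in the linked documentation; the socket path and API `version` value vary by Quagga/FRR release and should be verified against that page before use:

```toml
# Approximate sketch of gobgpd's zebra integration config.
# Socket path and zapi version depend on the FRR/Quagga release.
[zebra]
  [zebra.config]
    enabled = true
    url = "unix:/var/run/frr/zserv.api"
    version = 6
```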
As the number of routes received and the rate of change may vary dramatically from one environment to another, it is the author's recommendation that one of the existing GoBGP-supported daemons be selected. Under this model, an additional daemon will be deployed within Cilium's BGP-speaking pods. For a Kubernetes-based deployment, these are the pods named `cilium-<hash>`.
For Kubernetes deployments, where Cilium runs within a container, the mechanism that synchronizes routes to the Linux routing table must write not to the container's routing table but to the underlying node's routing table. The author seeks guidance from the community on how best to approach this.
> **Reviewer:** We need to explore some areas.
>
> **Author:** Re: route hijacking, do we have any protection today?
>
> **Reviewer:** Since we do not install BGP routes into the kernel, this is not an issue for now.
#### Encapsulation Mode

The author requests guidance from the community to support this feature in Encapsulation mode.
### BGP Import Policy

Under [this PR](https://github.com/cilium/cilium/pull/33035), I modified Cilium to reject all BGP paths advertised toward it. This proposal will revert the majority of those changes.
### Custom Resource Definition Modifications

To mitigate the risk associated with route leaks, a new configuration option named `prefixLimit` will be added to `CiliumBGPPeeringPolicy` and `CiliumBGPPeerConfig`. Exceeding the configured prefix limit will result in the BGP session being torn down. Under the covers, GoBGP supports additional configuration for this option, as seen in [`PrefixLimitConfig`](https://pkg.go.dev/github.com/osrg/gobgp/internal/pkg/config#PrefixLimitConfig). It may be desirable to expose an equivalent CRD configuration option for GoBGP's `RestartTimer`.
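A hypothetical sketch of how the proposed option might appear on a `CiliumBGPPeerConfig` resource follows; the field names, placement, and values are illustrative only, not a committed API:

```yaml
# Illustrative only -- `prefixLimit` is the option proposed by this
# CFP, not an existing Cilium field.
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: tor-peer
spec:
  prefixLimit:
    maxPrefixes: 5000   # tear down the session when exceeded
    restartTimer: 120   # optional: seconds before re-establishing,
                        # mirroring GoBGP's RestartTimer
```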
## Impacts / Key Questions

_List crucial impacts and key questions. They likely require discussion and are required to understand the trade-offs of the CFP. During the lifecycle of a CFP, discussion on design aspects can be moved into this section. After reading through this section, it should be possible to understand any potentially negative or controversial impact of this CFP. It should also be possible to derive the key design questions: X vs Y._
### Impact: Cilium complexity

Implementation of this feature introduces a fair amount of complexity to Cilium, and complex systems are harder to reason about. The complexity introduced includes a second BGP daemon, manipulation of the node's Linux routing table, and additional configuration options.

Further, supporting this feature in Encapsulation routing mode may require extensive changes outside of the BGP-related source code.
### Key Question: Should the feature be limited to Native-Routing mode?

When in Native-Routing mode, the scope of the proposed changes is limited to the BGP-related source code and the introduction of a second BGP daemon.
### Key Question: Does the introduction of the feature REQUIRE the administrator to advertise a default route via BGP?

With the existing behavior relying on a default route not learned via BGP, does the introduction of the feature remove the ability to use such a default route? Does it instead require the administrator to announce a default route via BGP?
### Option 1: Implement the Proposed Feature

#### Pros

* Unlocks new use cases for Cilium with advanced network designs.

#### Cons

* Introduces additional complexity for both maintainers and administrators. For administrators, the added complexity lies in the fact that the underlying Linux routing table may change without notice in response to BGP routing changes.
### Option 2: Do Nothing

> **Reviewer:** I think we do need to enhance Cilium's capability to install learned BGP routes. I'd add a third option in this CFP as well: pass the responsibility of installing kernel routes to Cilium itself instead of another process. There are a few advantages to it, and some cons.
>
> **Author:** I had assumed that implementing RIB management directly in Cilium would deviate too far from its intended scope. If we were to invest the time to implement RIB programming, I'd want to ask "why not invest that time into implementing it in GoBGP directly?" instead. cc @YutaroHayakawa -- aside from the pros/cons listed, would native RIB programming within Cilium unblock new features, or reduce existing tech debt somehow?
>
> **Member:** For me, the answer to this question is "eventually, we may want to go this direction, but it's too ambitious for now". I think there is a huge gap between having a generic RIB implementation for GoBGP and implementing a RIB specific to Cilium's use case. To decouple the RIB implementation from Cilium, we need to come up with a stable interface between the RIB and Cilium, and designing such an interface is already hard. Putting them into a single Go binary simplifies the implementation a lot; we can break the interface at any time as needed. Once our implementation becomes mature enough, we can always consider extracting it into an independent project, but we shouldn't set that as a goal from the beginning. It also allows us to program Cilium's eBPF data plane, which leaves us the possibility of implementing data-plane features that don't exist in the Linux kernel with eBPF and integrating them with BGP. If we go with the GoBGP-oriented approach, it is hard to justify support for such a specialized data plane. As one engineer, I agree with you: GoBGP should get a proper generic RIB and data-plane manager implementation like Zebra (they tried it in the past but failed), but I guess it's hard to do that in the context of the current Cilium project.
#### Pros

* Cilium remains simple to reason about.

#### Cons

* Cilium cannot be used in advanced network designs, potentially limiting Cilium adoption.
## Future Milestones

_List things that this CFP will enable but that are out of scope for now. This can help understand the greater impact of a proposal without requiring to extend the scope of a CFP unnecessarily._

### Deferred Milestone 1

Unknown at this time.
> **Reviewer:** Cilium today supports multiple BGP instances, so there are a couple of options present: either all Cilium BGP instances peer with a single additional BGP daemon (such as FRR, BIRD, or another gobgp+zebra) which programs the learned routes into the kernel, or we create a 1:1 additional BGP daemon per instance and do peering over a loopback IP and a different port. The 1:1 mapping sounds wrong in terms of scaling and potential conflicts while installing routes into the kernel.
>
> **Author:** @harsimran-pabla This is great feedback! Thank you for the depth of your responses. On your first points, I agree with you: I would not want to introduce N+ instances of a BGP daemon, and your point about conflicts when programming the kernel routing table seems very likely.
>
> *Adjacent BGP daemon for RIB programming:* When writing the proposal, I was modeling it after GoBGP's existing support -- a BGP daemon that is off to the side, but not inline. Your point about graceful restart is important to consider here. If the remote peer (the ToR) triggers a graceful restart, the downstream BGP daemon handling RIB programming would be unaware. The GoBGP instance running via Cilium would remain running, and as long as it does, the ToR's routes should remain in the adjacent BGP daemon. That is one direction; GR in the reverse seems complicated.
>
> *Inline BGP daemon for RIB programming:* This is an interesting idea. Here, Cilium's BGP instance remains the furthest downstream from the network (a stub router), and Cilium peers with an intermediary BGP daemon on the node. Under this model, would Cilium's BGP configuration reflect only the peering with the intermediary peer?
>
> Do you have any perspective on how many users, and how large the Cilium installations are, that have requested route learning in the past? If I were the only one, I could see pushing the complexity back on the user (myself); in my situation, we're more than capable of managing an additional daemon on each node. But the pushback I expect is that everything on the node is deployed and managed by Kubernetes and CRDs. If I deploy and manage a separate BGP daemon, and it's now a dependency of Cilium, I would want to configure that daemon via CRDs as well. I believe I would then need to implement a copy of Cilium's BGP-related CRDs under a new CRD, with software to consume that CRD and program the additional BGP daemon. This is probably still a far easier path than implementing RIB programming in Cilium directly. I'd want to explore more what possibilities are unlocked, or issues removed, if Cilium owned RIB programming.
>
> **Reviewer:** Yes, that would be the case: Cilium would only peer with the on-node BGP instance, which would be listening on localhost and some specific port number. Not many users, but few enough that we are considering this feature; it is mostly for advanced topologies where multiple NICs are present on the server and some traffic engineering is going on. There are also a few cases where I have seen users install an additional BGP process on the node for other reasons. This comes from the need to provision the node itself prior to Kubernetes installation and advertise a node loopback address into the core network via BGP, which is not possible with Cilium BGP since it requires the node to already be part of a Kubernetes cluster. If you go toward installing your own BGP router on the node, I would recommend looking at it from this angle as well: decoupling Kubernetes from this BGP process might provide some benefits.