
CFP: 34841 BGP Route Learning#58

Closed
dswaffordcw wants to merge 1 commit into cilium:main from dswaffordcw:cfp/34841-bgp-route-learning

Conversation

@dswaffordcw

This PR adds a CFP for BGP Route Learning (GH Issue cilium/cilium#34841).

Signed-off-by: David Swafford <dswafford@coreweave.com>
@joestringer
Member

cc @cilium/sig-bgp


GoBGP on its own is unable to install routes into the Linux routing table directly. GoBGP's documentation [suggests](https://github.com/osrg/gobgp/blob/master/docs/sources/zebra.md) running an additional routing daemon such as Quagga or FRR and establishing communication between the two.
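For reference, the Zebra integration described in GoBGP's documentation is enabled through `gobgpd`'s configuration file. A minimal sketch follows; the socket path and API version are assumptions that depend on the installed FRR/Quagga version:

```toml
# gobgpd.toml -- sketch of the Zebra integration from GoBGP's docs.
# The zserv socket path and API version are placeholders; they vary
# with how FRR/Quagga is packaged and which version is running.
[global.config]
  as = 65001
  router-id = "10.0.0.2"

[zebra]
  [zebra.config]
    enabled = true
    url = "unix:/var/run/frr/zserv.api"  # zserv socket exposed by zebra
    version = 6                          # Zebra API version (FRR 7.x era)
```

With this in place, routes GoBGP selects as best paths are handed to zebra, which performs the actual kernel programming.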

As the number of routes received and the rate of change may vary dramatically from one environment to another, it is the author's recommendation that one of the daemons GoBGP already supports be selected. Under this model, an additional daemon will be deployed within Cilium's BGP-speaking pods. For a Kubernetes-based deployment, these are the pods named `cilium-<hash>`.


Cilium today supports multiple BGP instances. There are a couple of options: either all Cilium BGP instances peer with a single additional BGP daemon (such as FRR, BIRD, or another GoBGP+Zebra) which programs the learned routes into the kernel, or we create a 1:1 additional BGP daemon per instance and peer over a loopback IP + different port. The 1:1 mapping sounds wrong in terms of scaling and potential conflicts while installing routes in the kernel.

This seems a bit counterintuitive:

  • Additional BGP daemon -- Cilium BGP instances -- upstream routers.
  • Alternatively, this might be better: Cilium BGP instances -- node BGP daemon (installs kernel routes) -- upstream routers.

A few questions:

  1. With this design idea, why do we need to bundle an additional BGP daemon into Cilium? To achieve a similar result, we can have a user-controlled BGP daemon deployed on the node, which peers with the upstream routers (ToRs/core routers), and Cilium peers with this BGP router on the node. Yes, there is the additional complexity of managing this BGP daemon on the node; its lifecycle will have to be independent of Kubernetes.
  2. GoBGP and Zebra integration requires additional testing, and this maintenance burden will fall on Cilium if we package them together and publish them as a bundle. How do we go about its maintenance?

Author

@dswaffordcw dswaffordcw Oct 3, 2024


@harsimran-pabla This is great feedback! Thank you for the depth of your responses.

For your first points, I agree with you. I would not want to introduce N+ instances of a BGP daemon. Your point about conflicts when programming the kernel routing table seems very likely.

Adjacent BGP Daemon for RIB Programming
When writing the proposal, I was modeling it after GoBGP's existing support -- a BGP daemon that is off to the side, not inline. Your point about graceful restart is important to consider here. If the remote peer (the ToR) triggers a graceful restart, the downstream BGP daemon handling RIB programming would be unaware. The GoBGP instance running via Cilium would remain running, and as long as it does, the ToR's routes should remain in the adjacent BGP daemon. That is one direction; GR in the reverse direction seems complicated.

Inline BGP Daemon for RIB Programming
This is an interesting idea. Here, Cilium's BGP instance remains the furthest downstream from the network (a stub router). Cilium peers with an intermediary BGP daemon on the node. Under this model, would Cilium's BGP configuration reflect only the peering with the intermediary peer?

Do you have any perspective on how many users, and how large those Cilium installations are that have requested route learning in the past? If I were the only one, I could see pushing the complexity back on the user (myself). In my situation, we're more than capable of managing an additional daemon on each node. But, the pushback I expect, is that everything on the node is deployed and managed by k8s and CRDs. If I deploy and manage a separate BGP daemon, and it's now a dependency of Cilium, I would want to configure the additional daemon via CRDs as well. I believe I would then need to implement a copy of Cilium's BGP-related CRDs under a new CRD, with software to consume that CRD and program the additional BGP daemon.

This is probably a far easier path still than implementing RIB programming in Cilium directly. I'd want to explore more what possibilities are unlocked, or issues we remove, if Cilium owned RIB programming.


Under this model, would Cilium's BGP configuration reflect only the peering with the intermediary peer?

Yes, that would be the case. Cilium would only peer with the on-node BGP instance, which would be listening on localhost and a specific port number.
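To illustrate what that localhost peering might look like, here is a sketch of a BGPv2 `CiliumBGPClusterConfig`. The field names follow the BGPv2 schema, but the ASNs, resource names, and the port note are placeholders, not a tested configuration:

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: bgp-local-daemon        # hypothetical name
spec:
  bgpInstances:
  - name: instance-64512
    localASN: 64512
    peers:
    - name: on-node-daemon      # the intermediary BGP daemon on this node
      peerASN: 64512
      peerAddress: "127.0.0.1"  # loopback peering
      peerConfigRef:
        name: local-peer-config # a CiliumBGPPeerConfig would carry the
                                # non-default port and timer settings
```

Under this model, Cilium's BGP configuration describes only this single local session; the real upstream peers live in the on-node daemon's own configuration.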

Do you have any perspective on how many users, and how large those Cilium installations are that have requested route learning in the past?

Not many, but enough that we are considering this feature. This is mostly for advanced topologies where multiple NICs are present on the server and some traffic engineering is going on.

If I deploy and manage a separate BGP daemon, and it's now a dependency of Cilium, I would want to configure the additional daemon via CRDs as well.

There are a few cases where I have seen users install an additional BGP process on the node for other reasons. This comes from the need to provision the node itself prior to Kubernetes installation and advertise the node loopback address into the core network via BGP. That is not possible with Cilium BGP, since it requires the node to already be part of a Kubernetes cluster.

If you go toward installing your own BGP router on the node, I would recommend looking at it from this angle as well. Decoupling this BGP process from Kubernetes might provide some benefits.


As the number of routes received and the rate of change may vary dramatically from one environment to another, it is the author's recommendation that one of the daemons GoBGP already supports be selected. Under this model, an additional daemon will be deployed within Cilium's BGP-speaking pods. For a Kubernetes-based deployment, these are the pods named `cilium-<hash>`.

For Kubernetes deployments, where Cilium runs within a container, the mechanism that synchronizes routes to the Linux routing table needs to target NOT the container's routing table but the underlying node's routing table. The author seeks guidance from the community on how best to approach this.
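One common way a containerized routing daemon reaches the node's routing table is to run in the host's network namespace, which the Cilium agent DaemonSet already does. A sketch of the relevant pod spec fields (the container name and image are hypothetical; the field and capability names are standard Kubernetes):

```yaml
# Fragment of a DaemonSet pod spec for a route-programming daemon.
spec:
  hostNetwork: true        # share the node's network namespace, so netlink
                           # route changes land in the node's routing table
  containers:
  - name: bgp-daemon       # hypothetical container
    image: example.org/bgp-daemon:latest   # placeholder image
    securityContext:
      capabilities:
        add: ["NET_ADMIN"] # needed to modify kernel routes via netlink
```

With `hostNetwork: true` there is no separate container routing table to worry about; the daemon's netlink calls operate directly on the node.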


We need to explore some areas:

  1. How should graceful restart be configured in such a deployment?
  2. How do we protect the node and cluster from route hijacking?

Author


Re: route hijacking, do we have any protection today?


Since we do not install BGP routes into kernel, this is not an issue for now.


* Introduces additional complexity for both maintainers and administrators. For administrators, the complexity lies in the fact that the underlying Linux routing table may change without notice in response to BGP routing changes.

### Option 2: Do Nothing


I think we do need to enhance Cilium's capability to install learned BGP routes. I'd add a 3rd option to this CFP as well: pass the responsibility of installing kernel routes to Cilium itself instead of another process.

There are a few advantages to it.
Pros:

  • Lifecycle and management of BGP on the node is via Cilium (although in some scenarios this might be a drawback as well).
  • Tighter control over what gets installed in the kernel, and with which priority (admin distance) when there are multiple sources of a route (BGP instances, or other Cilium features such as auto-direct-node-routing).

Cons:

  • Complex to implement (essentially we have to implement a RIB engine inside Cilium).
  • Cilium would need to sync kernel routes on restarts.
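The admin-distance point above can be made concrete. This is a sketch (not Cilium code) of the selection a native RIB would need when multiple sources offer the same prefix; the source names and distance values are illustrative:

```go
package main

import "fmt"

// Route is a candidate route from one source.
type Route struct {
	Prefix   string
	NextHop  string
	Source   string // e.g. "bgp", "auto-direct-node-routes" (illustrative)
	Distance int    // lower wins, like router admin-distance
}

// bestRoutes keeps, per prefix, the candidate with the lowest distance.
func bestRoutes(candidates []Route) map[string]Route {
	best := make(map[string]Route)
	for _, r := range candidates {
		if cur, ok := best[r.Prefix]; !ok || r.Distance < cur.Distance {
			best[r.Prefix] = r
		}
	}
	return best
}

func main() {
	routes := []Route{
		{Prefix: "10.1.0.0/24", NextHop: "192.168.0.1", Source: "bgp", Distance: 20},
		{Prefix: "10.1.0.0/24", NextHop: "192.168.0.9", Source: "auto-direct-node-routes", Distance: 10},
	}
	// The lower-distance source wins for the conflicting prefix.
	fmt.Println(bestRoutes(routes)["10.1.0.0/24"].Source)
}
```

The "sync on restart" con then amounts to recomputing this map and reconciling it against whatever routes Cilium previously left in the kernel.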

Author


I had assumed that implementing RIB management directly into Cilium would deviate too far from its intended scope. If we were to invest the time to implement RIB programming, I'd want to ask "Why not invest that time into implementing it in GoBGP directly" instead? cc @YutaroHayakawa

Aside from the pros/cons listed, would native RIB programming within Cilium unblock new features, or reduce existing tech debt somehow?

Member

@YutaroHayakawa YutaroHayakawa Oct 4, 2024


"Why not invest that time into implementing it in GoBGP directly"

For me, the answer to this question is "Eventually, we may want to go in this direction, but it's too ambitious for now". I think there is a huge gap between having a generic RIB implementation for GoBGP and implementing a RIB specific to Cilium's use case.

To decouple the RIB implementation from Cilium, we need to come up with a stable interface between the RIB and Cilium. Designing such an interface is already hard. Putting them into a single Go binary would simplify the implementation a lot: we can break the interface at any time as needed.

Once our implementation becomes mature enough, we can always consider extracting it into an independent project. However, we shouldn't set that as a goal from the beginning.

Aside from the pros/cons listed, would native RIB programming within Cilium unblock new features, or reduce existing tech debt somehow?

It allows us to program Cilium's eBPF data plane. That leaves open the possibility of implementing, in eBPF, data plane features that don't exist in the Linux kernel and integrating them with BGP. If we go with the GoBGP-oriented approach, it is hard to justify support for such a specialized data plane.

===

Speaking as one engineer, I agree with you. GoBGP should get a proper generic RIB and data plane manager implementation like Zebra (they tried this in the past but failed), but I guess it's hard to do that in the context of the current Cilium project.


@dswaffordcw
Author

Hi @harsimran-pabla @YutaroHayakawa @rastislavs

I presented this proposal last week to several of the Datapath maintainers (@joestringer, Jordan Rife, Hemanth Malla, Bowei Du) at the Cilium Developer's Summit, which took place alongside KubeCon. Slides:
SwaffordBGPNov2024.pdf

The feedback was unanimous -- in opposition.

I focused the discussion on your last point, @harsimran-pabla, regarding what benefits might be unlocked by bringing RIB programming, and more generally routing decisions, directly into the Dataplane.

Notes:

  • I learned that the Dataplane is unaware of routes/routing decisions today, and defers those decisions entirely to the Kernel routing table.

Concerns raised:

  • Bringing routing decisions into eBPF introduces a significant risk regarding complexity and potential bugs, compared to the mature implementation of the Kernel routing.
  • There were no specific features beyond basic routing enabled by this change. I thought from the original discussion here that one or more security features may have benefited from having routing knowledge native to Cilium -- but we could not come up with any examples.
  • The fact that none of the major cloud providers use BGP with Cilium. CoreWeave's decision to use BGP with Cilium makes us a niche use case.

I then switched focus to my original proposal of packaging an off-the-shelf BGP daemon as part of Cilium and wiring up configuration to make it transparent. We revisited the inline and parallel BGP-speaker deployment models. Bowei Du suggested that this could easily be accomplished by making BIRD, etc. part of the Cilium deployment-specific Helm chart. Until that day, I was envisioning the need to manage BIRD, etc. independent of K8s; @harsimran-pabla talked about a specific example where that might even be a requirement. For CoreWeave's use case, we don't have a dependency on BGP prior to K8s. I couldn't think of a strong argument against that model, and it reminded me that the configuration for BGP peers would be deterministic anyway.

@harsimran-pabla I know you mentioned before that BGP route learning was a feature your group was interested in offering. I think Rastislav also mentioned that on one of the GitHub issues. With this feedback, I'm curious to hear your thoughts. From my side, I plan to test out the co-located model of adding BIRD to the Cilium K8s deployment.

@YutaroHayakawa
Member

YutaroHayakawa commented Nov 19, 2024

I then switched focus to my original proposal of packaging up an off-the-shelf BGP daemon as part of Cilium, and wiring up configuration to make it transparent. We revisited the in-line and parallel BGP-speaker deployment models. Bowei Du suggested that this could easily be accomplished by making BIRD, etc. part of the Cilium deployment-specific Helm chart.

Hmm, in this case, users can just deploy their own DaemonSet instead of packaging it as part of the Cilium chart, no? That way they can choose whatever implementation they want: FRR, BIRD, any proprietary router. What kind of value can we offer by packaging it? Could you elaborate more on "wiring up configuration to make it transparent"?

@dswaffordcw
Author

@YutaroHayakawa What I had in mind would be similar to how GoBGP is used under the covers today: the user does not need to think about it or be aware of it.

Cilium is already quite complex. Instead of asking the user to think about and manage the deployment separately, what I had in mind was this:

When Cilium is started with BGP enabled (`bgpControlPlane.enabled`), and when BGP is configured (by the presence of the Custom Resource `CiliumBGPClusterConfig`):

  1. A BIRD configuration is internally generated from `CiliumBGPClusterConfig`. Instead of configuring GoBGP peers from `CiliumBGPClusterConfig`, the peers defined there are configured on the BIRD instance. Cilium's GoBGP instance(s) would peer with BIRD; BIRD would peer with the outside network.
  2. A container running BIRD is started in Cilium's namespace.
  3. The configuration generated in step 1 is loaded into the BIRD container at startup.
  4. Cilium's GoBGP configuration contains a single, internally configured peer: the internal BIRD container.

The `cilium bgp status` and `cilium bgp routes` CLI commands would be altered to render status from the internal BIRD container instead of Cilium's GoBGP instance, and/or extended to provide visibility into both.

In this approach, the user does not need to think about or configure BIRD. That is now an implementation detail managed by Cilium. The purpose of the BIRD instance is to be a peer in-between Cilium and the network, for the sole purpose of performing RIB programming and relaying Cilium-generated routes.
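A rough sketch of what the generated BIRD (2.x-style) configuration from step 1 might look like; all ASNs, ports, and addresses are placeholders, and the exact mapping from `CiliumBGPClusterConfig` to BIRD syntax is the open design question:

```
# bird.conf -- hypothetical output of step 1; all numbers are placeholders.
protocol device { }

protocol kernel {
  ipv4 { export all; };      # program learned routes into the kernel table
}

protocol bgp cilium_gobgp {
  local 127.0.0.1 as 64512;
  neighbor 127.0.0.1 port 1790 as 64512;  # internal session to Cilium's GoBGP
  ipv4 { import all; export all; };
}

protocol bgp upstream_tor {
  local as 64512;
  neighbor 10.0.0.1 as 65000;   # the real peer taken from CiliumBGPClusterConfig
  ipv4 { import all; export all; };
}
```

The `kernel` protocol is what gives this model RIB programming for free: routes learned from the upstream session are exported to the node's routing table, while Cilium-originated routes are relayed upstream.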

Now, what I described may be a nightmare to implement. I'll admit, I am fairly new to this side of Kubernetes. Dynamically starting a container with BIRD based on Cilium's `bgpControlPlane.enabled` configuration alone may not be possible. In that case, we could always spin up an instance of BIRD regardless, but that may be wasteful.

@YutaroHayakawa
Member

YutaroHayakawa commented Nov 19, 2024

Thanks for your explanation! So, basically your idea is configuring BIRD from `CiliumBGPClusterConfig` instead of GoBGP, but BIRD still peers with the local GoBGP to get the routes, correct?

I understand your point that with the colocated BGP speaker setup, we need another orchestration layer for BIRD, and you don't want to orchestrate two different BGP speakers. However, your idea sounds almost identical to introducing another backend BGP speaker (BIRD) for BGP Control Plane 😅. Your idea that "a BIRD configuration is internally generated from `CiliumBGPClusterConfig`" is essentially what BGP Control Plane does for GoBGP.

To be fair, we tried to make the BGPv2 API speaker-agnostic, and we even have an abstraction layer that hides implementation details of the underlying BGP speaker (https://github.com/cilium/cilium/blob/main/pkg/bgpv1/agent/routermgr.go). However, I'd say we're surely relying on GoBGP's behavior implicitly, so you'll have a hard time dealing with this abstraction.

Also, I'm not sure we're ready to introduce another speaker, since it doubles the maintenance cost. For example, when we introduce any new feature, we need to make sure it works for both speakers. I'm not sure it's worth having that cost just for importing routes.

Dynamically starting a container with BIRD based on Cilium's `bgpControlPlane.enabled` configuration alone may not be possible. In that case, we could always spin up an instance of BIRD regardless, but that may be wasteful.

This is possible with Helm's subchart feature, for example. You can maintain your own parent chart that packages Cilium's chart, refer to Cilium's Helm values, and conditionally render the BIRD DaemonSet.

This still doesn't solve the problem of two orchestrators -- you still need to render BIRD's configuration yourself -- but I'd say it's much easier than adapting the BGP Control Plane APIs. You don't need to support all the configuration knobs of the BGP Control Plane APIs that are unnecessary for you.
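The subchart approach might look roughly like this; the chart name, image, and structure are illustrative, not a published chart:

```yaml
# parent-chart/Chart.yaml (hypothetical wrapper chart) would declare:
#   dependencies:
#   - name: cilium
#     repository: https://helm.cilium.io
#     version: <pinned version>
#
# parent-chart/templates/bird-daemonset.yaml
# Values under .Values.cilium are forwarded to the Cilium subchart, so the
# same flag gates both Cilium's BGP Control Plane and the BIRD DaemonSet.
{{- if .Values.cilium.bgpControlPlane.enabled }}
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bird
spec:
  selector:
    matchLabels: {app: bird}
  template:
    metadata:
      labels: {app: bird}
    spec:
      hostNetwork: true            # program the node's routing table
      containers:
      - name: bird
        image: example.org/bird:2  # placeholder image
{{- end }}
```

This keeps a single `helm install` for users while leaving BIRD's own configuration rendering to the parent chart.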

@joestringer
Member

It was great to meet you last week @dswaffordcw. This idea to also perform route imports in addition to route exports is an interesting use case to consider, as it makes different assumptions about the underlying network than the existing use cases.

At least from my perspective, I'm not yet fully sure about the scope of desired changes for route learning: how far into implementing BGP we would need to go, or how much abstraction we would need to build in order to grant users the flexibility they'd like in route configuration. Considering that the BIRD configuration for BGP references 35 RFCs and has dozens of pages of configuration guide, it's a little hard to understand what is "good enough" for an abstraction if we extend the Cilium APIs to solve this use case. We probably need the proposal to get a bit more concrete in order to explore that question.

I thought that as an initial step, it could help to deploy a BIRD sidecar with Cilium, which would ideally provide the capabilities without too much development effort. Often folks who come to Cilium looking for BGP are already well familiar with existing BGP daemons, so in a sense I figured that this would minimize the change in configuration language for users already familiar with BGP. An example issue/discussion/blog in the community with the full configuration / steps could also help to demonstrate the use case and help us figure out exactly how much of the defined BGP functionality we are looking to implement in Cilium APIs. By itself this could already be useful, even if the usability is not yet ideal. That could then serve as a basis for furthering this CFP.

I do want to leave open the possibility to go down the track of extending Cilium BGP APIs for this use case, as long as we feel that there is sufficient interest and active maintenance involvement from the Cilium development community. Though I would defer to @YutaroHayakawa , @harsimran-pabla and @rastislavs and the other folks involved in those APIs to provide the technical guidance around integrating the functionality into the Cilium APIs.

As for having a Cilium abstraction programming into BIRD, I'm not sure it would make sense to build/maintain that functionality in Cilium, since we're already maintaining a GoBGP-based backend. I understood from the discussion last week that GoBGP is currently missing the logic to program Linux routing tables, but this seems like a solvable problem; I see GoBGP already imports vishvananda/netlink for kernel API interactions, for instance. Given where Cilium developers are currently spending effort, this seems like it would be best aligned to keep the architecture as simple as possible.
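To make the "solvable problem" concrete: the missing piece is essentially a reconciler that diffs BGP best paths against the routes already installed in the kernel. Below is a dependency-free sketch of that diff logic only; a real implementation would populate the maps from GoBGP's RIB and replace the returned sets with netlink route add/delete calls. The prefixes and next hops are made up:

```go
package main

import "fmt"

// diff compares desired routes (GoBGP best paths) against actual routes
// (BGP-owned entries already in the kernel table) and returns what to
// install and what to withdraw. Keys are prefixes, values are next hops.
func diff(desired, actual map[string]string) (add, del map[string]string) {
	add, del = map[string]string{}, map[string]string{}
	for prefix, nh := range desired {
		if actual[prefix] != nh {
			add[prefix] = nh // missing, or next hop changed
		}
	}
	for prefix, nh := range actual {
		if _, ok := desired[prefix]; !ok {
			del[prefix] = nh // no longer a best path
		}
	}
	return add, del
}

func main() {
	desired := map[string]string{
		"10.1.0.0/24": "192.168.0.1", // new best path -> install
		"10.2.0.0/24": "192.168.0.2", // unchanged -> no-op
	}
	actual := map[string]string{
		"10.2.0.0/24": "192.168.0.2",
		"10.3.0.0/24": "192.168.0.3", // withdrawn upstream -> remove
	}
	add, del := diff(desired, actual)
	fmt.Println(len(add), len(del))
}
```

Running the reconcile on every best-path change (and once at startup) also addresses the restart-resync concern raised earlier in the thread.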

@dswaffordcw
Author

It was nice to meet you as well @joestringer! I really enjoyed your developer summit.

Thank you, Joe, and also @YutaroHayakawa, for your detailed feedback. I agree. At this point, it seems best to experiment further with the co-deployment options and withdraw the proposal for the time being.

If you come across people requesting this functionality in the future, please direct them to this CFP. I'd love to hear from others who are interested about their use case.

@isac

isac commented Feb 12, 2025

Let me describe our use-case, as you asked for others interested in this @dswaffordcw

We recently encountered this issue (one-directional BGP) in one of our new on-premise deployments. After evaluating Cilium and similar products, we chose Cilium for its native BGP support. Our use case involves a kubernetes cluster that spans multiple data centers and requires bi-directional BGP for external integrations. Unfortunately, we didn’t fully understand the specifics of Cilium’s BGP capabilities beforehand.

We've been running Cilium for a while and observed that routes are both received and advertised using the `cilium bgp peers` command. However, as we discovered far too late, the number of received routes doesn't actually indicate proper functionality.

We would of course love to have bi-directional support, but as I understand from the communication above, we'll look into other options for now.

@dswaffordcw
Author

dswaffordcw commented Feb 13, 2025

Let me describe our use-case, as you asked for others interested in this @dswaffordcw

We recently encountered this issue (one-directional BGP) in one of our new on-premise deployments. After evaluating Cilium and similar products, we chose Cilium for its native BGP support. Our use case involves a kubernetes cluster that spans multiple data centers and requires bi-directional BGP for external integrations. Unfortunately, we didn’t fully understand the specifics of Cilium’s BGP capabilities beforehand.

We've been running Cilium for a while and observed that routes are both received and advertised using the `cilium bgp peers` command. However, as we discovered far too late, the number of received routes doesn't actually indicate proper functionality.

We would of course love to have bi-directional support, but as I understand from the communication above, we'll look into other options for now.

Hi @isac,

One idea that was suggested to me when I presented this to a few of the maintainers was that I could accomplish the same with far less complexity by inserting a BGP daemon inline between Cilium and the external network. I haven't tried that yet, as I solved my original need with a different custom solution.

But for your case, that option might be interesting to look at?

The way it would work is that you would add a BIRD daemon either as part of Cilium's DaemonSet, or as a standalone DaemonSet on the same node.

Your Cilium CRD would be configured to peer with the instance of BIRD on the local host, not the external network. Then, on BIRD, you'd configure peering with the external network. The external network (Top of Rack switch, etc.) would be configured to peer only with the BIRD instance.

This would result in Cilium advertising its routes to the local BIRD instance, which would in turn advertise them to the upstream network. As BIRD can program the kernel's routing table, routes received by it would be reflected in the host's routing table.

To go down this path, you would need to configure Cilium in Native Routing Mode (not Encapsulated).
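For context, native routing is selected via Cilium's Helm values; a minimal sketch (the CIDR is a placeholder for your pod network, and the exact key names can vary between Cilium versions, so check the docs for the release you run):

```yaml
# Cilium Helm values sketch: native routing instead of encapsulation.
routingMode: native
ipv4NativeRoutingCIDR: "10.0.0.0/8"   # placeholder: CIDR reachable without encap
bgpControlPlane:
  enabled: true
```

In native routing mode the kernel routing table actually decides pod-to-pod forwarding, which is why routes programmed by BIRD take effect.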
