> PR diff: new file `cilium/CFP-34841-bgp-route-learning.md` (118 additions, 0 deletions), shown with inline review comments.
# CFP-34841: BGP Route Learning

**SIG: SIG-BGP, SIG-Datapath**

**Begin Design Discussion:** 2024-09-11

**Cilium Release:** X.XX

**Authors:** David Swafford <dswafford@coreweave.com>

## Summary

Cilium's Border Gateway Protocol (BGP) implementation is one-directional today. Its intended use is the advertisement of pod and service IP addresses/prefixes to the network. The network, however, is unable to influence Cilium's routing decisions via BGP or by any other means. Cilium at present expects a simple network design at the node level, where all routing needs can be satisfied by a default route. The upstream router is assumed to be capable of making all routing decisions and to have connectivity to all necessary routing domains.

In advanced network designs, the administrator may desire to establish connectivity between Cilium and several isolated routing domains. For example, one routing domain may provide connectivity to all other Kubernetes nodes, another routing domain may provide connectivity only to the Internet, while a third may provide connectivity only to storage resources. On the network upstream of Cilium, these domains may be isolated using various techniques such as Virtual Routing and Forwarding (VRF) instances, and even physical isolation with dedicated physical network interface cards (NICs).

When configured for [Native-Routing](https://docs.cilium.io/en/stable/network/concepts/routing/#native-routing) mode, it appears possible to support the advanced network design described above. However, doing so in large environments would require complex management and automation for Linux routing tables.

Proposed here is a change to support route learning via BGP.

## Motivation

### User Stories

![BGP Peering Diagram](./images/34841-bgp-route-learning.png)

*As a cluster administrator, I wish to deploy Cilium on a node which connects to two or more routing domains. The deployment is considered a success when Cilium is capable of learning non-overlapping routes using BGP from peers across each routing domain, and when Cilium is able to route outbound traffic to the correct routing domain.*


https://github.com/cilium/cilium/pull/33035/


## Goals

* When Cilium is configured to use the Border Gateway Protocol (BGP) with one or more peers, routes advertised by those peers to Cilium are accepted, evaluated through BGP's Best Path Selection algorithm, and the resulting Best Path(s) installed into the node's dataplane. The net result is that Cilium makes Internet Protocol (IP) routing decisions for Pod/Service sourced traffic based on routes learned via BGP.

## Non-Goals

* Making Cilium's BGP implementation Virtual Routing and Forwarding (VRF) instance aware.


## Proposal

### Overview

### Route Programming

Cilium supports two modes of routing -- [Encapsulation](https://docs.cilium.io/en/stable/network/concepts/routing/#encapsulation) and [Native](https://docs.cilium.io/en/stable/network/concepts/routing/#native-routing).

#### Native Routing Mode

When operating in Native Routing mode, the implementation appears straightforward. A mechanism is required to synchronize routes learned via BGP, specifically from Cilium's in-memory instance of GoBGP, to the Linux routing table. For context, GoBGP may be deployed as a standalone daemon or instantiated directly through its Go package; Cilium takes the latter approach.

GoBGP on its own is unable to install routes into the Linux routing table directly. GoBGP's documentation [suggests](https://github.com/osrg/gobgp/blob/master/docs/sources/zebra.md) running an additional routing daemon such as Quagga or FRR and establishing communication between the two.

As the number of routes received and their rate of change may vary dramatically from one environment to another, it is the author's recommendation that one of the existing GoBGP-supported daemons be selected. Under this model, an additional daemon would be deployed within Cilium's BGP-speaking pods. For a Kubernetes-based deployment, these are the pods named `cilium-<hash>`.
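Whichever daemon ends up programming the kernel, the synchronization step can be modeled as a reconciliation loop: diff the set of BGP best paths against the routes currently installed, then apply only the delta. The sketch below is illustrative only; the `Route` type and function names are assumptions for this CFP, not existing Cilium code, and the actual apply step (netlink or a helper daemon) is omitted.

```go
package main

import "fmt"

// Route is a simplified representation of a kernel route entry.
// Real code would also carry the table ID, metric, and next-hop interface.
type Route struct {
	Prefix  string
	NextHop string
}

// diffRoutes computes which routes must be added to and removed from the
// kernel table so that it converges on the current set of BGP best paths.
func diffRoutes(desired, installed []Route) (toAdd, toDel []Route) {
	want := make(map[Route]bool, len(desired))
	for _, r := range desired {
		want[r] = true
	}
	have := make(map[Route]bool, len(installed))
	for _, r := range installed {
		have[r] = true
		if !want[r] {
			toDel = append(toDel, r) // stale: no longer a best path
		}
	}
	for _, r := range desired {
		if !have[r] {
			toAdd = append(toAdd, r) // new best path not yet installed
		}
	}
	return toAdd, toDel
}

func main() {
	desired := []Route{{"10.1.0.0/24", "192.0.2.1"}, {"10.2.0.0/24", "192.0.2.2"}}
	installed := []Route{{"10.1.0.0/24", "192.0.2.1"}, {"10.9.0.0/24", "192.0.2.9"}}
	add, del := diffRoutes(desired, installed)
	fmt.Println(len(add), len(del)) // prints: 1 1
}
```

Reconciling the full delta on each BGP update, rather than replaying individual withdrawals and announcements, also makes recovery after an agent restart simpler: the loop converges regardless of what state the kernel table was left in.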

---

**Review comment (@harsimran-pabla):**

Cilium today supports multiple BGP instances. There are a couple of options: either all Cilium BGP instances peer with a single additional BGP daemon (such as FRR, BIRD, or another GoBGP+Zebra) which programs the learned routes into the kernel, or we create a 1:1 additional BGP daemon per instance and peer over a loopback IP plus a distinct port. A 1:1 mapping sounds wrong in terms of scaling and potential conflicts while installing routes into the kernel.

This seems a bit counter-intuitive:

  • Additional BGP daemon -- Cilium BGP instances -- upstream routers.
  • Alternatively, this might be better: Cilium BGP instances -- node BGP daemon (installs kernel routes) -- upstream routers.

A few questions:

  1. With this design idea, why do we need to bundle an additional BGP daemon into Cilium? To achieve a similar result, we could have a user-controlled BGP daemon deployed on the node, which peers with the upstream routers (ToRs/core routers), and Cilium peers with this BGP router on the node. Yes, there is additional complexity in managing this BGP daemon on the node; its lifecycle will have to be independent of Kubernetes.
  2. GoBGP and Zebra integration requires additional testing; this maintenance burden will fall on Cilium if we package them together and publish them as a bundle. How do we go about its maintenance?

**Author reply (@dswaffordcw, Oct 3, 2024):**

@harsimran-pabla This is great feedback! Thank you for the depth of your responses.

For your first points, I agree with you. I would not want to introduce N+ instances of a BGP daemon. Your point about conflicts when programming the kernel routing table seems very likely.

**Adjacent BGP daemon for RIB programming:** When writing the proposal, I was modeling it after GoBGP's existing support -- a BGP daemon that sits off to the side, not inline. Your point about graceful restart is important to consider here. If the remote peer (the ToR) triggers a graceful restart, the downstream BGP daemon handling RIB programming would be unaware. The GoBGP instance running via Cilium would remain running, and as long as it does, the ToR's routes should remain in the adjacent BGP daemon. That covers one direction; graceful restart in the reverse direction seems complicated.

**Inline BGP daemon for RIB programming:** This is an interesting idea. Here, Cilium's BGP instance remains the furthest downstream from the network (a stub router), and Cilium peers with an intermediary BGP daemon on the node. Under this model, would Cilium's BGP configuration reflect only the peering with the intermediary peer?

Do you have any perspective on how many users have requested route learning in the past, and how large those Cilium installations are? If I were the only one, I could see pushing the complexity back on the user (myself). In my situation, we are more than capable of managing an additional daemon on each node. But the pushback I expect is that everything on the node is deployed and managed by Kubernetes and CRDs. If I deploy and manage a separate BGP daemon, and it is now a dependency of Cilium, I would want to configure that daemon via CRDs as well. I believe I would then need to reimplement Cilium's BGP-related CRDs as a new CRD, with software to consume it and program the additional BGP daemon.

This is probably a far easier path still than implementing RIB programming in Cilium directly. I'd want to explore more what possibilities are unlocked, or issues we remove, if Cilium owned RIB programming.

**Reviewer reply (@harsimran-pabla):**

> Under this model, would Cilium's BGP configuration reflect only the peering with the intermediary peer?

Yes, that would be the case. Cilium would only peer with the on-node BGP instance, which would be listening on localhost and some specific port number.

> Do you have any perspective on how many users, and how large those Cilium installations are that have requested route learning in the past?

Not many, but enough that we are considering this feature. This is mostly for advanced topologies where multiple NICs are present on the server and some traffic engineering is going on.

> If I deploy and manage a separate BGP daemon, and it's now a dependency of Cilium, I would want to configure the additional daemon via CRDs as well.

There are a few cases where I have seen users install an additional BGP process on the node for other reasons. This comes from the need to provision the node itself prior to Kubernetes installation and advertise the node loopback address into the core network via BGP. This is not possible with Cilium BGP, since it requires the node to already be part of a Kubernetes cluster.

If you go toward installing your own BGP router on the node, I would recommend looking at it from this angle as well. Decoupling Kubernetes from this BGP process might provide some benefits.


For Kubernetes deployments, where Cilium runs within a container, the mechanism that synchronizes routes to the Linux routing table must write NOT to the container's routing table but to the underlying node's routing table. The author seeks guidance from the community on how best to approach this.

**Review comment:**

We need to explore some areas:

  1. How should graceful restart be configured in such a deployment?
  2. How do we protect the node and cluster from route hijacking?

**Author reply (@dswaffordcw):**

Re: route hijacking, do we have any protection today?

**Reviewer reply:**

Since we do not install BGP routes into the kernel, this is not an issue for now.
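One likely avenue for the container-versus-node routing table question is that the agent pod already runs in the host network namespace, so netlink writes from within the pod would land in the node's routing table rather than a container-scoped one. This is an assumption to be validated by the community, not a settled design; the fragment below only illustrates the relevant pod spec fields.

```yaml
# Illustrative fragment only. Standard Cilium installs already run the
# agent DaemonSet with hostNetwork: true, placing the pod in the node's
# network namespace; NET_ADMIN is required to modify kernel routing state.
spec:
  hostNetwork: true
  containers:
    - name: cilium-agent
      securityContext:
        capabilities:
          add:
            - NET_ADMIN
```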


#### Encapsulation Mode

The author requests guidance from the community on how to support this feature in Encapsulation mode.


### BGP Import Policy
Under [this PR](https://github.com/cilium/cilium/pull/33035), I modified Cilium to reject all BGP paths advertised toward it. This proposal will revert the majority of those changes.


### Custom Resource Definition Modifications

To mitigate the risk associated with route leaks, a new configuration option named `prefixLimit` will be added to `CiliumBGPPeeringPolicy` and `CiliumBGPPeerConfig`. Exceeding the configured prefix limit will result in the BGP session being torn down. Under the covers, GoBGP supports additional configuration for this option, as seen in [PrefixLimitConfig](https://pkg.go.dev/github.com/osrg/gobgp/internal/pkg/config#PrefixLimitConfig). It may be desirable to expose an equivalent CRD configuration option for GoBGP's `RestartTimer`.
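A sketch of how the proposed option might surface in the CRD follows. Only `prefixLimit` itself is proposed by this CFP; the sub-field names (`maxPrefixes`, `restartTimerSeconds`) and the peer-config name are placeholders mirroring GoBGP's `PrefixLimitConfig`, not a final API.

```yaml
# Hypothetical schema for illustration; the final CRD shape is undecided.
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: tor-peer-config
spec:
  prefixLimit:
    maxPrefixes: 5000        # tear down the session beyond this count
    restartTimerSeconds: 120 # optional: re-establish after teardown
```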


## Impacts / Key Questions

_List crucial impacts and key questions. They likely require discussion and are required to understand the trade-offs of the CFP. During the lifecycle of a CFP, discussion on design aspects can be moved into this section. After reading through this section, it should be possible to understand any potentially negative or controversial impact of this CFP. It should also be possible to derive the key design questions: X vs Y._

### Impact: Cilium complexity

Implementation of this feature introduces a fair amount of complexity to Cilium. Complex systems are harder to reason about. Complexity introduced includes a second BGP daemon, manipulation of the node's Linux routing table, and additional configuration options.

Further, supporting this feature in Encapsulation routing mode may require extensive changes outside of the BGP-related source code.

### Key Question: Should the feature be limited to Native-Routing mode?

When in Native-Routing mode, the scope of the changes proposed is limited to the BGP-related source code and the introduction of a second BGP daemon.

### Key Question: Does the introduction of the feature REQUIRE the administrator to advertise a default route via BGP?

With the existing behavior relying on a default route, one not learned via BGP, does the introduction of this feature remove the ability to use such a default route? Does it require the administrator to announce a default route via BGP instead?

### Option 1: Implement the Proposed Feature

#### Pros

* Unlocks new use-cases for Cilium with advanced network designs.

#### Cons

* Introduces additional complexity for both maintainers and administrators. Additional complexity for administrators resides in the fact that the underlying Linux routing table may change without notice in response to BGP routing changes.

### Option 2: Do Nothing

**Review comment:**

I think we do need to enhance Cilium's capability to install learned BGP routes. I'd add a third option to this CFP as well: pass the responsibility of installing kernel routes to Cilium itself instead of another process.

There are a few advantages to it.

Pros:

  • Lifecycle and management of BGP on the node is via Cilium (although in some scenarios this might be a drawback as well).
  • Tighter control over what gets installed in the kernel and with which priority (admin-distance) when there are multiple sources of a route (BGP instances, or other Cilium features such as auto-direct-node-routes).

Cons:

  • Complex to implement (essentially we have to implement a RIB engine inside Cilium).
  • Cilium would need to sync kernel routes on restarts.
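The "tighter control via admin-distance" point can be illustrated with a small sketch. This is illustrative only: the `candidate` type, the distances, and the selection function are assumptions for discussion, not existing Cilium code. When multiple sources offer a route for the same prefix, a node-local RIB keeps the candidate with the lowest distance.

```go
package main

import "fmt"

// candidate is a route offered by one source, e.g. a BGP instance or
// another Cilium feature such as auto-direct-node-routes.
type candidate struct {
	prefix   string
	source   string
	distance int // lower wins, analogous to router admin-distance
}

// selectRoutes keeps, per prefix, the candidate with the lowest distance.
// Ties keep the first candidate seen; real code would need a stable
// tie-breaker (e.g. source priority order).
func selectRoutes(cands []candidate) map[string]candidate {
	best := make(map[string]candidate)
	for _, c := range cands {
		if cur, ok := best[c.prefix]; !ok || c.distance < cur.distance {
			best[c.prefix] = c
		}
	}
	return best
}

func main() {
	best := selectRoutes([]candidate{
		{"10.0.1.0/24", "bgp", 20},
		{"10.0.1.0/24", "auto-direct-node-routes", 10},
	})
	fmt.Println(best["10.0.1.0/24"].source) // prints: auto-direct-node-routes
}
```

With an external daemon owning the kernel table, this arbitration between Cilium-originated routes and BGP-learned routes happens outside Cilium's control, which is the trade-off this third option removes.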

**Author reply (@dswaffordcw):**

I had assumed that implementing RIB management directly into Cilium would deviate too far from its intended scope. If we were to invest the time to implement RIB programming, I'd want to ask "Why not invest that time into implementing it in GoBGP directly" instead? cc @YutaroHayakawa

Aside from the pros/cons listed, would native RIB programming within Cilium unblock new features, or reduce existing tech debt somehow?

**Member reply (@YutaroHayakawa, Oct 4, 2024):**

> "Why not invest that time into implementing it in GoBGP directly?"

For me the answer to this question is "eventually, we may want to go in this direction, but it's too ambitious for now". I think there is a huge gap between having a generic RIB implementation for GoBGP and implementing a RIB specific to Cilium's use case.

To decouple the RIB implementation from Cilium, we would need to come up with a stable interface between the RIB and Cilium. Designing such an interface is already hard. Putting them into a single Go binary simplifies the implementation a lot: we can break the interface at any time as needed.

Once our implementation becomes mature enough, we can always consider extracting it into an independent project. However, we shouldn't set that as a goal from the beginning.

> Aside from the pros/cons listed, would native RIB programming within Cilium unblock new features, or reduce existing tech debt somehow?

It allows us to program Cilium's eBPF data plane. That leaves open the possibility of implementing data plane features that don't exist in the Linux kernel with eBPF and integrating them with BGP. If we go with the GoBGP-oriented approach, it is hard to justify support for such a specialized data plane.

===

As one engineer, I agree with you. GoBGP should get a proper generic RIB and data plane manager implementation like Zebra (they tried this in the past, but it failed), but I guess it's hard to do that in the context of the current Cilium project.


#### Pros

* Cilium remains simple to reason about.

#### Cons

* Cilium cannot be used in advanced network designs, potentially limiting Cilium adoption.

## Future Milestones

_List things that this CFP will enable but that are out of scope for now. This can help understand the greater impact of a proposal without requiring to extend the scope of a CFP unnecessarily._

### Deferred Milestone 1

Unknown at this time.
Binary file added cilium/images/34841-bgp-route-learning.png