24 commits
fa282e7
Enhances EIP status validation
trozet Sep 15, 2025
e384b37
Add OKEP for ovn-kubernetes-mcp repo
tssurya Aug 17, 2025
026546f
Add OKEP to website
tssurya Sep 24, 2025
f5931bf
Merge pull request #5496 from tssurya/okep-mcp-server-troubleshooter
tssurya Sep 24, 2025
b20fb84
Merge pull request #5577 from trozet/fix_egress_ip_status_handling
trozet Sep 25, 2025
3b22f41
OKEP-5552: Add support for UDN node selector
trozet Sep 2, 2025
090d4d8
api, udn: subnets must be masked
maiqueb Sep 29, 2025
3450b35
api, l2 udn, tests: mask the subnet of the preconfigured UDN
maiqueb Sep 30, 2025
2de4d0f
Merge pull request #5553 from trozet/udn_node_selector
trozet Sep 30, 2025
60404e5
udn host->ovn flows needs to be vlan aware
cathy-zhou Sep 1, 2025
47ed714
Fix EgressIP unit test to verify cache-based allocation stability
andreaskaris Sep 30, 2025
82209bc
api, l2 udn, tests: add e2e tests to assert API errors are caught
maiqueb Sep 30, 2025
a996442
make primary UDN node mode aware
cathy-zhou Sep 2, 2025
152d434
Merge pull request #5608 from andreaskaris/issue5607
trozet Oct 2, 2025
ae113e7
Merge pull request #5588 from maiqueb/udn-require-masked-subnets
trozet Oct 3, 2025
0b248a2
Merge pull request #5554 from cathy-zhou/upstream_udn_fix
trozet Oct 3, 2025
434b48f
kubevirt e2e: use a value for vm nodeselector
jcaamano Oct 2, 2025
4c34982
kubevirt: fix bad release of IPs of live migratable pods
jcaamano Oct 3, 2025
7a155cc
kubevirt: prevent error log on IP release
jcaamano Oct 3, 2025
0dc8f27
kubevirt: fix search of colliding pods for migrated pods
jcaamano Oct 3, 2025
c1b02b5
kubevirt: test OVN DB after completion of source pod
jcaamano Oct 3, 2025
ef92f78
kubevirt: test with per-pod SNATs
jcaamano Oct 3, 2025
442e32c
add lint target to run golanci natively
jluhrsen Sep 30, 2025
8e457b1
Merge branch 'pr/5617' into lint_fix_ds
kyrtapz Oct 8, 2025
8 changes: 8 additions & 0 deletions dist/templates/k8s.ovn.org_clusteruserdefinednetworks.yaml.j2
@@ -313,6 +313,14 @@ spec:
rule: '!has(self.infrastructureSubnets) || !has(self.reservedSubnets)
|| self.infrastructureSubnets.all(infra, !self.reservedSubnets.exists(reserved,
cidr(infra).containsCIDR(reserved) || cidr(reserved).containsCIDR(infra)))'
- message: infrastructureSubnets must be a masked network address
(no host bits set)
rule: '!has(self.infrastructureSubnets) || self.infrastructureSubnets.all(s,
isCIDR(s) && cidr(s) == cidr(s).masked())'
- message: reservedSubnets must be a masked network address (no
host bits set)
rule: '!has(self.reservedSubnets) || self.reservedSubnets.all(s,
isCIDR(s) && cidr(s) == cidr(s).masked())'
layer3:
description: Layer3 is the Layer3 topology configuration.
properties:
8 changes: 8 additions & 0 deletions dist/templates/k8s.ovn.org_userdefinednetworks.yaml.j2
@@ -257,6 +257,14 @@ spec:
rule: '!has(self.infrastructureSubnets) || !has(self.reservedSubnets)
|| self.infrastructureSubnets.all(infra, !self.reservedSubnets.exists(reserved,
cidr(infra).containsCIDR(reserved) || cidr(reserved).containsCIDR(infra)))'
- message: infrastructureSubnets must be a masked network address
(no host bits set)
rule: '!has(self.infrastructureSubnets) || self.infrastructureSubnets.all(s,
isCIDR(s) && cidr(s) == cidr(s).masked())'
- message: reservedSubnets must be a masked network address (no host
bits set)
rule: '!has(self.reservedSubnets) || self.reservedSubnets.all(s,
isCIDR(s) && cidr(s) == cidr(s).masked())'
layer3:
description: Layer3 is the Layer3 topology configuration.
properties:
798 changes: 798 additions & 0 deletions docs/okeps/okep-5494-ovn-kubernetes-mcp-server.md

Large diffs are not rendered by default.

205 changes: 205 additions & 0 deletions docs/okeps/okep-5552-dynamic-udn-node-allocation.md
@@ -0,0 +1,205 @@
# OKEP-5552: Dynamic UDN Node Allocation

* Issue: [#5552](https://github.com/ovn-org/ovn-kubernetes/issues/5552)

## Problem Statement

When scaling UDNs, the control-plane cost of rendering a topology is high. This is the core factor limiting scaling
to thousands of UDNs. While there are plans to also improve network controller performance with UDNs,
there are still valuable savings to be had by not rendering UDNs on nodes where they are not needed.

An example use case where this makes sense is a Kubernetes cluster whose node resources are segmented per tenant. In
this case, it only makes sense to run the tenant network (UDN) on the nodes where that tenant is allowed to run pods. This
allows horizontal scaling to a much higher overall number of UDNs running in a cluster.

## Goals

* To dynamically allow the network to be rendered only on specific nodes.
* To increase the overall scalability of the number of UDNs in a Kubernetes cluster with this solution.
* To increase the efficiency of ovnkube operations on nodes where a UDN exists but is not needed.

## Non-Goals

* To fully solve control plane performance issues with UDNs. Several other fixes outside of this enhancement will
address that.
* To provide any type of network security guarantee about exposing UDNs to a limited subset of nodes.

## Future Goals

* Potentially enabling this feature on a per UDN basis, rather than globally.

## Introduction

The purpose of this feature is to add a configuration knob that users can turn on so that UDNs are rendered only on nodes
where pods exist on that UDN. This feature allows for higher overall UDN scale and lower per-node control plane resource usage
under conditions where clusters do not have pods on every node with connections to all UDNs. For example, with
1000 UDNs and 500 nodes, if a particular node only has pods connected to, say, 200 of those UDNs, then that node is only
responsible for rendering 200 UDNs instead of 1000 UDNs as it is today.

This can provide significant control plane savings, but it comes at a cost. Using the previous example, if a pod is now
launched in UDN 201, the node will have to render UDN 201 before the pod can be wired. In other words, this introduces
a one-time, larger pod-latency cost for the first pod wired to the UDN. Additionally, there are further tradeoffs with other
feature limitations outlined later in this document.

## User-Stories/Use-Cases

Story 1: Segment groups of nodes per tenant

As a cluster admin, I plan to dedicate groups of nodes to either a single tenant or a small group of tenants. I plan
to create a CUDN per tenant, which means my network will only really need to exist on this group of nodes. I would
like to be able to limit this network to be rendered only on that subset of nodes.
This way I will have less resource overhead from OVN-Kubernetes on each node,
and will be able to scale to a higher number of UDNs in my cluster.

## Proposed Solution

The proposed solution is to add a configuration knob to OVN-Kubernetes, "--dynamic-udn-allocation", which will enable
this feature. Once enabled, NADs derived from CUDNs and UDNs will only be rendered on nodes where there is a pod
scheduled in that respective network. Additionally, if the node is scheduled as an Egress IP Node for a UDN, this node
will also render the UDN.

When the last pod on the network is deleted from a node, OVNK will not immediately tear down the UDN.
Instead, OVNK will rely on a dead timer to expire to conclude that this UDN is no longer in use and
may be removed. This timer will also be configurable in OVN-Kubernetes as "--udn-deletion-grace-period".

### API Details

There will be no API schema changes. New status conditions are introduced in the Status Condition and Metric Changes section below.

### Implementation Details

In OVN-Kubernetes we have three main controllers that handle rendering of networking features for UDNs:
- Cluster Manager - runs on the control plane; handles cluster-wide allocation and rendering of CUDNs/UDNs
- Controller Manager - runs on a per-zone basis; handles configuring OVN for all networking features
- Node Controller Manager - runs on a per-node basis; handles configuring node-specific things like nftables, VRFs, etc.

With this change, Cluster Manager will be largely untouched, while Controller Manager and Node Controller Manager will be
modified in a few places to filter out rendering of UDNs when no pod for the network exists on the node.

#### Internal Controller Details

In OVN-Kubernetes we have many controllers that handle features for different networks, encompassed under three
controller manager containers. The breakdown of how these will be modified is outlined below:

* Cluster Manager
* UDN Controller — No change
* Route Advertisements Controller — No change
* Egress Service Cluster — Doesn't support UDN
* Endpoint Mirror Controller — No change
* EgressIP Controller — No change
* Unidling Controller — No change
* DNS Resolver — No change
* Network Cluster Controller — Modified to report status and exclude nodes not serving the UDN
* Controller Manager (ovnkube-controller)
* Default Network — No change
* NAD Controller — Ignore NADs for UDNs that are not active on this node (no pods for the UDN and not an EIP node)
* Node Controller Manager
* Default Network — No change
* NAD Controller — Ignore NADs for UDNs that are not active on this node (no pods for the UDN and not an EIP node)

The resulting NAD Controller change will filter out NADs that do not apply to this node, stopping NAD keys from being
enqueued to the Controller Manager/Node Controller Manager's Network Manager. Those Controller Managers will not need
to create or run any sub-controllers for networks that the node does not have. To do this cleanly, NAD Controller will be
modified to hold a filterFunc field, which the respective controller manager can set in order to filter out NADs. For
Cluster Manager, this function will not apply, but for Controller Manager and Node Controller Manager it will be a function
that filters based on whether the UDN is serving pods on this node.
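A minimal sketch of what such a filterFunc could look like, assuming a tracker that knows the local pod count and EgressIP status per network. All type and field names here are hypothetical, not the actual ovn-kubernetes types:

```go
package main

import "fmt"

// nad is a stand-in for a NetworkAttachmentDefinition reference.
type nad struct {
	network string
}

// nodeNetworkTracker answers whether a network is active on the local node:
// it has at least one pod, or the node serves an EgressIP for it.
type nodeNetworkTracker struct {
	podsOnNode map[string]int  // network -> local pod count
	eipNode    map[string]bool // network -> local node serves an EgressIP
}

// filterFunc returns true when the NAD should be processed by this manager.
func (t *nodeNetworkTracker) filterFunc(n *nad) bool {
	return t.podsOnNode[n.network] > 0 || t.eipNode[n.network]
}

func main() {
	t := &nodeNetworkTracker{
		podsOnNode: map[string]int{"tenant-a": 2},
		eipNode:    map[string]bool{"tenant-b": true},
	}
	fmt.Println(t.filterFunc(&nad{network: "tenant-a"})) // pod present: true
	fmt.Println(t.filterFunc(&nad{network: "tenant-b"})) // EgressIP node: true
	fmt.Println(t.filterFunc(&nad{network: "tenant-c"})) // inactive: false
}
```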

#### New Pod/EgressIP Tracker Controller

In order to know whether the Managers should filter out a UDN, a pod controller and an egress IP controller will be used
in the Managers to track this information in memory. The pod controller will be a new level-driven controller for
each manager. For Egress IP, another new controller will be introduced that watches EgressIPs, Namespaces, and NADs in
order to track which NAD maps to a node serving an Egress IP.

When Managers are created, they will start these Pod/EgressIP Tracker Controllers, and set a filterFunc on NAD Controller.
The filterFunc will query the aforementioned controllers to determine if the NAD being synced matches the local node. If
not, then NADController will not create the UDN controller for that network.

Additionally, the Pod/EgressIP Tracker Controllers will expose a callback function, called "onNetworkRefChange". When
the first pod is detected as coming up on a node + NAD combination, or the node activates as an Egress IP node for the
first time, onNetworkRefChange will be triggered, providing a callback mechanism for these events. The
Controller Manager and Node Controller Manager will leverage this callback so that they can trigger NAD Controller to
reconcile the NAD for these events. This is important as it provides a way to signal that NAD Controller should remove
a UDN controller if it is no longer active, or, alternatively, force NAD Controller to reconcile a UDN controller if,
for example, a new remote node has activated.

#### Other Controller Changes

The Layer3 network controller will need to filter out nodes where the UDN is not rendered. Upon receiving events,
it will query a Manager function called NodeHasNAD. Managers will export a Tracker interface that contains only this
method for UDN controllers to query. The implementation of NodeHasNAD will rely on the Manager querying its pod and
egress IP trackers.
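A possible shape for the exported Tracker interface, assuming the Manager consults in-memory pod and EgressIP state (the concrete data layout is an assumption for this sketch):

```go
package main

import "fmt"

// Tracker is the narrow interface exposed to UDN controllers; only
// NodeHasNAD is visible to them.
type Tracker interface {
	NodeHasNAD(node, nadKey string) bool
}

// manager implements Tracker by consulting its pod and egress IP trackers.
type manager struct {
	podNodes map[string]map[string]bool // nadKey -> set of nodes with pods
	eipNodes map[string]map[string]bool // nadKey -> set of EgressIP nodes
}

func (m *manager) NodeHasNAD(node, nadKey string) bool {
	return m.podNodes[nadKey][node] || m.eipNodes[nadKey][node]
}

func main() {
	var t Tracker = &manager{
		podNodes: map[string]map[string]bool{"ns/blue": {"node1": true}},
		eipNodes: map[string]map[string]bool{"ns/blue": {"node2": true}},
	}
	fmt.Println(t.NodeHasNAD("node1", "ns/blue")) // pod on node1: true
	fmt.Println(t.NodeHasNAD("node2", "ns/blue")) // EgressIP on node2: true
	fmt.Println(t.NodeHasNAD("node3", "ns/blue")) // not rendered: false
}
```

Keeping the interface to a single method limits what UDN controllers can depend on, so the trackers can evolve behind the Manager.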

When a UDN activates on a remote node, these controllers will need to receive events in order to reconcile the new remote node.
To do this, the corresponding tracker will trigger its callback, "onNetworkRefChange". That will cause the Manager
to ask NAD Controller to reconcile the UDN controller belonging to this NAD. Once that Layer 3 UDN controller reconciles,
it will walk the nodes and determine what needs to be added or removed. It will take the applicable nodes, set their
syncZoneICFailed status, then immediately queue the objects to the retry framework with no backoff. This will allow
the Zone IC (ZIC) controller to properly configure the transit switch with the remote peers, or tear it down if necessary.

#### Status Condition and Metric Changes

A new status condition will be added to CUDN/UDN that will indicate how many nodes are selected for a network:
```yaml
status:
  conditions:
  - type: NodesSelected
    status: "True"
    reason: DynamicAllocation
    message: "5 nodes rendered with network"
    lastTransitionTime: "2025-09-22T20:10:00Z"
```

If the status is "False", then no nodes are currently allocated for the network - no pods or egress IPs assigned.

Cluster Manager will leverage instances of the EgressIP and Pod Trackers in order to use that data for updating this status.
A node serving a network is defined as one with at least one OVN-networked pod, or with an Egress IP assigned to it, on a
NAD that maps to a UDN or CUDN.

Additionally, events will be posted to the corresponding UDN/CUDN when nodes become active or inactive for
the network. This was chosen instead of per-node status events, as those can lead to scale issues. Using events provides
the audit trail without those scale implications. The one drawback of this approach pertains to UDN deactivation. There
is a "--udn-deletion-grace-period" timer used to delay deactivation of a UDN on a node. This is to prevent churn if a pod
is deleted, then almost immediately re-added. Without storing the timestamp in the API, we are relying internally on
in-memory data. While this is fine for normal operation, if the OVN-Kubernetes pod restarts, we lose that context. However,
this should be fine, as on restart we have to walk and start all network controllers anyway, so we are not really creating
a lot of extra work for OVN-Kubernetes here.

A metric will also be exposed that allows the user to track over time how many nodes were active for a particular
network.
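Under the definition above, the per-network active-node count is the size of the union of pod-bearing nodes and EgressIP nodes. A small sketch of that computation (the data layout and function name are assumptions, not the actual metric plumbing):

```go
package main

import "fmt"

// activeNodeCount returns the number of distinct nodes serving a network:
// nodes with at least one pod on it, plus nodes with an EgressIP assigned
// for it, counted once each.
func activeNodeCount(podNodes, eipNodes map[string]map[string]bool, network string) int {
	nodes := map[string]bool{}
	for n := range podNodes[network] {
		nodes[n] = true
	}
	for n := range eipNodes[network] {
		nodes[n] = true
	}
	return len(nodes)
}

func main() {
	podNodes := map[string]map[string]bool{"blue": {"node1": true, "node2": true}}
	eipNodes := map[string]map[string]bool{"blue": {"node2": true, "node3": true}}
	// node2 is counted once even though it has both a pod and an EgressIP.
	fmt.Println(activeNodeCount(podNodes, eipNodes, "blue")) // prints: 3
}
```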

### Testing Details

* Unit tests will be added to ensure the behavior works as expected, including checking that
OVN switches/routers are not created when there is no pod/egress IP active on the node, etc.
* E2E tests will be added to create a CUDN/UDN with the feature enabled and ensure pod traffic works correctly between nodes.
* Benchmark/scale testing will be done to show the resource savings with thousands of nodes and thousands of UDNs.

### Documentation Details

* User-Defined Network feature documentation will be updated with a user guide for this new feature.

## Risks, Known Limitations and Mitigations

Risks:
* Additional first-pod cold start latency per UDN/node. Could impact pod readiness SLOs.
* Burst reconcile load on large rollouts of pods on inactive nodes.

Limitations:
* No OVN central support.
* NodePort/ExternalIP services with external traffic policy "Cluster" will not work when sending traffic to inactive nodes.
* MetalLB must be configured on nodes where the UDN is rendered. This can be achieved by scheduling a daemonset for the designated nodes on the UDN.

## OVN Kubernetes Version Skew

Targeted for release 1.2.

## Alternatives

Specifying a NodeSelector in the CUDN/UDN CRD to determine where a network should be rendered. This was the
initial idea of this enhancement, but it was evaluated as less desirable than dynamic allocation. Dynamic allocation
provides more flexibility without a user/admin needing to intervene and update a CRD.

## References

None
3 changes: 2 additions & 1 deletion go-controller/Makefile
@@ -102,7 +102,8 @@ ifeq ($(CONTAINER_RUNNABLE), 0)
@GOPATH=${GOPATH} ./hack/lint.sh $(CONTAINER_RUNTIME) fix || { echo "ERROR: lint fix failed! There is a bug that changes file ownership to root \
when this happens. To fix it, simply run 'chown -R <user>:<group> *' from the repo root."; exit 1; }
else
echo "linter can only be run within a container since it needs a specific golangci-lint version"; exit 1
echo "no container runtime. attempting to run natively";
@GOPATH=${GOPATH} ./hack/lint.sh run-natively || { echo "running lint locally failed!"; exit 1; }
endif

gofmt:
28 changes: 19 additions & 9 deletions go-controller/hack/lint.sh
@@ -1,18 +1,28 @@
#!/usr/bin/env bash
VERSION=v1.64.8
extra_flags=""
: "${GOLANGCI_LINT_VERSION:=$VERSION}"
extra_flags=(--verbose --print-resources-usage --modules-download-mode=vendor --timeout=15m0s)
if [ "$#" -ne 1 ]; then
if [ "$#" -eq 2 ] && [ "$2" == "fix" ]; then
extra_flags="--fix"
extra_flags+=(--fix)
else
echo "Expected command line argument - container runtime (docker/podman) got $# arguments: $@"
echo "Expected command line argument - container runtime (docker/podman) or 'run-natively'; got $# arguments: $*"
exit 1
fi
fi

$1 run --security-opt label=disable --rm \
-v ${HOME}/.cache/golangci-lint:/cache -e GOLANGCI_LINT_CACHE=/cache \
-v $(pwd):/app -w /app -e GO111MODULE=on docker.io/golangci/golangci-lint:${VERSION} \
golangci-lint run --verbose --print-resources-usage \
--modules-download-mode=vendor --timeout=15m0s ${extra_flags} && \
echo "lint OK!"
if [ "$1" = "run-natively" ]; then
mkdir -p /tmp/local/bin/
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b /tmp/local/bin/ "${GOLANGCI_LINT_VERSION}"
mkdir -p /tmp/golangci-cache
export GOLANGCI_LINT_CACHE=/tmp/golangci-cache
/tmp/local/bin/golangci-lint run "${extra_flags[@]}" && \
echo "lint OK!"
else
$1 run --security-opt label=disable --rm \
-v "${HOME}"/.cache/golangci-lint:/cache -e GOLANGCI_LINT_CACHE=/cache \
-v "$(pwd)":/app -w /app -e GO111MODULE=on docker.io/golangci/golangci-lint:"${VERSION}" \
golangci-lint run "${extra_flags[@]}" && \
echo "lint OK!"
fi

14 changes: 14 additions & 0 deletions go-controller/pkg/clustermanager/egressip_controller.go
@@ -129,6 +129,15 @@ func (eIPC *egressIPClusterController) getAllocationTotalCount() float64 {
return float64(count)
}

func (e *egressNode) hasAllocatedEgressIP(name string, eip string) bool {
for ip, egressIPName := range e.allocations {
if egressIPName == name && ip == eip {
return true
}
}
return false
}

// nodeAllocator contains all the information required to manage EgressIP assignment to egress node. This includes assignment
// of EgressIP IPs to nodes and ensuring the egress nodes are reachable. For cloud nodes, it also tracks limits for
// IP assignment to each node.
@@ -865,6 +874,7 @@ func (eIPC *egressIPClusterController) addAllocatorEgressIPAssignments(name stri
defer eIPC.nodeAllocator.Unlock()
for _, status := range statusAssignments {
if eNode, exists := eIPC.nodeAllocator.cache[status.Node]; exists {
klog.V(5).Infof("Setting egress IP node allocation - node: %s, EIP name: %s, IP: %s", eNode.name, name, status.EgressIP)
eNode.allocations[status.EgressIP] = name
}
}
@@ -1423,6 +1433,10 @@ func (eIPC *egressIPClusterController) validateEgressIPStatus(name string, items
klog.Errorf("Allocator error: EgressIP: %s claims multiple egress IPs on same node: %s, will attempt rebalancing", name, eIPStatus.Node)
validAssignment = false
}
if !eNode.hasAllocatedEgressIP(name, eIPStatus.EgressIP) {
klog.Errorf("Allocator error: EgressIP: %s has mismatch between status and cache for node: %s with IP: %s", name, eIPStatus.Node, eIPStatus.EgressIP)
validAssignment = false
}
if !eNode.isEgressAssignable {
klog.Errorf("Allocator error: EgressIP: %s assigned to node: %s which does not have egress label, will attempt rebalancing", name, eIPStatus.Node)
validAssignment = false