Cherry-pick BGP fixes into 4.19 #2731

Closed
tssurya wants to merge 22 commits into openshift:release-4.19 from tssurya:cherry-pick-bgp-fixes-into-4.19

Conversation


tssurya (Contributor) commented Aug 21, 2025

This PR cherry-picks ovn-kubernetes/ovn-kubernetes#5140 and ovn-kubernetes/ovn-kubernetes#5463 and ovn-kubernetes/ovn-kubernetes#5276 into 4.19.

Ideally we would sync all of the code from 4.20/master into 4.19.

However, merging that code now would require golang 1.24 in 4.19 to build the image, and the ART team tells us that is not available yet: https://issues.redhat.com/browse/ART-14014
We don't know when that will happen, and we don't want to wait, since BGP GA is part of this sprint's goal.

tssurya added 14 commits August 21, 2025 15:00
Today, when the default network or UDN networks are
advertised using RouteAdvertisements, the nodes also learn
the routes to other nodes' pod subnets in the cluster.

Example snippet of exposing a UDN network in the
non-VRF-Lite use case:

root@ovn-worker2:/# ip r show table 1048
default via 172.18.0.1 dev breth0 mtu 1400
10.96.0.0/16 via 169.254.0.4 dev breth0 mtu 1400
10.244.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
10.244.2.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20
103.103.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
103.103.1.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20
169.254.0.3 via 203.203.1.1 dev ovn-k8s-mp12
169.254.0.34 dev ovn-k8s-mp12 mtu 1400
172.26.0.0/16 nhid 41 via 172.18.0.5 dev breth0 proto bgp metric 20
203.203.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
203.203.0.0/16 via 203.203.1.1 dev ovn-k8s-mp12
203.203.1.0/24 dev ovn-k8s-mp12 proto kernel scope link src 203.203.1.2
local 203.203.1.2 dev ovn-k8s-mp12 proto kernel scope host src 203.203.1.2
broadcast 203.203.1.255 dev ovn-k8s-mp12 proto kernel scope link src 203.203.1.2
203.203.2.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20

root@ovn-worker2:/# ip r show table 1046
default via 172.18.0.1 dev breth0 mtu 1400
10.96.0.0/16 via 169.254.0.4 dev breth0 mtu 1400
10.244.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
10.244.2.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20
103.103.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
103.103.0.0/16 via 103.103.2.1 dev ovn-k8s-mp11
103.103.1.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20
103.103.2.0/24 dev ovn-k8s-mp11 proto kernel scope link src 103.103.2.2
local 103.103.2.2 dev ovn-k8s-mp11 proto kernel scope host src 103.103.2.2
broadcast 103.103.2.255 dev ovn-k8s-mp11 proto kernel scope link src 103.103.2.2
169.254.0.3 via 103.103.2.1 dev ovn-k8s-mp11
169.254.0.32 dev ovn-k8s-mp11 mtu 1400
172.26.0.0/16 nhid 41 via 172.18.0.5 dev breth0 proto bgp metric 20
203.203.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
203.203.2.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20
root@ovn-worker2:/#

this happens because we import routes from the
default VRF:

      prefixes:
      - 103.103.0.0/24
      - 2014:100:200::/64
      - 2016:100:200::/64
      - 203.203.0.0/24
    - asn: 64512
      imports:
      - vrf: default
      vrf: mp11-udn-vrf
    - asn: 64512
      imports:
      - vrf: default
      vrf: mp12-udn-vrf
  nodeSelector:
    matchLabels:
      kubernetes.io/hostname: ovn-worker
  raw: {}

root@ovn-worker2:/# ip r
default via 172.18.0.1 dev breth0 mtu 1400
10.96.0.0/16 via 169.254.0.4 dev breth0 mtu 1400
10.244.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
10.244.2.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20
103.103.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
103.103.1.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20
169.254.0.3 via 203.203.1.1 dev ovn-k8s-mp12
169.254.0.34 dev ovn-k8s-mp12 mtu 1400
172.26.0.0/16 nhid 41 via 172.18.0.5 dev breth0 proto bgp metric 20
203.203.0.0/24 nhid 39 via 172.18.0.4 dev breth0 proto bgp metric 20
203.203.0.0/16 via 203.203.1.1 dev ovn-k8s-mp12
203.203.1.0/24 dev ovn-k8s-mp12 proto kernel scope link src 203.203.1.2
local 203.203.1.2 dev ovn-k8s-mp12 proto kernel scope host src 203.203.1.2
broadcast 203.203.1.255 dev ovn-k8s-mp12 proto kernel scope link src 203.203.1.2
203.203.2.0/24 nhid 40 via 172.18.0.3 dev breth0 proto bgp metric 20

which directly breaks UDN isolation.

In this commit we remove the support for receiving routes: advertising
a network will only advertise its routes, and we will no longer
make the nodes receive these routes. However, in the future, when we support
overlay mode with BGP, we will need to re-add these routes and design
a better isolation model between UDNs within the cluster, if that is
desired.
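
For illustration, a minimal way to verify this on a node after the change, reusing VRF routing table 1048 from the snippet above (once imports are removed it should no longer list any bgp-learned routes):

	ip route show table 1048 proto bgp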

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 410550f)
This is a temporary commit - we need a proper followup.
Please see ovn-kubernetes/ovn-kubernetes#5407
for details.

As of today, all NATs created by OVN-Kubernetes are unique
based on the existing tuple comparison in IsEquivalentNAT: uuid,
type of SNAT, logicalIP, logicalPort, externalIP, externalIDs.

So it's OK to get rid of match there. But it's not the correct way to
fix this - in the future we might have two NATs where all other
fields except match are the same.
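
As a rough way to eyeball those distinguishing fields directly in the NB DB, something like the following can be used (column names as they appear in the NAT dumps later in this PR):

	ovn-nbctl --format=table --columns=_uuid,type,logical_ip,logical_port,external_ip,external_ids list NAT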

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit ea1b6a0)
This PR adds SNAT for advertised
UDNs and the CDN when the destination of the traffic
is another node in the cluster.

This is a design change for BGP compared to
before (where pod->node traffic was not SNATed
and the podIP was preserved).

For normal UDNs we have 2 SNATs.

L3 UDN SNATs:

1) this cSNAT is added to ovn_cluster_router
for LGW egress traffic and SGW KAPI/DNS traffic:

_uuid               : 5485a25f-7a83-4dc0-840c-bbfbd0784aad
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-green-network, "k8s.ovn.org/topology"=layer3}
external_ip         : "169.254.0.38"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "203.203.0.0/24"
logical_port        : rtos-cluster_udn_tenant.green.network_ovn-control-plane
match               : "eth.dst == 0a:58:cb:cb:00:02"
options             : {stateless="false"}
priority            : 0
type                : snat

2) this SNAT is added to GR for SGW egress traffic:

_uuid               : d85fd65f-e3f3-4d52-95f9-5f88c925aa5a
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-green-network, "k8s.ovn.org/topology"=layer3}
external_ip         : "169.254.0.37"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "203.203.0.0/16"
logical_port        : []
match               : ""
options             : {stateless="false"}
priority            : 0
type                : snat

for L2, we have the following two SNATs both on GR:

_uuid               : a4b9942f-ec1a-42ca-81d9-3e4885ff2470
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-blue-network, "k8s.ovn.org/topology"=layer2}
external_ip         : "169.254.0.36"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "93.93.0.0/16"
logical_port        : rtoj-GR_cluster_udn_tenant.blue.network_ovn-control-plane
match               : "eth.dst == 0a:58:5d:5d:00:02"
options             : {stateless="false"}
priority            : 0
type                : snat

and

_uuid               : 24164866-da95-4b6f-9c65-8b16fa202758
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-blue-network, "k8s.ovn.org/topology"=layer2}
external_ip         : "169.254.0.35"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "93.93.0.0/16"
logical_port        : []
match               : "outport == \"rtoe-GR_cluster_udn_tenant.blue.network_ovn-control-plane\""
options             : {stateless="false"}
priority            : 0
type                : snat

now with advertised networks these will change to:

_uuid               : a4b9942f-ec1a-42ca-81d9-3e4885ff2470
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-blue-network, "k8s.ovn.org/topology"=layer2}
external_ip         : "169.254.0.36"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "93.93.0.0/16"
logical_port        : rtoj-GR_cluster_udn_tenant.blue.network_ovn-control-plane
match               : "eth.dst == 0a:58:5d:5d:00:02 && (ip4.dst == $a712973235162149816)"
options             : {stateless="false"}
priority            : 0
type                : snat

_uuid               : 24164866-da95-4b6f-9c65-8b16fa202758
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-blue-network, "k8s.ovn.org/topology"=layer2}
external_ip         : "169.254.0.35"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "93.93.0.0/16"
logical_port        : []
match               : "outport == \"rtoe-GR_cluster_udn_tenant.blue.network_ovn-control-plane\" && ip4.dst == $a712973235162149816"
options             : {stateless="false"}
priority            : 0
type                : snat

_uuid               : d85fd65f-e3f3-4d52-95f9-5f88c925aa5a
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-green-network, "k8s.ovn.org/topology"=layer3}
external_ip         : "169.254.0.37"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "203.203.0.0/16"
logical_port        : []
match               : "ip4.dst == $a712973235162149816"
options             : {stateless="false"}
priority            : 0
type                : snat

_uuid               : 5485a25f-7a83-4dc0-840c-bbfbd0784aad
allowed_ext_ips     : []
exempted_ext_ips    : []
external_ids        : {"k8s.ovn.org/network"=cluster_udn_tenant-green-network, "k8s.ovn.org/topology"=layer3}
external_ip         : "169.254.0.38"
external_mac        : []
external_port_range : "32768-60999"
gateway_port        : []
logical_ip          : "203.203.0.0/24"
logical_port        : rtos-cluster_udn_tenant.green.network_ovn-control-plane
match               : "eth.dst == 0a:58:cb:cb:00:02 && (ip4.dst == $a712973235162149816)"
options             : {stateless="false"}
priority            : 0
type                : snat

So basically we add this extra destination-IP match so that only traffic towards those destinations is SNATed to the masqueradeIP for that UDN.

Note: with this PR we break hardware offload for asymmetric traffic with BGP L2.

As for the CDN, we have one SNAT with no match on the GR, and it is changed to a cSNAT in case the default network is advertised.

NOTE: In -ds flag mode, the per-pod SNAT will have this match set.
NOTE2: For all deleteNAT scenarios we purposefully don't pass the SNAT match as a criterion.
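
To inspect which SNATs carry the conditional match on a live NB DB, a sketch like this works (the $a... address set referenced in the match is generated per network, so its name will differ):

	ovn-nbctl --format=table --columns=external_ids,logical_ip,external_ip,match find NAT type=snat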

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 15adf65)
Given that some traffic like pod->node and pod->nodeport
will be SNATed to the nodeIP for UDNs, we need ip rules for both
the masqueradeIP and the nodeIP to be present when networks are
advertised. This is nothing complicated, as keeping
the masqueradeIP rules dangling around doesn't hurt anything (I hope :))

So for pod->node it follows the normal UDN LGW egress traffic flow:

1) pod->switch->ovn_cluster_router
2) SNAT at the router to the masqueradeIP
3) ovn_cluster_router->switch->mpX
4) goes out, and then

the reply coming from outside hits these masqueradeIP rules to come
back in, since we SNATed to the masqueradeIP on the way out. So we need
both the podsubnet and masqueradeIP rules for advertised networks.

For all other traffic no SNATing is done.
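
Purely as an illustration of the shape of those rules, with placeholder values (the masqueradeIP 169.254.0.14, pod subnet 103.103.1.0/24, VRF table 1048 and preference 2000 are hypothetical here, not taken from this commit):

	ip rule add to 169.254.0.14 lookup 1048 pref 2000
	ip rule add to 103.103.1.0/24 lookup 1048 pref 2000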

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit f32731c)
This commit is a prep commit that converts
the LGW POSTROUTING chain rules from IPT
to NFT.
Why do we need to do this now?
Because for BGP we want to use the PMTUD remote-nodeIP
NFT sets to also do conditional masquerading in Local Gateway mode
when traffic leaves UDNs towards other nodes in the cluster
or other nodeports.
Given that the PMTUD rules are in NFT but the LGW and UDN masquerade rules are
in IPT, we'd need to pick one to express everything - and since we want to
move to NFT, it's better to go that route.

Below is how the rules look:

	chain ovn-kube-local-gw-masq {
		comment "OVN local gateway masquerade"
		type nat hook postrouting priority srcnat; policy accept;
		ip saddr 169.254.0.1 masquerade
		ip6 saddr fd69::1 masquerade
		jump ovn-kube-pod-subnet-masq
		jump ovn-kube-udn-masq
	}

	chain ovn-kube-pod-subnet-masq {
		ip saddr 10.244.2.0/24 masquerade
		ip6 saddr fd00:10:244:1::/64 masquerade
	}

	chain ovn-kube-udn-masq {
		comment "OVN UDN masquerade"
		ip saddr != 169.254.0.0/29 ip daddr != 10.96.0.0/16 ip saddr 169.254.0.0/17 masquerade
		ip6 saddr != fd69::/125 ip daddr != fd00:10:96::/112 ip6 saddr fd69::/112 masquerade
	}
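
For reference, a rough nft CLI equivalent of the chains above, assuming an inet table named ovn-kubernetes (the table name is an assumption here, not taken from this commit):

	nft add table inet ovn-kubernetes
	nft add chain inet ovn-kubernetes ovn-kube-local-gw-masq '{ type nat hook postrouting priority srcnat ; policy accept ; }'
	nft add chain inet ovn-kubernetes ovn-kube-pod-subnet-masq
	nft add chain inet ovn-kubernetes ovn-kube-udn-masq
	nft add rule inet ovn-kubernetes ovn-kube-local-gw-masq ip saddr 169.254.0.1 masquerade
	nft add rule inet ovn-kubernetes ovn-kube-local-gw-masq jump ovn-kube-pod-subnet-masq
	nft add rule inet ovn-kubernetes ovn-kube-local-gw-masq jump ovn-kube-udn-masq
	nft add rule inet ovn-kubernetes ovn-kube-pod-subnet-masq ip saddr 10.244.2.0/24 masquerade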

This commit was AI assisted (Cursor with Gemini/Claude),
under my supervision/prompting/reviewing/back-and-forth iterations.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 501bcbf)

 Conflicts:
	go-controller/pkg/node/gateway_init_linux_test.go

because of OCPHACKs
Let's reuse the PMTUD address-set IPs of the remote
nodes also for the cSNAT of BGP advertised networks.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit a67872d)
This commit is valid only for default networks,
as mentioned in the title. Unlike UDNs,
where we do cSNATs in OVN on the router at the edge
before traffic leaves to the node, for the CDN everything already happens
on the node side - so we can leverage the
nodeIP masquerade bits.

if network is advertised:
	chain ovn-kube-pod-subnet-masq {
		ip saddr 10.244.2.0/24 ip daddr @remote-node-ips-v4 masquerade
		ip6 saddr fd00:10:244:3::/64 ip6 daddr @remote-node-ips-v6 masquerade
	}

else:

	chain ovn-kube-pod-subnet-masq {
		ip saddr 10.244.2.0/24 masquerade
		ip6 saddr fd00:10:244:3::/64 masquerade
	}
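
And a rough nft sketch of what the advertised branch relies on, again assuming an inet table named ovn-kubernetes and using node IPs from the outputs above as sample set elements:

	nft add set inet ovn-kubernetes remote-node-ips-v4 '{ type ipv4_addr ; }'
	nft add element inet ovn-kubernetes remote-node-ips-v4 '{ 172.18.0.3, 172.18.0.4 }'
	nft add rule inet ovn-kubernetes ovn-kube-pod-subnet-masq ip saddr 10.244.2.0/24 ip daddr @remote-node-ips-v4 masquerade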

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 04d48c3)
1) Remove the L2 failure limitation: since we now use nodeIPs, the
reply knows how to get back to the source node because we have routes for that.
2) Add UDN pod -> default network nodeport service (same and different node).
3) Add UDN pod -> UDN network nodeport service (same and different node) - same network.
4) Add UDN pod -> UDN network nodeport service (same and different node) - different network.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 8a65723)
In the previous commits we added SNATing to nodeIP
for the following traffic flows:

pod -> nodes
pod -> nodeports

when pods are part of advertised networks. Prior to
SNATing to nodeIPs they are SNATed at the ovn_cluster_router
to masqueradeIP before being sent into the host.

In commit ovn-kubernetes/ovn-kubernetes@75dd73f
we had converted all the breth0 flows for UDN pod-to-service traffic
that matched on the masqueradeIP as the source
to instead match on the pod subnets.

However, given that pod->node and pod->nodeport
traffic flows use the masqueradeIP as the SNAT, we now need to
re-add the masqueradeIP flows as well, to ensure that
nodeport isolation between UDNs works correctly.

Before this commit:

In LGW/SGW the flow is: UDN pod -> same-node nodeIP:nodeport in the default network ->
SNATed to the masqueradeIP of that UDN -> sent to the host -> DNATed to the clusterIP ->
hits the default flow in table=2 in br-ex:

 cookie=0xdeff105, duration=15690.053s, table=2, n_packets=0, n_bytes=0, idle_age=15690, priority=100 actions=mod_dl_dst:6e:4d:97:c0:3c:97,output:2

which sends it to the patch port of the default network, and this traffic
starts working when it shouldn't. (I mean, eventually we want
this to work, see ovn-kubernetes/ovn-kubernetes#5410,
but that's a future issue - outside this PR's scope.)

In the case of an L3 UDN advertised pod -> nodeport service in the default or another UDN network:
ovn-kubernetes/ovn-kubernetes@d63887e
is the commit where we added logic to match on the srcIP of the traffic and
accordingly route it into the respective UDN patch ports. There we use
the masqueradeIP of a particular UDN to determine what the source of the traffic
was and route it into that particular UDN's patch port, where it would blackhole
if there was no matching clusterIP NAT entry, and this is how
isolation was guaranteed.

Recently this was changed to a hard drop: ovn-kubernetes/ovn-kubernetes@dcc403c

For the L2 topology the logic is the same as above for clusterIPs, but
for nodeports the GR itself drops the packets destined
towards the other networks, since there is no LB entry present on
the GR and the destination IP is that of the router itself. That's how
isolation works there:

sample trace:
    next;
10. ls_out_apply_port_sec (northd.c:6039): 1, priority 0, uuid 2aa6ebd5
    output;
    /* output to "stor-cluster_udn_tenant.blue.network_ovn_layer2_switch", type "l3gateway" */

ingress(dp="GR_cluster_udn_tenant.blue.network_ovn-worker2", inport="rtos-cluster_udn_tenant.blue.network_ovn_layer2_switch")
-----------------------------------------------------------------------------------------------------------------------------
 0. lr_in_admission (northd.c:13232): eth.dst == 0a:58:64:41:00:03 && inport == "rtos-cluster_udn_tenant.blue.network_ovn_layer2_switch", priority 50, uuid 7f9af183
    reg9[1] = check_pkt_larger(1414);
    xreg0[0..47] = 0a:58:64:41:00:03;
    next;
 1. lr_in_lookup_neighbor (northd.c:13420): 1, priority 0, uuid d2672052
    reg9[2] = 1;
    next;
 2. lr_in_learn_neighbor (northd.c:13430): reg9[2] == 1 || reg9[3] == 0, priority 100, uuid 84ca0ef4
    mac_cache_use;
    next;
 3. lr_in_ip_input (northd.c:12824): ip4.dst == {172.18.0.4}, priority 60, uuid ea41c4e7
    drop;

Without this fix:

[FAIL] BGP: isolation between advertised networks Layer3 connectivity between networks [It] pod in the UDN should not be able to access a default network service

the access exercised by the above test works in LGW when it should not,
unlike for non-advertised UDNs where it is correctly blocked.

This commit adds back the masqueradeIP flow for advertised networks
as well, which drops all packets that were not routed by the
higher-priority pkt_mark flows at priority 250.

When 2 UDNs are advertised, this PR adds back these two flows with a masqueradeIP match:
cookie=0xdeff105, duration=127.593s, table=2, n_packets=0, n_bytes=0, priority=200,ip,nw_src=169.254.0.12 actions=drop
cookie=0xdeff105, duration=127.534s, table=2, n_packets=0, n_bytes=0, priority=200,ip,nw_src=169.254.0.14 actions=drop
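
To confirm the drop flows are present on a node, dumping table 2 of the external bridge is enough (bridge name breth0 as in the outputs above):

	ovs-ofctl dump-flows breth0 table=2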

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 10ea4ab)
Currently there are two bugs around using priority 100
for the ovn-kube-local-gw-masq chain.

EgressIP multi-NIC rules are still in legacy IPT:

[0:0] -A OVN-KUBE-EGRESS-IP-MULTI-NIC -s 10.244.2.6/32 -o eth1 -j SNAT --to-source 10.10.10.105
[0:0] -A OVN-KUBE-EGRESS-IP-MULTI-NIC -s 10.244.0.3/32 -o eth1 -j SNAT --to-source 10.10.10.105
[1:60] -A OVN-KUBE-EGRESS-IP-MULTI-NIC -s 10.244.1.3/32 -o eth1 -j SNAT --to-source 10.10.10.105

In netfilter the priority of the NAT POSTROUTING hook is 100
and not configurable (NF_IP_PRI_NAT_SRC in netfilter).

For NFTables it is the same value of 100 for the NAT POSTROUTING hook;
it is called "srcnat" in knftables and set to 100.

And this is the priority used by the egress service feature, since
that one is already converted to NFT:

	chain egress-services {
		type nat hook postrouting priority srcnat; policy accept;
		meta mark 0x000003f0 return comment "DoNotSNAT"
		snat ip to ip saddr map @egress-service-snat-v4
		snat ip6 to ip6 saddr map @egress-service-snat-v6
	}

And now that we have converted the local-gw POSTROUTING rules
to NFT as well, those rules were also at priority 100.

Unlike IPT, where we could jump to the EIP and ESVC chains
before the masquerade rules got hit, in NFT those chains are
all parallel at the same priority 100 and we don't know which one
will be hit first. Hence we need to change the priority of
ovn-kube-local-gw-masq so that the EIP/ESVC rules are hit before
the default masquerade rules.

Without this change, EIP/ESVC tests fail in CI.
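
A minimal sketch of the resulting shape of the change, recreating the base chain at a priority after srcnat (the value 105 is illustrative, not necessarily what the code picks - the point is only that it is numerically above srcnat/100 so the EIP/ESVC chains run first):

	nft add chain inet ovn-kubernetes ovn-kube-local-gw-masq '{ type nat hook postrouting priority 105 ; policy accept ; }'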

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 8f5b3d4)
Prior to this change, the remote PMTUD address sets only
considered the primary IP of each node.
While that was perhaps OK for the PMTUD use case, for BGP,
now that we reuse this address set in NFT, we need to consider
all of the IPs on the remote nodes.

So this commit changes the code from using the node internal IPs to
using the HostCIDRs annotation.
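
For illustration, the annotation this now consumes can be checked per node (annotation key k8s.ovn.org/host-cidrs, as set by ovn-kubernetes):

	kubectl get node ovn-worker2 -o yaml | grep host-cidrs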

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 659010c)
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 0635cae)
When using onModelUpdatesAllNonDefault() for
NAT updates, the match value was not updated when we
wanted to reset it. So when we went from an advertised network
to a non-advertised network, we were not changing the SNAT
match, and hence traffic was still going out with the podIP
instead of the nodeIP.

This commit fixes that.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit 5056d4d)
See ovn-kubernetes/ovn-kubernetes#5419 for details

But the traffic flow looks like this for Layer3 (v4 and v6) and Layer2 (v4):

pod in UDN A -> sameNodeIP:NodePort, i.e. 172.18.0.2:30724

1) pod (102.102.2.4) -> ovn-switch -> ovn_cluster_router (SNAT to masqueradeIP 169.254.0.14)
2) the LRP sends it to mpX
3) in the host, iptables DNAT from the nodePort to the clusterIP 10.96.96.233:8080, then it is sent to breth0
4) breth0 flows reroute the packet to UDN B's patch port
5) it hits the GR of UDN B, which DNATs from the clusterIP to a backend pod that lives on another node (103.103.1.5) and at the same time SNATs to the join IP on the OVN router, i.e. 100.65.0.4
6) the response comes back from the remote pod
7) then we see ARP requests trying to resolve the masqueradeIP of the other network, which makes total sense - so the reply fails

Network B doesn't know how to reach back to Network A's masqueradeIP, which is the srcIP.

root@ovn-control-plane:/# tcpdump -i any -nneev port 36363 or port 30724 or host 102.102.2.4 or host 169.254.0.14 or host 100.65.0.4
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
08:55:14.083364 865a53b516350_3 P   ifindex 19 0a:58:66:66:02:04 ethertype IPv4 (0x0800), length 80: (tos 0x0, ttl 64, id 53100, offset 0, flags [DF], proto TCP (6), length 60)
    102.102.2.4.42720 > 172.18.0.2.30724: Flags [S], cksum 0x14ad (incorrect -> 0x5e6c), seq 432663101, win 65280, options [mss 1360,sackOK,TS val 1239378349 ecr 0,nop,wscale 7], length 0
08:55:14.084049 ovn-k8s-mp2 In  ifindex 14 0a:58:66:66:02:01 ethertype IPv4 (0x0800), length 80: (tos 0x0, ttl 63, id 53100, offset 0, flags [DF], proto TCP (6), length 60)
    169.254.0.14.42826 > 172.18.0.2.30724: Flags [S], cksum 0x1c60 (correct), seq 432663101, win 65280, options [mss 1360,sackOK,TS val 1239378349 ecr 0,nop,wscale 7], length 0
08:55:14.084069 breth0 Out ifindex 6 6a:ed:17:fb:28:bd ethertype IPv4 (0x0800), length 80: (tos 0x0, ttl 62, id 53100, offset 0, flags [DF], proto TCP (6), length 60)
    169.254.0.14.42826 > 10.96.96.233.8080: Flags [S], cksum 0xb59f (correct), seq 432663101, win 65280, options [mss 1360,sackOK,TS val 1239378349 ecr 0,nop,wscale 7], length 0
08:55:14.084470 genev_sys_6081 Out ifindex 7 0a:58:64:58:00:04 ethertype IPv4 (0x0800), length 80: (tos 0x0, ttl 60, id 53100, offset 0, flags [DF], proto TCP (6), length 60)
    100.65.0.4.42826 > 103.103.1.5.8080: Flags [S], cksum 0xfe43 (correct), seq 432663101, win 65280, options [mss 1360,sackOK,TS val 1239378349 ecr 0,nop,wscale 7], length 0
08:55:14.085494 genev_sys_6081 P   ifindex 7 0a:58:64:58:00:02 ethertype IPv4 (0x0800), length 80: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    103.103.1.5.8080 > 100.65.0.4.42826: Flags [S.], cksum 0x1f4f (correct), seq 3390013464, ack 432663102, win 64704, options [mss 1360,sackOK,TS val 1866737591 ecr 1239378349,nop,wscale 7], length 0
08:55:14.086130 eth0  Out ifindex 2 6a:ed:17:fb:28:bd ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Request who-has 169.254.0.14 tell 169.254.0.15, length 28
08:55:14.086172 breth0 B   ifindex 6 6a:ed:17:fb:28:bd ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Request who-has 169.254.0.14 tell 169.254.0.15, length 28
08:55:15.100558 genev_sys_6081 P   ifindex 7 0a:58:64:58:00:02 ethertype IPv4 (0x0800), length 80: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    103.103.1.5.8080 > 100.65.0.4.42826: Flags [S.], cksum 0xccdf (incorrect -> 0x1b57), seq 3390013464, ack 432663102, win 64704, options [mss 1360,sackOK,TS val 1866738607 ecr 1239378349,nop,wscale 7], length 0
08:55:15.101090 eth0  Out ifindex 2 6a:ed:17:fb:28:bd ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Request who-has 169.254.0.14 tell 169.254.0.15, length 28
08:55:15.101124 breth0 B   ifindex 6 6a:ed:17:fb:28:bd ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Request who-has 169.254.0.14 tell 169.254.0.15, length 28

^ it's the same for Layer3 v6 as well, and the same for Layer2 v4 ^^

but Layer2 v6 is weird thanks to:

// cookie=0xdeff105, duration=173.245s, table=1, n_packets=0, n_bytes=0, idle_age=173, priority=14,icmp6,icmp_type=134 actions=FLOOD
// cookie=0xdeff105, duration=173.245s, table=1, n_packets=8, n_bytes=640, idle_age=4, priority=14,icmp6,icmp_type=136 actions=FLOOD

these two flows on breth0 - they seem to be flooding the NDP requests between the GRs of all networks somehow, and v6 works.
So the test acknowledges this inconsistency and calls it out.

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
(cherry picked from commit e8fc764)
openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Aug 21, 2025
tssurya changed the base branch from master to release-4.19 on August 21, 2025 13:23
openshift-merge-robot removed the needs-rebase label on Aug 21, 2025
tssurya changed the title from "Cherry pick bgp fixes into 4.19" to "Cherry-pick BGP fixes into 4.19" on Aug 21, 2025
openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 21, 2025
tssurya force-pushed the cherry-pick-bgp-fixes-into-4.19 branch from a193351 to 25c7daf on August 21, 2025 13:30

openshift-ci bot commented Aug 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jcaamano
Contributor

@tssurya any reason you didn't also bring back these changes, which are part of the epic:
ovn-kubernetes/ovn-kubernetes#5463

jcaamano and others added 6 commits August 21, 2025 20:11
Just as we currently do with traffic towards nodes.

Specifically, this allows networks advertised with a VRF-Lite
configuration that have a subnet overlap to reach these services. Otherwise
the return path could hit an ip rule corresponding to a different
advertised network, forwarding it to an inappropriate destination.

Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
(cherry picked from commit bcfce1b)
…default VRF"

This reverts commit 1ea2739.

Signed-off-by: Jaime Caamaño Ruiz <jcaamano@redhat.com>
(cherry picked from commit 7db6c99)
This global knob enables or disables pod isolation between
BGP-advertised UDN networks. Routed UDN isolation is enabled
by default. It can be disabled on kind with the -rnd or
--routed-udn-isolation-disable options while setting up the cluster.
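
For example, bringing up a kind cluster with the knob turned off might look like this (other kind.sh flags omitted; script path per the conflict note below):

	./contrib/kind.sh -rnd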

Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
(cherry picked from commit e1ac399)

Conflicts:
	contrib/kind.sh
because ovn-kubernetes/ovn-kubernetes#5466 is not there in 4.19 yet
	dist/images/daemonset.sh
because ovn-kubernetes/ovn-kubernetes#5425 is not there in 4.19 yet
When Routed UDN Isolation is disabled, ovnk must skip programming the
advertised-network isolation rules on the given node so that traffic
between advertised UDN networks can be steered out of the OVN overlay
network; with additional manual networking configuration in the
underlay network, inter-UDN traffic can then be made to work.
To facilitate this, this commit skips programming the network isolation rules
when the routed UDN isolation option is disabled.

Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
(cherry picked from commit 636eaeb)
Co-Authored-by: Peng Liu <pliu@redhat.com>
Signed-off-by: Periyasamy Palanisamy <pepalani@redhat.com>
(cherry picked from commit b1c9b28)

Conflicts:
	.github/workflows/test.yml
because ovn-kubernetes/ovn-kubernetes#5425
and ovn-kubernetes/ovn-kubernetes#5429
are not there in 4.19
…ose mode

In the advertised UDN isolation loose mode test, cross-UDN traffic
is routed by the external FRR router. Nodes send the UDN
pods' outbound traffic to the FRR router as the next hop.

Signed-off-by: Peng Liu <pliu@redhat.com>
(cherry picked from commit 01fccb7)
pliurh added 2 commits August 21, 2025 20:12
- Add ingress flows to table 0 (priority 300/301) for MEG-enabled
  pods, advertised UDNs, and node management traffic, ensuring these
  are handled earlier in the pipeline. In LGW mode, the 301 flow is
  unnecessary, as traffic to the mgmtIP will be forwarded to the host
  kernel by the 300 flow.
- Remove the corresponding lower-priority flows (priority 15/16) from
  table 1 to avoid duplication and improve processing efficiency.
- Modify the egress flows in table 0 (priority 104/103, previously 109/104)
  for advertised UDN or MEG egress traffic by not setting the CT mark and
  sending to the physical network directly.

example flows in SGW mode EIP enabled:
	table=0, n_packets=0, n_bytes=0, priority=300,ip,in_port=eth0,nw_dst=<nodeSubnet> actions=output:4
	table=0, n_packets=0, n_bytes=0, priority=301,ip,in_port=eth0,nw_dst=<mgmtIP> actions=output:LOCAL
	table=0, n_packets=0, n_bytes=0, priority=104,ip,in_port=4,dl_src=02:42:ac:12:00:03,nw_src=<nodeSubnet> actions=output:eth0
	table=0, n_packets=0, n_bytes=0, priority=103,ip,in_port=4,nw_src=<clusterSubnet> actions=drop

example flows in LGW mode EIP enabled:
	table=0, n_packets=0, n_bytes=0, priority=300,ip,in_port=eth0,nw_dst=<nodeSubnet> actions=output:LOCAL
	table=0, n_packets=0, n_bytes=0, priority=104,ip,in_port=LOCAL,dl_src=02:42:ac:12:00:03,nw_src=<nodeSubnet> actions=output:eth0
	table=0, n_packets=0, n_bytes=0, priority=103,ip,in_port=4,nw_src=<clusterSubnet> actions=drop

example flows in SGW mode EIP disabled:
	table=0, n_packets=0, n_bytes=0, priority=300,ip,in_port=eth0,nw_dst=<nodeSubnet> actions=output:4
	table=0, n_packets=0, n_bytes=0, priority=301,ip,in_port=eth0,nw_dst=<mgmtIP> actions=output:LOCAL
	table=0, n_packets=0, n_bytes=0, priority=104,ip,in_port=4,dl_src=02:42:ac:12:00:03,nw_src=<nodeSubnet> actions=output:eth0

example flows in LGW mode EIP disabled:
	table=0, n_packets=0, n_bytes=0, priority=300,ip,in_port=eth0,nw_dst=<nodeSubnet> actions=output:LOCAL
        table=0, n_packets=0, n_bytes=0, priority=104,ip,in_port=LOCAL,dl_src=02:42:ac:12:00:03,nw_src=<nodeSubnet> actions=output:eth0

Signed-off-by: Peng Liu <pliu@redhat.com>
(cherry picked from commit 28c67ea)
…solation-mode

The configuration parameter 'routed-udn-isolation' has been renamed to
'advertised-udn-isolation-mode' to more accurately reflect its purpose as
a mode of operation rather than a simple boolean toggle.

The corresponding values have been changed from 'enabled'/'disabled' to
'strict'/'loose' for better clarity:
 - 'strict' (formerly 'enabled') enforces complete isolation between UDNs.
 - 'loose' (formerly 'disabled') allows for more relaxed connectivity.
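
Assuming the renamed parameter is exposed as an ovnkube CLI option of the same name (an assumption here, not confirmed by this text), usage would look something like:

	--advertised-udn-isolation-mode=loose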

Signed-off-by: Peng Liu <pliu@redhat.com>
(cherry picked from commit 742041b)

Conflicts:
	.github/workflows/test.yml
	dist/images/daemonset.sh
because ovn-kubernetes/ovn-kubernetes#5425 and
ovn-kubernetes/ovn-kubernetes#5429 are not in 4.19
tssurya force-pushed the cherry-pick-bgp-fixes-into-4.19 branch from 25c7daf to 67994c4 on August 21, 2025 18:12
@jluhrsen
Contributor

/retest


openshift-ci bot commented Aug 22, 2025

@tssurya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/lint a193351 link true /test lint
ci/prow/e2e-gcp-ovn a193351 link true /test e2e-gcp-ovn
ci/prow/e2e-aws-ovn-local-to-shared-gateway-mode-migration a193351 link true /test e2e-aws-ovn-local-to-shared-gateway-mode-migration
ci/prow/gofmt a193351 link true /test gofmt
ci/prow/e2e-vsphere-ovn a193351 link false /test e2e-vsphere-ovn
ci/prow/e2e-aws-ovn-edge-zones a193351 link true /test e2e-aws-ovn-edge-zones
ci/prow/e2e-aws-ovn-local-gateway a193351 link true /test e2e-aws-ovn-local-gateway
ci/prow/okd-scos-e2e-aws-ovn a193351 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-hypershift-kubevirt 67994c4 link false /test e2e-aws-ovn-hypershift-kubevirt
ci/prow/e2e-aws-ovn-upgrade-ipsec 67994c4 link false /test e2e-aws-ovn-upgrade-ipsec
ci/prow/e2e-openstack-ovn a193351 link false /test e2e-openstack-ovn
ci/prow/e2e-vsphere-ovn-techpreview a193351 link false /test e2e-vsphere-ovn-techpreview
ci/prow/e2e-aws-ovn-hypershift-kubevirt a193351 link false /test e2e-aws-ovn-hypershift-kubevirt
ci/prow/e2e-aws-ovn-techpreview a193351 link false /test e2e-aws-ovn-techpreview
ci/prow/security a193351 link false /test security
ci/prow/e2e-aws-ovn-hypershift a193351 link true /test e2e-aws-ovn-hypershift
ci/prow/e2e-aws-ovn-serial a193351 link true /test e2e-aws-ovn-serial
ci/prow/4.20-upgrade-from-stable-4.19-images a193351 link true /test 4.20-upgrade-from-stable-4.19-images
ci/prow/e2e-gcp-ovn-techpreview a193351 link true /test e2e-gcp-ovn-techpreview
ci/prow/e2e-ovn-hybrid-step-registry a193351 link false /test e2e-ovn-hybrid-step-registry
ci/prow/4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade a193351 link true /test 4.20-upgrade-from-stable-4.19-e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn a193351 link true /test e2e-aws-ovn
ci/prow/lint 67994c4 link true /test lint
ci/prow/4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade a193351 link true /test 4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-hypershift-conformance-techpreview 67994c4 link false /test e2e-aws-ovn-hypershift-conformance-techpreview
ci/prow/qe-perfscale-payload-control-plane-6nodes a193351 link true /test qe-perfscale-payload-control-plane-6nodes
ci/prow/images a193351 link true /test images
ci/prow/e2e-aws-ovn-windows a193351 link true /test e2e-aws-ovn-windows
ci/prow/e2e-azure-ovn a193351 link false /test e2e-azure-ovn
ci/prow/unit a193351 link true /test unit
ci/prow/e2e-vsphere-ovn-techpreview 67994c4 link false /test e2e-vsphere-ovn-techpreview
ci/prow/e2e-aws-ovn-single-node-techpreview a193351 link false /test e2e-aws-ovn-single-node-techpreview
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview 67994c4 link false /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview
ci/prow/e2e-azure-ovn-techpreview a193351 link false /test e2e-azure-ovn-techpreview
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-techpreview 67994c4 link false /test e2e-metal-ipi-ovn-dualstack-bgp-techpreview
ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration a193351 link true /test e2e-aws-ovn-shared-to-local-gateway-mode-migration
ci/prow/e2e-aws-ovn-serial-ipsec a193351 link false /test e2e-aws-ovn-serial-ipsec
ci/prow/e2e-aws-ovn-hypershift-conformance-techpreview a193351 link false /test e2e-aws-ovn-hypershift-conformance-techpreview
ci/prow/4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade-ipsec a193351 link false /test 4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade-ipsec
ci/prow/qe-perfscale-aws-ovn-small-udn-density-churn-l3 a193351 link false /test qe-perfscale-aws-ovn-small-udn-density-churn-l3
ci/prow/okd-scos-images a193351 link true /test okd-scos-images
ci/prow/qe-perfscale-aws-ovn-small-udn-density-l3 a193351 link false /test qe-perfscale-aws-ovn-small-udn-density-l3
ci/prow/security 67994c4 link false /test security

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


tssurya commented Aug 22, 2025

/hold

We need openshift/origin#30156 to land first, or else BGP sippy rates will drop on 4.19, which we don't want.
This is also why the BGP lanes are red on this PR.
Once the origin PR gets merged, we can remove this hold - we'll need some dummy bugs for both this PR and the origin PR.

openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Aug 22, 2025

@jluhrsen
Contributor

Can we close this and just let it come in with #2733?

tssurya closed this on Aug 25, 2025