[BPF] Defer kube-proxy construction until first hostIPs and in-sync#12667
Conversation
Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500ms — the bootstrap window between kube-proxy starting up and receiving the first hostIPs. In that window, kube-proxy.go:start() constructed the proxy with a stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any (realHostIP, nodePort) FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run. New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (nat_lookup.h:60-71) so it provides no safety net. Defer proxy construction until both the first hostIPUpdates and the first hostMetadataUpdates have arrived, then construct the Syncer (with real host IPs) and the proxy in one shot via run(). The KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so in-flight conntrack scans during the wait remain safe. Add a regression test in kube-proxy_test.go that pre-populates the front map with a real-host-IP NodePort FE entry, fires only the host-metadata gate (mimicking the bootstrap window), and asserts via Consistently that the entry survives. Closes projectcalico#12192
There was a problem hiding this comment.
Pull request overview
This PR fixes a Felix/eBPF kube-proxy startup race that could briefly remove real-host-IP NodePort frontend (FE) entries during Felix restart, causing transient external NodePort connection failures until host IPs are known.
Changes:
- Defer kube-proxy/proxy construction until both the first
hostIPUpdatesand the firsthostMetadataUpdateshave been received. - Update
KubeProxy.run()to construct the proxy lazily on the firstrun()call (using real host IPs). - Add a regression test that ensures pre-existing real-host-IP NodePort FE entries are not erased during the bootstrap window.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| felix/bpf/proxy/kube-proxy.go | Defers proxy construction until initial host IPs + host-metadata gate; adds nil-guard in Stop() and creates proxy in run() on first call. |
| felix/bpf/proxy/kube-proxy_test.go | Adds regression test reproducing bootstrap-window FE erasure and asserting marker FE entry persists until host IPs are delivered. |
| kp.lock.Lock() | ||
| defer kp.lock.Unlock() | ||
|
|
||
| close(kp.exiting) | ||
| close(kp.hostMetadataUpdates) |
There was a problem hiding this comment.
Stop() holds kp.lock while it later waits on kp.wg.Wait(). If the update loop goroutine is in the middle of kp.run() (which also takes kp.lock) when Stop() is called, Stop can deadlock waiting for the goroutine to exit while the goroutine is blocked on the lock. Consider releasing kp.lock before waiting (e.g., take a local copy of kp.proxy under lock, unlock, then Stop proxy / waitgroup).
| if kp.proxy != nil { | ||
| // kp.proxy is nil if Stop is called before start() has | ||
| // received its initial host IPs and host-metadata updates. | ||
| kp.proxy.Stop() | ||
| } |
There was a problem hiding this comment.
The new nil-check avoids a panic when Stop() is called before start() constructs kp.proxy, but it also makes it possible for Stop() to return without ever stopping a proxy that gets created concurrently shortly after this check (e.g., start() has already received initial updates and is about to assign kp.proxy). Consider preventing proxy construction once exiting is closed, or include the start() goroutine in kp.wg so Stop() waits for it to finish, ensuring a constructed proxy is always stopped.
proxy.New() wires the syncer's expand-NodePort fixup callback to the proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer() swaps in a fresh syncer (e.g. on a host-IP change inside KubeProxy.run()) but did not re-wire that callback. Result: after any host-IP change the new syncer's expand-NodePort fixup goroutine resolves previously-missed routes for ExternalTrafficPolicy=Local NodePort backends, but cannot schedule the dataplane Apply that would program them — the fix-up is silently lost until something else piggybacks an Apply. Pointed out by review on the v3.31 backport (#12693) of #12667; the asymmetry is pre-existing (it was equally broken before #12667 restructured run()) but became plain to read after the restructure. Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer under the same lock as the dpSyncer swap, mirroring what proxy.New() does for the initial syncer. Add a regression test that swaps a syncer via SetSyncer and verifies the new syncer received a non-nil trigger callback that schedules an Apply when invoked. The test fails on master before this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500ms — the bootstrap window between kube-proxy starting up and receiving the first hostIPs. In that window, kube-proxy.go:start() constructed the proxy with a stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any (realHostIP, nodePort) FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run. New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (nat_lookup.h:60-71) so it provides no safety net. Defer proxy construction until the first hostIPUpdates has arrived, then construct the Syncer (with real host IPs) and the proxy in one shot via run(). The KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so in-flight conntrack scans during the wait remain safe. Backport of #12667 adapted for release-v3.31. Master uses an explicit "host-metadata in-sync" gate (added in #11817) that doesn't exist on this branch, so the cherry-pick gates only on the first hostIPUpdate. That fully fixes the stub-Syncer case (the reported bug). A residual race remains for multi-host-IP nodes if hostIPUpdates fires with a partial set before the iface monitor's initial resync completes; the master fix's metadata gate closes that gap and is not present here. Add a regression test in kube-proxy_test.go that pre-populates the front map with a real-host-IP NodePort FE entry, starts kube-proxy without firing OnHostIPsUpdate (mimicking the bootstrap window), and asserts via Consistently that the entry survives. Closes #12192 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
proxy.New() wires the syncer's expand-NodePort fixup callback to the proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer() swaps in a fresh syncer (e.g. on a host-IP change inside KubeProxy.run()) but did not re-wire that callback. Result: after any host-IP change the new syncer's expand-NodePort fixup goroutine resolves previously-missed routes for ExternalTrafficPolicy=Local NodePort backends, but cannot schedule the dataplane Apply that would program them — the fix-up is silently lost until something else piggybacks an Apply. Pointed out by review on the v3.31 backport (#12693) of #12667; the asymmetry is pre-existing (it was equally broken before #12667 restructured run()) but became plain to read after the restructure. Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer under the same lock as the dpSyncer swap, mirroring what proxy.New() does for the initial syncer. Add a regression test that swaps a syncer via SetSyncer and verifies the new syncer received a non-nil trigger callback that schedules an Apply when invoked. The test fails on master before this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
proxy.New() wires the syncer's expand-NodePort fixup callback to the proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer() swaps in a fresh syncer (e.g. on a host-IP change inside KubeProxy.run()) but did not re-wire that callback. Result: after any host-IP change the new syncer's expand-NodePort fixup goroutine resolves previously-missed routes for ExternalTrafficPolicy=Local NodePort backends, but cannot schedule the dataplane Apply that would program them — the fix-up is silently lost until something else piggybacks an Apply. Pointed out by review on the v3.31 backport (#12693) of #12667; the asymmetry is pre-existing (it was equally broken before #12667 restructured run()) but became plain to read after the restructure. Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer under the same lock as the dpSyncer swap, mirroring what proxy.New() does for the initial syncer. Add a regression test that swaps a syncer via SetSyncer and verifies the new syncer received a non-nil trigger callback that schedules an Apply when invoked. The test fails on master before this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Description
Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500 ms — the bootstrap window between kube-proxy starting up and receiving the first hostIPs.
Root cause
In that window,
kube-proxy.go:start()constructed the proxy with a stubSyncer(onlypodNPIP, no real host IPs).proxy.New()spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any(realHostIP, nodePort)FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run.New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (
nat_lookup.h:60-71returns NULL onCALI_F_FROM_HEPwithoutfrom_tun), so it provides no safety net.The race window opened up in 3.31 because PR #11762 added enough setup work between scanner construction and
Start()(channel allocation, the newWorkloadRemoveScannerTCPgoroutine, extraEntryScannerwiring) to flip the apply-ordering relative to the kube-proxy bootstrap. Earlier versions had the same latent race but didn't trip it.Fix
Defer proxy construction until both the first
hostIPUpdatesand the firsthostMetadataUpdateshave arrived, then construct theSyncer(with real host IPs) and the proxy in one shot viarun(). TheKubeProxy.Conntrack*callbacks already nil-checkkp.syncer, so in-flight conntrack scans during the wait remain safe.Test
New regression test in
kube-proxy_test.gopre-populates the front map with a real-host-IP NodePort FE entry, fires only the host-metadata gate (mimicking the bootstrap window), and asserts viaConsistentlyover 500 ms that the entry survives. The test fails on the buggy code (Consistentlytrips at ~125 ms) and passes with the fix.Related issues/PRs
Closes #12192
Release Note
Reminder for the reviewer
release-note-requireddocs-not-required(no user-facing config change; bug fix)cherry-pick-candidate(bug fix; consider backport to active release branches)