[BPF] Defer kube-proxy construction until first hostIPs and in-sync by tomastigera · Pull Request #12667 · projectcalico/calico

tomastigera · 2026-04-30T21:16:56Z

Description

Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500 ms — the bootstrap window between kube-proxy starting up and receiving the first hostIPs.

Root cause

In that window, kube-proxy.go:start() constructed the proxy with a stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any (realHostIP, nodePort) FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run.

New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (nat_lookup.h:60-71 returns NULL on CALI_F_FROM_HEP without from_tun), so it provides no safety net.

The race window opened up in 3.31 because PR #11762 added enough setup work between scanner construction and Start() (channel allocation, the new WorkloadRemoveScannerTCP goroutine, extra EntryScanner wiring) to flip the apply-ordering relative to the kube-proxy bootstrap. Earlier versions had the same latent race but didn't trip it.

Fix

Defer proxy construction until both the first hostIPUpdates and the first hostMetadataUpdates have arrived, then construct the Syncer (with real host IPs) and the proxy in one shot via run(). The KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so in-flight conntrack scans during the wait remain safe.

Test

New regression test in kube-proxy_test.go pre-populates the front map with a real-host-IP NodePort FE entry, fires only the host-metadata gate (mimicking the bootstrap window), and asserts via Consistently over 500 ms that the entry survives. The test fails on the buggy code (Consistently trips at ~125 ms) and passes with the fix.

Related issues/PRs

Closes #12192

Release Note

ebpf - Fix transient NodePort connection failures when Felix restarts on a node receiving external NodePort traffic.

Reminder for the reviewer

release-note-required
docs-not-required (no user-facing config change; bug fix)
cherry-pick-candidate (bug fix; consider backport to active release branches)

Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500ms — the bootstrap window between kube-proxy starting up and receiving the first hostIPs. In that window, kube-proxy.go:start() constructed the proxy with a stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any (realHostIP, nodePort) FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run. New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (nat_lookup.h:60-71) so it provides no safety net. Defer proxy construction until both the first hostIPUpdates and the first hostMetadataUpdates have arrived, then construct the Syncer (with real host IPs) and the proxy in one shot via run(). The KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so in-flight conntrack scans during the wait remain safe. Add a regression test in kube-proxy_test.go that pre-populates the front map with a real-host-IP NodePort FE entry, fires only the host-metadata gate (mimicking the bootstrap window), and asserts via Consistently that the entry survives. Closes projectcalico#12192

Copilot

Pull request overview

This PR fixes a Felix/eBPF kube-proxy startup race that could briefly remove real-host-IP NodePort frontend (FE) entries during Felix restart, causing transient external NodePort connection failures until host IPs are known.

Changes:

Defer kube-proxy/proxy construction until both the first hostIPUpdates and the first hostMetadataUpdates have been received.
Update KubeProxy.run() to construct the proxy lazily on the first run() call (using real host IPs).
Add a regression test that ensures pre-existing real-host-IP NodePort FE entries are not erased during the bootstrap window.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
felix/bpf/proxy/kube-proxy.go	Defers proxy construction until initial host IPs + host-metadata gate; adds nil-guard in `Stop()` and creates proxy in `run()` on first call.
felix/bpf/proxy/kube-proxy_test.go	Adds regression test reproducing bootstrap-window FE erasure and asserting marker FE entry persists until host IPs are delivered.

Copilot · 2026-05-01T16:34:52Z

 		kp.lock.Lock()
 		defer kp.lock.Unlock()

 		close(kp.exiting)
 		close(kp.hostMetadataUpdates)


Stop() holds kp.lock while it later waits on kp.wg.Wait(). If the update loop goroutine is in the middle of kp.run() (which also takes kp.lock) when Stop() is called, Stop can deadlock waiting for the goroutine to exit while the goroutine is blocked on the lock. Consider releasing kp.lock before waiting (e.g., take a local copy of kp.proxy under lock, unlock, then Stop proxy / waitgroup).

Copilot · 2026-05-01T16:34:52Z

+		if kp.proxy != nil {
+			// kp.proxy is nil if Stop is called before start() has
+			// received its initial host IPs and host-metadata updates.
+			kp.proxy.Stop()
+		}


The new nil-check avoids a panic when Stop() is called before start() constructs kp.proxy, but it also makes it possible for Stop() to return without ever stopping a proxy that gets created concurrently shortly after this check (e.g., start() has already received initial updates and is about to assign kp.proxy). Consider preventing proxy construction once exiting is closed, or include the start() goroutine in kp.wg so Stop() waits for it to finish, ensuring a constructed proxy is always stopped.

proxy.New() wires the syncer's expand-NodePort fixup callback to the proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer() swaps in a fresh syncer (e.g. on a host-IP change inside KubeProxy.run()) but did not re-wire that callback. Result: after any host-IP change the new syncer's expand-NodePort fixup goroutine resolves previously-missed routes for ExternalTrafficPolicy=Local NodePort backends, but cannot schedule the dataplane Apply that would program them — the fix-up is silently lost until something else piggybacks an Apply. Pointed out by review on the v3.31 backport (#12693) of #12667; the asymmetry is pre-existing (it was equally broken before #12667 restructured run()) but became plain to read after the restructure. Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer under the same lock as the dpSyncer swap, mirroring what proxy.New() does for the initial syncer. Add a regression test that swaps a syncer via SetSyncer and verifies the new syncer received a non-nil trigger callback that schedules an Apply when invoked. The test fails on master before this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500ms — the bootstrap window between kube-proxy starting up and receiving the first hostIPs. In that window, kube-proxy.go:start() constructed the proxy with a stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any (realHostIP, nodePort) FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run. New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (nat_lookup.h:60-71) so it provides no safety net. Defer proxy construction until the first hostIPUpdates has arrived, then construct the Syncer (with real host IPs) and the proxy in one shot via run(). The KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so in-flight conntrack scans during the wait remain safe. Backport of #12667 adapted for release-v3.31. Master uses an explicit "host-metadata in-sync" gate (added in #11817) that doesn't exist on this branch, so the cherry-pick gates only on the first hostIPUpdate. That fully fixes the stub-Syncer case (the reported bug). A residual race remains for multi-host-IP nodes if hostIPUpdates fires with a partial set before the iface monitor's initial resync completes; the master fix's metadata gate closes that gap and is not present here. Add a regression test in kube-proxy_test.go that pre-populates the front map with a real-host-IP NodePort FE entry, starts kube-proxy without firing OnHostIPsUpdate (mimicking the bootstrap window), and asserts via Consistently that the entry survives. Closes #12192 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

proxy.New() wires the syncer's expand-NodePort fixup callback to the proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer() swaps in a fresh syncer (e.g. on a host-IP change inside KubeProxy.run()) but did not re-wire that callback. Result: after any host-IP change the new syncer's expand-NodePort fixup goroutine resolves previously-missed routes for ExternalTrafficPolicy=Local NodePort backends, but cannot schedule the dataplane Apply that would program them — the fix-up is silently lost until something else piggybacks an Apply. Pointed out by review on the v3.31 backport (#12693) of #12667; the asymmetry is pre-existing (it was equally broken before #12667 restructured run()) but became plain to read after the restructure. Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer under the same lock as the dpSyncer swap, mirroring what proxy.New() does for the initial syncer. Add a regression test that swaps a syncer via SetSyncer and verifies the new syncer received a non-nil trigger callback that schedules an Apply when invoked. The test fails on master before this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

marvin-tigera added this to the Calico v3.33.0 milestone Apr 30, 2026

marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 30, 2026

tomastigera removed the docs-pr-required Change is not yet documented label Apr 30, 2026

marvin-tigera added the docs-pr-required Change is not yet documented label Apr 30, 2026

tomastigera added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Apr 30, 2026

marvin-tigera added docs-pr-required Change is not yet documented and removed docs-not-required Docs not required for this change labels Apr 30, 2026

aaaaaaaalex mentioned this pull request May 1, 2026

Fix: start-of-day syncer could apply state without host IPs (broke NP frontends) #12675

Closed

tomastigera marked this pull request as ready for review May 1, 2026 16:28

tomastigera requested a review from a team as a code owner May 1, 2026 16:28

Copilot AI review requested due to automatic review settings May 1, 2026 16:28

Copilot started reviewing on behalf of tomastigera May 1, 2026 16:28 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

tomastigera added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels May 1, 2026

tomastigera changed the title ~~[ebpf] Defer kube-proxy construction until first hostIPs and in-sync~~ [BPF] Defer kube-proxy construction until first hostIPs and in-sync May 4, 2026

tomastigera added the cherry-pick-candidate label May 4, 2026

tomastigera assigned aaaaaaaalex May 4, 2026

tomastigera requested a review from aaaaaaaalex May 4, 2026 17:51

aaaaaaaalex approved these changes May 5, 2026

View reviewed changes

tomastigera mentioned this pull request May 5, 2026

[v3.31] [BPF] Defer kube-proxy construction until first hostIPs #12693

Merged

2 tasks

tomastigera merged commit 769c9d8 into projectcalico:master May 5, 2026
19 checks passed

tomastigera deleted the tomas/gh-12192-np-broken-when-felix-restarts branch May 5, 2026 21:43

This was referenced May 5, 2026

[v3.32] [BPF] Defer kube-proxy construction until first hostIPs and in-sync #12694

Merged

[BPF] Re-wire SetTriggerFn on syncer swap in proxy.SetSyncer #12701

Merged

This was referenced May 12, 2026

[v3.32] [BPF] Re-wire SetTriggerFn on syncer swap in proxy.SetSyncer #12743

Merged

[v3.31] [BPF] Re-wire SetTriggerFn on syncer swap in proxy.SetSyncer #12744

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BPF] Defer kube-proxy construction until first hostIPs and in-sync#12667

[BPF] Defer kube-proxy construction until first hostIPs and in-sync#12667
tomastigera merged 1 commit into
projectcalico:masterfrom
tomastigera:tomas/gh-12192-np-broken-when-felix-restarts

tomastigera commented Apr 30, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tomastigera commented Apr 30, 2026

Description

Root cause

Fix

Test

Related issues/PRs

Release Note

Reminder for the reviewer

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants