Skip to content

[BPF] Defer kube-proxy construction until first hostIPs and in-sync#12667

Merged
tomastigera merged 1 commit into
projectcalico:masterfrom
tomastigera:tomas/gh-12192-np-broken-when-felix-restarts
May 5, 2026
Merged

[BPF] Defer kube-proxy construction until first hostIPs and in-sync#12667
tomastigera merged 1 commit into
projectcalico:masterfrom
tomastigera:tomas/gh-12192-np-broken-when-felix-restarts

Conversation

@tomastigera
Copy link
Copy Markdown
Contributor

Description

Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500 ms — the bootstrap window between kube-proxy starting up and receiving the first hostIPs.

Root cause

In that window, kube-proxy.go:start() constructed the proxy with a stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any (realHostIP, nodePort) FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run.

New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (nat_lookup.h:60-71 returns NULL on CALI_F_FROM_HEP without from_tun), so it provides no safety net.

The race window opened up in 3.31 because PR #11762 added enough setup work between scanner construction and Start() (channel allocation, the new WorkloadRemoveScannerTCP goroutine, extra EntryScanner wiring) to flip the apply-ordering relative to the kube-proxy bootstrap. Earlier versions had the same latent race but didn't trip it.

Fix

Defer proxy construction until both the first hostIPUpdates and the first hostMetadataUpdates have arrived, then construct the Syncer (with real host IPs) and the proxy in one shot via run(). The KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so in-flight conntrack scans during the wait remain safe.

Test

New regression test in kube-proxy_test.go pre-populates the front map with a real-host-IP NodePort FE entry, fires only the host-metadata gate (mimicking the bootstrap window), and asserts via Consistently over 500 ms that the entry survives. The test fails on the buggy code (Consistently trips at ~125 ms) and passes with the fix.

Related issues/PRs

Closes #12192

Release Note

ebpf - Fix transient NodePort connection failures when Felix restarts on a node receiving external NodePort traffic.

Reminder for the reviewer

  • release-note-required
  • docs-not-required (no user-facing config change; bug fix)
  • cherry-pick-candidate (bug fix; consider backport to active release branches)

Felix restart on a node receiving NodePort traffic was breaking new
TCP connections for ~500ms — the bootstrap window between kube-proxy
starting up and receiving the first hostIPs.

In that window, kube-proxy.go:start() constructed the proxy with a
stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up
the k8s informer goroutines synchronously; once they sync, an Apply
runs against the stub Syncer. The cachingmap computes desired state
without any (realHostIP, nodePort) FE entry and erases pre-existing
real-host-IP NodePort FE entries left by the previous Felix run. New
external TCP connections to the NodePort during the gap miss the FE
map, ascend to the host stack with no listener, and get RST by the
kernel. The wildcard FE entry does not cover external→HEP NodePort
traffic (nat_lookup.h:60-71) so it provides no safety net.

Defer proxy construction until both the first hostIPUpdates and the
first hostMetadataUpdates have arrived, then construct the Syncer
(with real host IPs) and the proxy in one shot via run(). The
KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so
in-flight conntrack scans during the wait remain safe.

Add a regression test in kube-proxy_test.go that pre-populates the
front map with a real-host-IP NodePort FE entry, fires only the
host-metadata gate (mimicking the bootstrap window), and asserts via
Consistently that the entry survives.

Closes projectcalico#12192
@marvin-tigera marvin-tigera added this to the Calico v3.33.0 milestone Apr 30, 2026
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Apr 30, 2026
@tomastigera tomastigera removed the docs-pr-required Change is not yet documented label Apr 30, 2026
@marvin-tigera marvin-tigera added the docs-pr-required Change is not yet documented label Apr 30, 2026
@tomastigera tomastigera added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Apr 30, 2026
@marvin-tigera marvin-tigera added docs-pr-required Change is not yet documented and removed docs-not-required Docs not required for this change labels Apr 30, 2026
@tomastigera tomastigera marked this pull request as ready for review May 1, 2026 16:28
@tomastigera tomastigera requested a review from a team as a code owner May 1, 2026 16:28
Copilot AI review requested due to automatic review settings May 1, 2026 16:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a Felix/eBPF kube-proxy startup race that could briefly remove real-host-IP NodePort frontend (FE) entries during Felix restart, causing transient external NodePort connection failures until host IPs are known.

Changes:

  • Defer kube-proxy/proxy construction until both the first hostIPUpdates and the first hostMetadataUpdates have been received.
  • Update KubeProxy.run() to construct the proxy lazily on the first run() call (using real host IPs).
  • Add a regression test that ensures pre-existing real-host-IP NodePort FE entries are not erased during the bootstrap window.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
felix/bpf/proxy/kube-proxy.go Defers proxy construction until initial host IPs + host-metadata gate; adds nil-guard in Stop() and creates proxy in run() on first call.
felix/bpf/proxy/kube-proxy_test.go Adds regression test reproducing bootstrap-window FE erasure and asserting marker FE entry persists until host IPs are delivered.

Comment on lines 116 to 120
kp.lock.Lock()
defer kp.lock.Unlock()

close(kp.exiting)
close(kp.hostMetadataUpdates)
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop() holds kp.lock while it later waits on kp.wg.Wait(). If the update loop goroutine is in the middle of kp.run() (which also takes kp.lock) when Stop() is called, Stop can deadlock waiting for the goroutine to exit while the goroutine is blocked on the lock. Consider releasing kp.lock before waiting (e.g., take a local copy of kp.proxy under lock, unlock, then Stop proxy / waitgroup).

Copilot uses AI. Check for mistakes.
Comment on lines +122 to +126
if kp.proxy != nil {
// kp.proxy is nil if Stop is called before start() has
// received its initial host IPs and host-metadata updates.
kp.proxy.Stop()
}
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new nil-check avoids a panic when Stop() is called before start() constructs kp.proxy, but it also makes it possible for Stop() to return without ever stopping a proxy that gets created concurrently shortly after this check (e.g., start() has already received initial updates and is about to assign kp.proxy). Consider preventing proxy construction once exiting is closed, or include the start() goroutine in kp.wg so Stop() waits for it to finish, ensuring a constructed proxy is always stopped.

Copilot uses AI. Check for mistakes.
@tomastigera tomastigera added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels May 1, 2026
@tomastigera tomastigera changed the title [ebpf] Defer kube-proxy construction until first hostIPs and in-sync [BPF] Defer kube-proxy construction until first hostIPs and in-sync May 4, 2026
@tomastigera tomastigera requested a review from aaaaaaaalex May 4, 2026 17:51
@tomastigera tomastigera merged commit 769c9d8 into projectcalico:master May 5, 2026
19 checks passed
@tomastigera tomastigera deleted the tomas/gh-12192-np-broken-when-felix-restarts branch May 5, 2026 21:43
tomastigera added a commit that referenced this pull request May 11, 2026
proxy.New() wires the syncer's expand-NodePort fixup callback to the
proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer()
swaps in a fresh syncer (e.g. on a host-IP change inside
KubeProxy.run()) but did not re-wire that callback. Result: after any
host-IP change the new syncer's expand-NodePort fixup goroutine
resolves previously-missed routes for ExternalTrafficPolicy=Local
NodePort backends, but cannot schedule the dataplane Apply that would
program them — the fix-up is silently lost until something else
piggybacks an Apply.

Pointed out by review on the v3.31 backport (#12693) of #12667; the
asymmetry is pre-existing (it was equally broken before #12667
restructured run()) but became plain to read after the restructure.

Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer
under the same lock as the dpSyncer swap, mirroring what proxy.New()
does for the initial syncer.

Add a regression test that swaps a syncer via SetSyncer and verifies
the new syncer received a non-nil trigger callback that schedules an
Apply when invoked. The test fails on master before this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tomastigera added a commit that referenced this pull request May 11, 2026
Felix restart on a node receiving NodePort traffic was breaking new
TCP connections for ~500ms — the bootstrap window between kube-proxy
starting up and receiving the first hostIPs.

In that window, kube-proxy.go:start() constructed the proxy with a
stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up
the k8s informer goroutines synchronously; once they sync, an Apply
runs against the stub Syncer. The cachingmap computes desired state
without any (realHostIP, nodePort) FE entry and erases pre-existing
real-host-IP NodePort FE entries left by the previous Felix run. New
external TCP connections to the NodePort during the gap miss the FE
map, ascend to the host stack with no listener, and get RST by the
kernel. The wildcard FE entry does not cover external→HEP NodePort
traffic (nat_lookup.h:60-71) so it provides no safety net.

Defer proxy construction until the first hostIPUpdates has arrived,
then construct the Syncer (with real host IPs) and the proxy in one
shot via run(). The KubeProxy.Conntrack* callbacks already nil-check
kp.syncer, so in-flight conntrack scans during the wait remain safe.

Backport of #12667 adapted for release-v3.31. Master uses an explicit
"host-metadata in-sync" gate (added in #11817) that doesn't exist on
this branch, so the cherry-pick gates only on the first hostIPUpdate.
That fully fixes the stub-Syncer case (the reported bug). A residual
race remains for multi-host-IP nodes if hostIPUpdates fires with a
partial set before the iface monitor's initial resync completes; the
master fix's metadata gate closes that gap and is not present here.

Add a regression test in kube-proxy_test.go that pre-populates the
front map with a real-host-IP NodePort FE entry, starts kube-proxy
without firing OnHostIPsUpdate (mimicking the bootstrap window), and
asserts via Consistently that the entry survives.

Closes #12192

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tomastigera added a commit that referenced this pull request May 12, 2026
proxy.New() wires the syncer's expand-NodePort fixup callback to the
proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer()
swaps in a fresh syncer (e.g. on a host-IP change inside
KubeProxy.run()) but did not re-wire that callback. Result: after any
host-IP change the new syncer's expand-NodePort fixup goroutine
resolves previously-missed routes for ExternalTrafficPolicy=Local
NodePort backends, but cannot schedule the dataplane Apply that would
program them — the fix-up is silently lost until something else
piggybacks an Apply.

Pointed out by review on the v3.31 backport (#12693) of #12667; the
asymmetry is pre-existing (it was equally broken before #12667
restructured run()) but became plain to read after the restructure.

Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer
under the same lock as the dpSyncer swap, mirroring what proxy.New()
does for the initial syncer.

Add a regression test that swaps a syncer via SetSyncer and verifies
the new syncer received a non-nil trigger callback that schedules an
Apply when invoked. The test fails on master before this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tomastigera added a commit that referenced this pull request May 19, 2026
proxy.New() wires the syncer's expand-NodePort fixup callback to the
proxy's runner via dp.SetTriggerFn(p.runner.Run). proxy.SetSyncer()
swaps in a fresh syncer (e.g. on a host-IP change inside
KubeProxy.run()) but did not re-wire that callback. Result: after any
host-IP change the new syncer's expand-NodePort fixup goroutine
resolves previously-missed routes for ExternalTrafficPolicy=Local
NodePort backends, but cannot schedule the dataplane Apply that would
program them — the fix-up is silently lost until something else
piggybacks an Apply.

Pointed out by review on the v3.31 backport (#12693) of #12667; the
asymmetry is pre-existing (it was equally broken before #12667
restructured run()) but became plain to read after the restructure.

Make SetSyncer call s.SetTriggerFn(p.runner.Run) on the new syncer
under the same lock as the dpSyncer swap, mirroring what proxy.New()
does for the initial syncer.

Add a regression test that swaps a syncer via SetSyncer and verifies
the new syncer received a non-nil trigger callback that schedules an
Apply when invoked. The test fails on master before this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick-candidate docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bpf - connection to service refused on calico-node pod startup

4 participants