[v3.32] [BPF] Defer kube-proxy construction until first hostIPs and in-sync #12694
Merged
tomastigera merged 1 commit on May 6, 2026
Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500 ms, the bootstrap window between kube-proxy starting up and receiving the first hostIPs.

In that window, kube-proxy.go:start() constructed the proxy with a stub Syncer (only podNPIP, no real host IPs). proxy.New() spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any (realHostIP, nodePort) FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run.

New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (nat_lookup.h:60-71), so it provides no safety net.

Defer proxy construction until both the first hostIPUpdates and the first hostMetadataUpdates have arrived, then construct the Syncer (with real host IPs) and the proxy in one shot via run(). The KubeProxy.Conntrack* callbacks already nil-check kp.syncer, so in-flight conntrack scans during the wait remain safe.

Add a regression test in kube-proxy_test.go that pre-populates the front map with a real-host-IP NodePort FE entry, fires only the host-metadata gate (mimicking the bootstrap window), and asserts via Consistently that the entry survives.

Closes projectcalico#12192
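The erase-on-apply behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the Felix cachingmap or Syncer code: `feKey`, `desiredFrontends`, `reconcile`, and the placeholder address `stubNPIP` are hypothetical names and values chosen only to show why a desired state built from the stub IP alone deletes entries left by the previous run.

```go
package main

import "fmt"

// feKey is a simplified, hypothetical stand-in for a NAT frontend (FE) key.
type feKey struct {
	IP   string
	Port uint16
}

// stubNPIP is a placeholder for the single pod-NodePort IP the stub syncer
// knows about. The value is illustrative only.
const stubNPIP = "192.0.2.1"

// desiredFrontends builds the desired FE key set for one NodePort from the
// node-local IPs known at apply time.
func desiredFrontends(nodeIPs []string, nodePort uint16) map[feKey]bool {
	desired := make(map[feKey]bool, len(nodeIPs))
	for _, ip := range nodeIPs {
		desired[feKey{IP: ip, Port: nodePort}] = true
	}
	return desired
}

// reconcile mimics a desired-state apply: every dataplane entry missing from
// desired is deleted, and every desired entry is written. It returns the
// deleted keys.
func reconcile(dataplane, desired map[feKey]bool) []feKey {
	var deleted []feKey
	for k := range dataplane {
		if !desired[k] {
			delete(dataplane, k)
			deleted = append(deleted, k)
		}
	}
	for k := range desired {
		dataplane[k] = true
	}
	return deleted
}

func main() {
	realKey := feKey{IP: "10.0.0.5", Port: 30080}

	// Early apply with only the stub IP: the real-host-IP entry is erased.
	dataplane := map[feKey]bool{realKey: true}
	fmt.Println("stub apply deleted:", reconcile(dataplane, desiredFrontends([]string{stubNPIP}, 30080)))

	// Deferred apply with real host IPs: the entry is part of desired state.
	dataplane = map[feKey]bool{realKey: true}
	fmt.Println("deferred apply deleted:", reconcile(dataplane, desiredFrontends([]string{stubNPIP, "10.0.0.5"}, 30080)))
}
```

In this model the only difference between the two applies is the set of IPs known when desired state is computed, which is why waiting for the first host IPs avoids the deletion.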
Pull request overview
This PR backports the eBPF kube-proxy bootstrap fix to release-v3.32, preventing transient NodePort connection failures during Felix restart by deferring proxy construction until both (a) the first host IPs update and (b) the host-metadata “in-sync” signal have been received.
Changes:
- Delay initial proxy construction until initial `hostIPUpdates` and `hostMetadataUpdates` are available, to avoid an early Apply against a syncer without real host IPs.
- Make `Stop()` tolerant of being called before `kp.proxy` is constructed.
- Add a regression test that pre-populates a real-host-IP NodePort FE entry and asserts it is not erased during the bootstrap window.
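The `Stop()` change is not shown in the excerpted diff, so the following is only a sketch of what a nil-guarded stop can look like once construction is deferred. The `kubeProxy`, `stopper`, and `fakeProxy` types and the use of `sync.Once` are assumptions made for illustration, not the code in felix/bpf/proxy/kube-proxy.go.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper is a hypothetical interface for the inner proxy.
type stopper interface{ Stop() }

// fakeProxy records whether Stop was called.
type fakeProxy struct{ stopped bool }

func (p *fakeProxy) Stop() { p.stopped = true }

// kubeProxy is a simplified stand-in: proxy stays nil until the deferred
// construction has happened.
type kubeProxy struct {
	lock     sync.Mutex
	proxy    stopper
	exiting  chan struct{}
	stopOnce sync.Once
}

func newKubeProxy() *kubeProxy {
	return &kubeProxy{exiting: make(chan struct{})}
}

// setProxy models the moment construction completes.
func (kp *kubeProxy) setProxy(p stopper) {
	kp.lock.Lock()
	defer kp.lock.Unlock()
	kp.proxy = p
}

// Stop signals shutdown and stops the inner proxy only if it exists, so it
// is safe to call during the bootstrap wait.
func (kp *kubeProxy) Stop() {
	kp.stopOnce.Do(func() { close(kp.exiting) })

	kp.lock.Lock()
	defer kp.lock.Unlock()
	if kp.proxy != nil {
		kp.proxy.Stop()
	}
}

func main() {
	// Stop before construction: must not panic on a nil proxy.
	early := newKubeProxy()
	early.Stop()
	fmt.Println("stopped before construction without panic")

	// Stop after construction: the inner proxy is stopped.
	p := &fakeProxy{}
	late := newKubeProxy()
	late.setProxy(p)
	late.Stop()
	fmt.Println("inner proxy stopped:", p.stopped)
}
```

Closing the `exiting` channel in the same call is what would let a goroutine blocked on the bootstrap gates return instead of waiting indefinitely.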
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| felix/bpf/proxy/kube-proxy.go | Defers initial proxy construction until initial host IP + host-metadata updates; adds nil-guard in Stop(). |
| felix/bpf/proxy/kube-proxy_test.go | Adds regression test for NodePort FE preservation during the bootstrap window; imports NAT helpers. |
Comment on lines 199 to 235

     func (kp *KubeProxy) start() error {
    -	var withLocalNP []net.IP
    -	if kp.ipFamily == 4 {
    -		withLocalNP = append(withLocalNP, podNPIP)
    -	} else {
    -		withLocalNP = append(withLocalNP, podNPIPV6)
    -	}
    -
    -	syncer, err := NewSyncer(kp.ipFamily, withLocalNP, kp.frontendMap, kp.backendMap, kp.MaglevMap, kp.affinityMap, kp.rt, kp.excludedCIDRs, kp.maglevLUTSize)
    -	if err != nil {
    -		return errors.WithMessage(err, "new bpf syncer")
    -	}
    -
    -	proxy, err := New(kp.k8s, syncer, kp.hostname, kp.opts...)
    -	if err != nil {
    -		return errors.WithMessage(err, "new proxy")
    +	// Block until we have the first batch of host IPs AND the
    +	// host-metadata in-sync signal (sent by CompleteDeferredWork after
    +	// the int_dataplane has finished its first datastore-in-sync
    +	// apply). Only then construct the proxy, via run(). proxy.New()
    +	// kicks off the k8s informer goroutines synchronously; once those
    +	// sync, they trigger Apply on the syncer. Constructing the proxy
    +	// before we have real host IPs lets that Apply run against a
    +	// syncer whose desired state lacks every (realHostIP, nodePort)
    +	// FE entry, which then erases pre-existing entries left by the
    +	// previous Felix run and breaks external NodePort traffic during
    +	// the kube-proxy bootstrap window. See projectcalico/calico#12192.
    +	var hostIPs []net.IP
    +	select {
    +	case ips, ok := <-kp.hostIPUpdates:
    +		if !ok {
    +			return nil
    +		}
    +		hostIPs = ips
    +	case <-kp.exiting:
    +		return nil
     	}

    -	kp.lock.Lock()
    -	kp.proxy = proxy
    -	kp.syncer = syncer
    -	kp.lock.Unlock()
    -
    -	// Wait for the initial update.
    -	hostIPs := <-kp.hostIPUpdates
    -
     	hostMetadata := make(map[string]*proto.HostMetadataV4V6Update)
     	// Block until we go in-sync and get the first batch of hostmetadata
     	// updates, to avoid a flap after a Felix restart. In practice, this
     	// recv should happen very soon after receiving the host IPs above.
    -	hostMetadataUpdates := <-kp.hostMetadataUpdates
    -	mergeHostMetadataV4V6Updates(hostMetadata, hostMetadataUpdates)
    +	select {
    +	case updates, ok := <-kp.hostMetadataUpdates:
    +		if !ok {
    +			return nil
    +		}
    +		mergeHostMetadataV4V6Updates(hostMetadata, updates)
    +	case <-kp.exiting:
    +		return nil
    +	}

    -	err = kp.run(hostIPs, hostMetadata)
    -	if err != nil {
    +	if err := kp.run(hostIPs, hostMetadata); err != nil {
     		return err
     	}
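The two-gate wait in the diff above can be exercised in isolation with a small runnable sketch. The helper names `recvOrExit` and `waitForGates`, and the simplified channel element types, are hypothetical; the real code receives `[]net.IP` and host-metadata updates and then calls `kp.run()`.

```go
package main

import "fmt"

// recvOrExit waits for one value from ch, or gives up when exiting is
// closed or ch itself is closed. The bool reports whether a value arrived.
func recvOrExit[T any](ch <-chan T, exiting <-chan struct{}) (T, bool) {
	var zero T
	select {
	case v, ok := <-ch:
		if !ok {
			return zero, false
		}
		return v, true
	case <-exiting:
		return zero, false
	}
}

// waitForGates blocks until both the first host IPs and the first metadata
// batch have arrived. It returns ok=false if shutdown wins at either gate,
// in which case the caller should construct nothing.
func waitForGates(
	hostIPs <-chan []string,
	metadata <-chan map[string]string,
	exiting <-chan struct{},
) ([]string, map[string]string, bool) {
	ips, ok := recvOrExit(hostIPs, exiting)
	if !ok {
		return nil, nil, false
	}
	md, ok := recvOrExit(metadata, exiting)
	if !ok {
		return nil, nil, false
	}
	return ips, md, true
}

func main() {
	hostIPs := make(chan []string, 1)
	metadata := make(chan map[string]string, 1)
	exiting := make(chan struct{})

	// Both gates fire: construction may proceed with real host IPs.
	hostIPs <- []string{"10.0.0.5"}
	metadata <- map[string]string{"node-a": "10.0.0.5"}
	ips, md, ok := waitForGates(hostIPs, metadata, exiting)
	fmt.Println(ips, md, ok)

	// Shutdown during the wait: the function returns instead of hanging.
	close(exiting)
	_, _, ok = waitForGates(hostIPs, metadata, exiting)
	fmt.Println(ok)
}
```

Compared with a bare channel receive, selecting on the exit channel at each gate means a stop request during the bootstrap window cannot leave the goroutine blocked.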
Description
Cherry-pick of #12667 to `release-v3.32`.

Felix restart on a node receiving NodePort traffic was breaking new TCP connections for ~500 ms, the bootstrap window between kube-proxy starting up and receiving the first hostIPs.

Root cause
In that window, `kube-proxy.go:start()` constructed the proxy with a stub `Syncer` (only `podNPIP`, no real host IPs). `proxy.New()` spins up the k8s informer goroutines synchronously; once they sync, an Apply runs against the stub Syncer. The cachingmap computes desired state without any `(realHostIP, nodePort)` FE entry and erases pre-existing real-host-IP NodePort FE entries left by the previous Felix run.

New external TCP connections to the NodePort during the gap miss the FE map, ascend to the host stack with no listener, and get RST by the kernel. The wildcard FE entry does not cover external→HEP NodePort traffic (`nat_lookup.h:60-71` returns NULL on `CALI_F_FROM_HEP` without `from_tun`), so it provides no safety net.

Fix
Defer proxy construction until both the first `hostIPUpdates` and the first `hostMetadataUpdates` have arrived, then construct the `Syncer` (with real host IPs) and the proxy in one shot via `run()`. The `KubeProxy.Conntrack*` callbacks already nil-check `kp.syncer`, so in-flight conntrack scans during the wait remain safe.

Test
Regression test in `kube-proxy_test.go` pre-populates the front map with a real-host-IP NodePort FE entry, fires only the host-metadata gate (mimicking the bootstrap window), and asserts via `Consistently` over 500 ms that the entry survives. The test fails on the buggy code (`Consistently` trips at ~125 ms) and passes with the fix.

End-to-end validated on a 3.31 BPF cluster with the equivalent fix; cherry-picks to 3.32 cleanly (3.32 already has #11817's host-metadata gate plumbing).
Related issues/PRs
Closes #12192. Cherry-pick of #12667.
Release Note
Reminder for the reviewer
- `release-note-required`
- `docs-not-required` (no user-facing config change; bug fix)