fix: [NPM] cleanup restarted pod stuck with no IP #1503
Conversation
- Failed pod, no pod IP:
  1. Update event: we are NOT enqueuing; we will clean it up.
  2. Delete event: we are handling this event; we will clean up the cached pod.
- Running pod, no pod IP (we are not enqueuing; we need to enqueue):
  1. New pod: we handle this and then do not requeue (but this is a behavior change and needs testing).
  2. Existing pod: delete the cached NPM pod; do not requeue unless an IP is present.

Next: for an update event with a new IP, add the pod. We just need to enqueue, and not requeue for an empty IP.
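The enqueue decision described above can be sketched as follows. This is a minimal illustration, not NPM's actual controller code; `podEvent` and `shouldEnqueue` are hypothetical names, and the real logic lives in the pod controller's event handlers.

```go
package main

import "fmt"

// podEvent is a simplified view of the pod state the controller sees.
type podEvent struct {
	Phase string // e.g. "Running", "Failed"
	PodIP string
}

// shouldEnqueue sketches the decision above: enqueue updates for Running
// pods even when the IP is empty, so stale state referencing the old IP
// can be cleaned up. Failed pods are handled by the delete path instead.
func shouldEnqueue(e podEvent) bool {
	if e.Phase == "Failed" {
		// delete events already clean up the cached pod
		return false
	}
	// Running pod: enqueue regardless of IP so the empty-IP case
	// triggers cleanup of the cached NPM pod.
	return e.Phase == "Running"
}

func main() {
	fmt.Println(shouldEnqueue(podEvent{Phase: "Running", PodIP: ""})) // true
	fmt.Println(shouldEnqueue(podEvent{Phase: "Failed", PodIP: ""}))  // false
}
```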
// 1. pod A previously had IP i and EP x
// 2. pod A restarts w/ no IP AND NPM restarts AND pod B comes up with the same IP i and EP y
// 3. controller processes an update event for pod A with IP i before the update event for pod B with IP i, so pod A is wrongly assigned to EP y
if endpoint.isStalePodKey(pod.PodKey) {
It's OK to move this check up here, out of the if-unspecified-pod-key condition, because it's impossible for a podKey to be assigned to the endpoint and also be stale (endpoints are only touched in this function and in refreshPodEndpoints).
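The stale-pod-key edge case from the diff comment above can be illustrated with a small sketch. The types and helper names here (`endpoint`, `assignIfFresh`) are simplified stand-ins, not NPM's real implementation; they only show why pod A's late update must not claim pod B's endpoint.

```go
package main

import "fmt"

// endpoint is an illustrative stand-in for NPM's Windows endpoint cache
// entry; field names mirror the diff above but the types are simplified.
type endpoint struct {
	id           string
	podKey       string
	stalePodKeys map[string]struct{}
}

// isStalePodKey reports whether this pod key used to own the endpoint's
// IP but has since been superseded (e.g. the pod restarted with no IP).
func (e *endpoint) isStalePodKey(podKey string) bool {
	_, stale := e.stalePodKeys[podKey]
	return stale
}

// assignIfFresh sketches the edge case: pod A's late update event
// carries IP i, but endpoint y now belongs to pod B, so A's stale key
// must not claim it.
func assignIfFresh(e *endpoint, podKey string) bool {
	if e.isStalePodKey(podKey) {
		return false // skip: this pod no longer owns the IP
	}
	e.podKey = podKey
	return true
}

func main() {
	epY := &endpoint{id: "y", stalePodKeys: map[string]struct{}{"default/pod-a": {}}}
	fmt.Println(assignIfFresh(epY, "default/pod-a")) // false: pod A is stale
	fmt.Println(assignIfFresh(epY, "default/pod-b")) // true: pod B owns IP i
}
```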
endpoint.podKey = unspecifiedPodKey

// remove all policies on the endpoint
if err := dp.policyMgr.ResetEndpoint(endpoint.id); err != nil {
IMO, it is safer to remove all policies on the endpoint.
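The "reset the entire endpoint" choice can be sketched as follows. This is a hypothetical simplification: `policyManager` and its map-based cache are illustrative stand-ins for NPM's policy manager, and the real `ResetEndpoint` talks to HNS rather than an in-memory map.

```go
package main

import "fmt"

// policyManager is a hypothetical stand-in for NPM's policy manager.
type policyManager struct {
	// appliedPolicies maps endpoint ID -> set of applied policy names
	appliedPolicies map[string]map[string]struct{}
}

// ResetEndpoint removes every policy from the endpoint instead of only
// the ones the cache believes are applied. This is the safer choice
// discussed above, since the cache may be stale after a pod restarts
// with no IP.
func (pm *policyManager) ResetEndpoint(endpointID string) error {
	delete(pm.appliedPolicies, endpointID)
	return nil
}

func main() {
	pm := &policyManager{appliedPolicies: map[string]map[string]struct{}{
		"ep-x": {"allow-dns": {}},
	}}
	if err := pm.ResetEndpoint("ep-x"); err != nil {
		fmt.Println("reset failed:", err)
	}
	fmt.Println(len(pm.appliedPolicies["ep-x"])) // 0
}
```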
@@ -340,10 +355,14 @@ func (c *PodController) syncAddedPod(podObj *corev1.Pod) error {
		podObj.Name, podObj.Spec.NodeName, podObj.Labels, podObj.Status.PodIP)

	if !util.IsIPV4(podObj.Status.PodIP) {
-		msg := fmt.Sprintf("[syncAddedPod] Error: ADD POD [%s/%s/%s/%+v/%s] failed as the PodIP is not valid ipv4 address", podObj.Namespace,
+		msg := fmt.Sprintf("[syncAddedPod] Error: ADD POD [%s/%s/%s/%+v] failed as the PodIP is not valid ipv4 address. ip: [%s]", podObj.Namespace,
Can we change the wording from "Error" to "Warning", and from "failed" to "ignored"?
npm/pkg/dataplane/types.go (outdated)

	PodKey          string
	PodIP           string
	NodeName        string
	MarkedForDelete bool
We can make this a lowercase (unexported) field and add an IsMarkedForDelete() function to check it, so that it doesn't accidentally get set in dp_win.go.
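The suggestion above can be sketched like this. The struct and method names are illustrative (based on the fields shown in the diff), not NPM's actual types: keeping the flag unexported means other packages cannot set it directly, and the accessor gives read-only visibility.

```go
package main

import "fmt"

// npmPod sketches the suggestion: keep the flag unexported so it cannot
// be set accidentally from outside the package (e.g. from dp_win.go),
// and expose a read-only accessor instead.
type npmPod struct {
	podKey          string
	podIP           string
	nodeName        string
	markedForDelete bool // unexported: only settable via markForDelete
}

// markForDelete is the single, intentional way to set the flag.
func (p *npmPod) markForDelete() { p.markedForDelete = true }

// IsMarkedForDelete exposes the flag read-only to callers.
func (p *npmPod) IsMarkedForDelete() bool { return p.markedForDelete }

func main() {
	p := &npmPod{podKey: "default/pod-a"}
	fmt.Println(p.IsMarkedForDelete()) // false
	p.markForDelete()
	fmt.Println(p.IsMarkedForDelete()) // true
}
```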
Please merge only after testing the code path.
The current design leads to a ~65% increase in controller workqueue updates. We should change the design, since this may have a significant memory impact.
This pull request is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.
There is no performance impact for pure Linux clusters. As discussed above, there is hardly any impact for clusters with Windows Server '22:

Experiments

Experiment 1: Uptime SLA for API Server
Results: there were 2 update-with-empty-ip events.

Experiment 2: No Uptime SLA and Heavier Pod Image
Results: there were no update-with-empty-ip events.
/azp run

Azure Pipelines successfully started running 2 pipeline(s).
Current test runs are successful, excluding failures verified to be unrelated to this change by searching for "warning: ADD POD", a log line that would be hit in the control flow related to this change.
* print statements
* cleanup Running pod with empty IP
* add log line
* revert previous 3 commits
* enqueue updates with empty IPs and add prometheus metric
* fix lints
* handle pod assigned to wrong endpoint edge case
* log and update comment
* UTs and fixed named port + build
* reset entire endpoint regardless of cache
* remove comment in dp.go
* fix windows build issues
* skip refreshing endpoints and address comments
* only sync empty ip if pod running. add tmp log
* undo special pod delete logic
* reference GH issue
* fix Windows UTs
* remove prometheus metrics and a log

Co-authored-by: Vamsi Kalapala <[email protected]>
Overview
On AKS Windows Server '22 nodes under memory pressure, pods may restart and enter a perpetual Error state, where the pod is stuck in Running status with no assigned IP.
If pod A is stuck in this Error state, we should clean up kernel state referencing the old IP.
Fix
Enqueue updates for Pods Running with no IP. By existing controller logic:
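The fix can be sketched end to end as follows. This is an illustrative simplification, not NPM's actual handler: `syncPodUpdate` and the string-map cache are hypothetical names standing in for the controller's cached NPM pod state.

```go
package main

import "fmt"

// syncPodUpdate sketches the fix: updates for Running pods with an empty
// IP are now enqueued, and the handler deletes the cached NPM pod so
// stale kernel state referencing the old IP gets cleaned up.
// cache maps pod key -> cached pod IP.
func syncPodUpdate(cache map[string]string, podKey, phase, podIP string) {
	if phase == "Running" && podIP == "" {
		// restarted pod stuck with no IP: drop the cached entry
		delete(cache, podKey)
		return
	}
	if podIP != "" {
		// normal path: cache (or refresh) the pod's IP
		cache[podKey] = podIP
	}
}

func main() {
	cache := map[string]string{"default/pod-a": "10.0.0.5"}
	syncPodUpdate(cache, "default/pod-a", "Running", "") // stuck pod: cleanup
	_, ok := cache["default/pod-a"]
	fmt.Println(ok) // false
}
```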
Other Issue
Will address #1729 in a separate PR.