fix: [NPM] cleanup restarted pod stuck with no IP #1503

huntergregory · 2022-08-01T21:50:49Z

Overview

On AKS Windows Server '22 nodes under memory pressure, Pods may restart and enter a perpetual Error state in which the Pod is stuck in Running status with no assigned IP.

If pod A is stuck in this Error state, we should clean up kernel state referencing the old IP.

Fix

Enqueue updates for Pods that are Running with no IP. By existing controller logic:

- Old Pod state will be cleaned up from the data plane.
- The new Pod state will be ignored (and not requeued).
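
A minimal sketch of the check this fix relies on, not the PR's actual code. The helper name `isRunningWithEmptyIP` and its plain-string parameters are hypothetical; the real controller works with `corev1.Pod` objects and applies further checks.

```go
package main

// phaseRunning mirrors the value of corev1.PodRunning so that this sketch
// has no Kubernetes dependency.
const phaseRunning = "Running"

// isRunningWithEmptyIP is a hypothetical helper. It reports whether an update
// event describes a Pod that is Running but has no assigned IP, which is the
// case this PR starts enqueuing so that state for the old IP gets cleaned up.
func isRunningWithEmptyIP(phase, podIP string) bool {
	return phase == phaseRunning && podIP == ""
}
```

Restricting the check to the Running phase keeps the number of extra workqueue items small, since other phases routinely report an empty IP.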

Other Issue

Will address #1729 in a separate PR.

vakalapa

Failed Pod, no Pod IP
1. Update event
   We are NOT enqueuing.
   We will clean it up.
2. Delete event
   We are handling this event: we will clean up the cached Pod.

Running Pod, no Pod IP (we are not enqueuing; we need to enqueue)
1. New Pod
   We handle this and then do not requeue (this is a behavior change and needs TESTING).
2. Existing Pod
   Delete the cached NPM Pod.
   Do not requeue unless an IP is present.

Next:
Update event with new IP: add Pod.

We just need to enqueue, and not requeue for an empty IP.
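
The Running-Pod cases above can be sketched as follows. This is an illustration under assumptions, not the controller's code: `podCache`, `cachedPod`, and `syncRunningPod` are hypothetical names, and the returned strings only describe the action taken.

```go
package main

// cachedPod is a hypothetical stand-in for the controller's cached Pod entry.
type cachedPod struct {
	ip string
}

// podCache is a hypothetical stand-in for the controller's Pod cache,
// keyed by "namespace/name".
type podCache map[string]cachedPod

// syncRunningPod handles one dequeued event for a Running Pod and returns a
// description of the action taken. It never reports an error for an empty IP,
// so such an event is never requeued.
func (c podCache) syncRunningPod(key, ip string) string {
	_, cached := c[key]
	if ip == "" {
		if cached {
			// Existing Pod: drop the cached entry so state for the old IP is cleaned up.
			delete(c, key)
			return "deleted cached pod"
		}
		// New Pod: nothing to clean up yet.
		return "ignored"
	}
	// A later update event with an IP adds (or refreshes) the Pod.
	c[key] = cachedPod{ip: ip}
	if cached {
		return "updated pod"
	}
	return "added pod"
}
```

With an empty cache, an event for a key with no IP is ignored; once the Pod has been added with an IP, a later empty-IP event deletes the cached entry.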

huntergregory · 2022-08-18T23:30:24Z

npm/pkg/dataplane/dataplane_windows.go

+	// 1. pod A previously had IP i and EP x
+	// 2. pod A restarts w/ no ip AND NPM restarts AND pod B comes up with the same IP i and EP y
+	// 3. controller processes an update event for pod A with IP i before the update event for pod B with IP i, so pod A is wrongly assigned to EP y
+	if endpoint.isStalePodKey(pod.PodKey) {


OK to move this up, out of the if-unspecified-pod-key condition, because a podKey cannot be both assigned to the endpoint and stale (endpoints are only touched in this function and in refreshPodEndpoints).
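
One possible reading of the stale-key check, sketched under assumptions. Only `isStalePodKey`, `podKey`, and `unspecifiedPodKey` appear in the diff; the `stalePodKeys` and `policies` fields, the `claim` method, and the value of `unspecifiedPodKey` are hypothetical, and clearing `policies` stands in for the `dp.policyMgr.ResetEndpoint(endpoint.id)` call.

```go
package main

// unspecifiedPodKey marks an endpoint with no assigned Pod. The identifier is
// from the diff; the empty-string value is an assumption.
const unspecifiedPodKey = ""

// endpoint is a simplified, hypothetical version of the dataplane's record.
type endpoint struct {
	id           string
	podKey       string
	policies     []string            // assumed: policies applied to the endpoint
	stalePodKeys map[string]struct{} // assumed: Pods evicted from this endpoint
}

// isStalePodKey reports whether podKey was previously evicted from this endpoint.
func (e *endpoint) isStalePodKey(podKey string) bool {
	_, ok := e.stalePodKeys[podKey]
	return ok
}

// claim assigns podKey to the endpoint and reports whether the update was applied.
func (e *endpoint) claim(podKey string) bool {
	if e.isStalePodKey(podKey) {
		return false // ignore updates from a Pod that no longer owns this endpoint
	}
	if e.podKey != unspecifiedPodKey && e.podKey != podKey {
		// Another Pod was (wrongly) assigned earlier: mark it stale and reset.
		if e.stalePodKeys == nil {
			e.stalePodKeys = make(map[string]struct{})
		}
		e.stalePodKeys[e.podKey] = struct{}{}
		e.podKey = unspecifiedPodKey
		e.policies = nil // stand-in for resetting all policies on the endpoint
	}
	e.podKey = podKey
	return true
}
```

In the scenario from the code comment, pod A claims EP y first, pod B's update then evicts it and resets the endpoint, and any later update for pod A is ignored.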

huntergregory · 2022-08-18T23:39:02Z

npm/pkg/dataplane/dataplane_windows.go

+		endpoint.podKey = unspecifiedPodKey
+
+		// remove all policies on the endpoint
+		if err := dp.policyMgr.ResetEndpoint(endpoint.id); err != nil {


imo, safer to remove all policies on the endpoint

vakalapa · 2022-08-19T17:29:21Z

npm/pkg/controlplane/controllers/v2/podController.go

@@ -340,10 +355,14 @@ func (c *PodController) syncAddedPod(podObj *corev1.Pod) error {
 		podObj.Name, podObj.Spec.NodeName, podObj.Labels, podObj.Status.PodIP)

 	if !util.IsIPV4(podObj.Status.PodIP) {
-		msg := fmt.Sprintf("[syncAddedPod] Error: ADD POD  [%s/%s/%s/%+v/%s] failed as the PodIP is not valid ipv4 address", podObj.Namespace,
+		msg := fmt.Sprintf("[syncAddedPod] Error: ADD POD  [%s/%s/%s/%+v] failed as the PodIP is not valid ipv4 address. ip: [%s]", podObj.Namespace,


Can we change the wording from "Error" to "Warning", and from "failed" to "ignored"?

vakalapa · 2022-08-19T17:52:37Z

npm/pkg/dataplane/types.go

+	PodKey          string
+	PodIP           string
+	NodeName        string
+	MarkedForDelete bool


We can make this a lowercase (unexported) field and add an IsMarkedForDelete() method to check it, so that it does not accidentally get set in dp_win.go.
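
A sketch of what this suggestion could look like. The struct name `PodMetadata` and the constructor `NewPodMetadataMarkedForDelete` are assumptions for illustration; only the four field names come from the diff.

```go
package main

// PodMetadata carries the fields shown in the diff. In this sketch the
// delete marker is unexported, following the review suggestion.
type PodMetadata struct {
	PodKey          string
	PodIP           string
	NodeName        string
	markedForDelete bool
}

// NewPodMetadata returns metadata for a Pod that is not marked for deletion.
func NewPodMetadata(podKey, podIP, nodeName string) *PodMetadata {
	return &PodMetadata{PodKey: podKey, PodIP: podIP, NodeName: nodeName}
}

// NewPodMetadataMarkedForDelete is a hypothetical constructor: it is the only
// place in this sketch where the marker is set.
func NewPodMetadataMarkedForDelete(podKey, podIP, nodeName string) *PodMetadata {
	pm := NewPodMetadata(podKey, podIP, nodeName)
	pm.markedForDelete = true
	return pm
}

// IsMarkedForDelete reports whether the Pod was marked for deletion.
func (p *PodMetadata) IsMarkedForDelete() bool {
	return p.markedForDelete
}
```

Go enforces unexported access at package boundaries only, so this guards callers in other packages; code in the same package as the type can still write the field directly.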

vakalapa

Please merge only after testing the code path.

huntergregory · 2022-08-22T19:47:25Z

Current design leads to ~65% increase in controller workqueue updates. Should change design since this may have a significant memory impact.

huntergregory · 2022-08-24T18:49:35Z

> Current design leads to ~65% increase in controller workqueue updates. Should change design since this may have a significant memory impact.

When enqueuing empty-IP updates only for Pods in Running status, there are just 4 update-with-empty-ip events for 736 regular update events (in a Windows conformance run).

github-actions · 2022-12-06T00:01:12Z

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

huntergregory · 2022-12-16T22:01:02Z

There is no performance impact for pure Linux clusters.

As discussed above, there is hardly any impact for clusters with Windows Server '22:

There are only 4 update-with-empty-ip events for 736 regular update events

Experiments

Steps:

1. Create a deployment with 1 replica.
2. Scale to 2k replicas.
3. After a while, delete all 2k Pods. This causes 2k new Pods to be created too.

Experiment 1: Uptime SLA for API Server

Cluster: az aks create -g $rg -n $cluster --network-plugin azure --max-pods 250 -c 16 --uptime-sla
Pod Image: k8s.gcr.io/pause:3.2
- This Pod doesn't have any memory/CPU overhead.

Results

There were 2 update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 9
npm_controller_pod_event_total{operation="update"} 16025
npm_controller_pod_event_total{operation="update-with-empty-ip"} 2

Experiment 2: No Uptime SLA and Heavier Pod Image

Cluster downgraded: az aks update -g $rg -n $cluster --no-uptime-sla
Pod Image: k8s.gcr.io/e2e-test-images/agnhost:2.33
- Command: /agnhost serve-hostname --tcp --http=false --port "80"

Results

There were no update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 10
npm_controller_pod_event_total{operation="update"} 21391
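
To put the counts above in proportion, here is a small sketch that computes update-with-empty-ip events as a percentage of all update events. The helper name `emptyIPShare` is hypothetical and not part of NPM. For the conformance run (4 vs. 736) this comes to roughly 0.54%, and for Experiment 1 (2 vs. 16025) roughly 0.012%.

```go
package main

// emptyIPShare returns update-with-empty-ip events as a percentage of all
// update events (empty-IP plus regular). It returns 0 when there are no events.
func emptyIPShare(emptyIPUpdates, regularUpdates int) float64 {
	total := emptyIPUpdates + regularUpdates
	if total == 0 {
		return 0
	}
	return 100 * float64(emptyIPUpdates) / float64(total)
}
```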

huntergregory · 2023-01-03T20:12:49Z

/azp run

azure-pipelines · 2023-01-03T20:13:06Z

Azure Pipelines successfully started running 2 pipeline(s).

huntergregory · 2023-02-14T18:30:23Z

Current test runs are successful, excluding:

- a flake in conf stress (verified unrelated to the change)
- HNS-related errors in windows cyc and conf

Verified unrelated to the change by searching for "warning: ADD POD", a log line that would be hit in the control flow related to this change.

* print statements
* cleanup Running pod with empty IP
* add log line
* revert previous 3 commits
* enqueue updates with empty IPs and add prometheus metric
* fix lints
* handle pod assigned to wrong endpoint edge case
* log and update comment
* UTs and fixed named port + build
* reset entire endpoint regardless of cache
* remove comment in dp.go
* fix windows build issues
* skip refreshing endpoints and address comments
* only sync empty ip if pod running. add tmp log
* undo special pod delete logic
* reference GH issue
* fix Windows UTs
* remove prometheus metrics and a log

---------

Co-authored-by: Vamsi Kalapala <[email protected]>

huntergregory added the npm Related to NPM. label Aug 1, 2022

huntergregory changed the title ~~fix: [NPM] update controller to handle failing pod that never receives another IP~~ fix: [NPM] cleanup running pod stuck with empty IP Aug 16, 2022

huntergregory marked this pull request as ready for review August 16, 2022 20:33

huntergregory requested a review from a team as a code owner August 16, 2022 20:33

huntergregory requested review from vakalapa and removed request for a team August 16, 2022 20:33

huntergregory changed the title ~~fix: [NPM] cleanup running pod stuck with empty IP~~ fix: [NPM] cleanup restarted pod stuck with no IP Aug 16, 2022

vakalapa requested changes Aug 16, 2022

huntergregory commented Aug 18, 2022

vakalapa reviewed Aug 19, 2022

huntergregory added 12 commits August 19, 2022 11:08

print statements

6f0e357

cleanup Running pod with empty IP

c8b13a6

add log line

f023473

revert previous 3 commits

d66c7f2

enqueue updates with empty IPs and add prometheus metric

56b7994

fix lints

66d3cc8

handle pod assigned to wrong endpoint edge case

b0c5bc0

log and update comment

5e708ce

UTs and fixed named port + build

f384257

reset entire endpoint regardless of cache

094069f

remove comment in dp.go

d154623

fix windows build issues

31ac386

huntergregory force-pushed the npm-controller-empty-ip branch from 174c2c9 to 31ac386 Compare August 19, 2022 18:08

skip refreshing endpoints and address comments

b6287cf

vakalapa previously approved these changes Aug 19, 2022

huntergregory added the do-not-merge label Aug 22, 2022

huntergregory dismissed vakalapa’s stale review via 8f4fd49 August 22, 2022 22:55

only sync empty ip if pod running. add tmp log

8f4fd49

github-actions bot added the stale Stale due to inactivity. label Dec 6, 2022

Merge branch 'master' into npm-controller-empty-ip

a1dac4f

huntergregory force-pushed the npm-controller-empty-ip branch 2 times, most recently from d37de87 to a065992 Compare December 15, 2022 22:27

undo special pod delete logic

398c99e

huntergregory force-pushed the npm-controller-empty-ip branch from a065992 to 398c99e Compare December 15, 2022 22:36

github-actions bot removed the stale Stale due to inactivity. label Dec 16, 2022

huntergregory added 2 commits December 16, 2022 14:02

reference GH issue

56f64bb

fix Windows UTs

05ab431

huntergregory force-pushed the npm-controller-empty-ip branch from 2aaa022 to beb6327 Compare December 16, 2022 22:19

remove prometheus metrics and a log

d76ae6a

huntergregory force-pushed the npm-controller-empty-ip branch from beb6327 to d76ae6a Compare December 16, 2022 22:21

vakalapa approved these changes Jan 5, 2023

Merge branch 'master' into npm-controller-empty-ip

aafa1f9

huntergregory removed the do-not-merge label Jan 20, 2023

Merge branch 'master' into npm-controller-empty-ip

d7ab36d

Merge branch 'master' into npm-controller-empty-ip

edf7c96

vakalapa merged commit 09cd371 into master Feb 15, 2023

vakalapa deleted the npm-controller-empty-ip branch February 15, 2023 21:38

huntergregory mentioned this pull request Feb 16, 2023

fix: [NPM-WIN] ability to reassign Pod associated with Endpoint #1806

Merged
