Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: [NPM] cleanup restarted pod stuck with no IP #1503

Merged
merged 22 commits into from
Feb 15, 2023

Conversation

huntergregory
Copy link
Contributor

@huntergregory huntergregory commented Aug 1, 2022

Overview

In AKS Windows Server '22 nodes with memory pressure, pods may restart and enter a perpetual Error state, where the pod is stuck in Running status with no assigned IP.

If pod A is stuck in this Error state, we should clean up kernel state referencing the old IP.

Fix

Enqueue updates for Pods Running with no IP. By existing controller logic:

  1. Old Pod state will be cleaned up from the Data Plane.
  2. The new Pod state will be ignored (and not requeue).

Other Issue

Will address #1729 in a separate PR.

@huntergregory huntergregory added the npm Related to NPM. label Aug 1, 2022
@huntergregory huntergregory changed the title fix: [NPM] update controller to handle failing pod that never receives another IP fix: [NPM] cleanup running pod stuck with empty IP Aug 16, 2022
@huntergregory huntergregory marked this pull request as ready for review August 16, 2022 20:33
@huntergregory huntergregory requested a review from a team as a code owner August 16, 2022 20:33
@huntergregory huntergregory requested review from vakalapa and removed request for a team August 16, 2022 20:33
@huntergregory huntergregory changed the title fix: [NPM] cleanup running pod stuck with empty IP fix: [NPM] cleanup restarted pod stuck with no IP Aug 16, 2022
Copy link
Contributor

@vakalapa vakalapa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Failed Pod ---- No Pod IP
        1. Update event
            we ARE NOT enqueuing

        we will clean it up
    2. Delete event
        we are handling this event ! we will clean up the cached pod

  1. Running pod --- No Pod IP ( We are not enqueuing.. need to enqueue)
        1. New POD
            we handle this, and then do not requeue (but this is behavior change needs TESTING)

    2. Existing POD
        delete cached NPM pod,

        do not requeue. unless IP is present

        Next:
            update event with new IP: add pod

we just need to enqueue .... and not requeue for empty IP

// 1. pod A previously had IP i and EP x
// 2. pod A restarts w/ no ip AND NPM restarts AND pod B comes up with the same IP i and EP y
// 3. controller processes an update event for pod A with IP i before the update event for pod B with IP i, so pod A is wrongly assigned to EP y
if endpoint.isStalePodKey(pod.PodKey) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to move this up here out of the if-unspecified-pod-key condition because it's impossible to have a podKey be assigned to the endpoint and be stale (endpoints only touched in this function and refreshPodEndpoints)

endpoint.podKey = unspecifiedPodKey

// remove all policies on the endpoint
if err := dp.policyMgr.ResetEndpoint(endpoint.id); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, safer to remove all policies on the endpoint

@@ -340,10 +355,14 @@ func (c *PodController) syncAddedPod(podObj *corev1.Pod) error {
podObj.Name, podObj.Spec.NodeName, podObj.Labels, podObj.Status.PodIP)

if !util.IsIPV4(podObj.Status.PodIP) {
msg := fmt.Sprintf("[syncAddedPod] Error: ADD POD [%s/%s/%s/%+v/%s] failed as the PodIP is not valid ipv4 address", podObj.Namespace,
msg := fmt.Sprintf("[syncAddedPod] Error: ADD POD [%s/%s/%s/%+v] failed as the PodIP is not valid ipv4 address. ip: [%s]", podObj.Namespace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change the wording from error to Warning, failed to ignored ?

PodKey string
PodIP string
NodeName string
MarkedForDelete bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make this a lower case variable and have a function to check IsMarkedForDelete() so that accidently not get set in dp_win.go

vakalapa
vakalapa previously approved these changes Aug 19, 2022
Copy link
Contributor

@vakalapa vakalapa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge only after testing the code path.

@huntergregory
Copy link
Contributor Author

Current design leads to ~65% increase in controller workqueue updates. Should change design since this may have a significant memory impact.

@huntergregory
Copy link
Contributor Author

Current design leads to ~65% increase in controller workqueue updates. Should change design since this may have a significant memory impact.

When queuing for Running Status only, there are only 4 update-with-empty-ip events for 736 regular update events (in a windows conformance run).

@github-actions
Copy link

github-actions bot commented Dec 6, 2022

This pull request is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Dec 6, 2022
@huntergregory huntergregory force-pushed the npm-controller-empty-ip branch 2 times, most recently from d37de87 to a065992 Compare December 15, 2022 22:27
@github-actions github-actions bot removed the stale Stale due to inactivity. label Dec 16, 2022
@huntergregory
Copy link
Contributor Author

There is no performance impact for pure Linux clusters.

As discussed above there is hardly an impact for clusters with Windows Server '22:

There are only 4 update-with-empty-ip events for 736 regular update events

Experiments

Steps:

  1. Create deployment with 1 replica.
  2. Scale to 2k replicas.
  3. After a while, delete all 2k Pods. This causes 2k new Pods to be created too.

Experiment 1: Uptime SLA for API Server

  • Cluster: az aks create -g $rg -n $cluster --network-plugin azure --max-pods 250 -c 16 --uptime-sla
  • Pod Image: k8s.gcr.io/pause:3.2
    • This Pod doesn't have any memory/CPU overhead.

Results

There were 2 update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 9
npm_controller_pod_event_total{operation="update"} 16025
npm_controller_pod_event_total{operation="update-with-empty-ip"} 2

Experiment 2: No Uptime SLA and Heavier Pod Image

  • Cluster downgraded: az aks update -g $rg -n $cluster --no-uptime-sla
  • Pod Image: k8s.gcr.io/e2e-test-images/agnhost:2.33
    • Command: /agnhost serve-hostname --tcp --http=false --port "80"

Results

There were no update-with-empty-ip events.

npm_controller_pod_event_total{operation="create"} 10
npm_controller_pod_event_total{operation="update"} 21391

@huntergregory
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@huntergregory
Copy link
Contributor Author

current test runs are successful excluding:

  1. flake in conf stress (verified unrelated to the change)
  2. HNS-related errors in windows cyc and conf

verified unrelated to the change by searching for "warning: ADD POD", a log line that would be hit in the control flow related to this change

@vakalapa vakalapa merged commit 09cd371 into master Feb 15, 2023
@vakalapa vakalapa deleted the npm-controller-empty-ip branch February 15, 2023 21:38
rjdenney pushed a commit that referenced this pull request Mar 13, 2023
* print statements

* cleanup Running pod with empty IP

* add log line

* revert previous 3 commits

* enqueue updates with empty IPs and add prometheus metric

* fix lints

* handle pod assigned to wrong endpoint edge case

* log and update comment

* UTs and fixed named port + build

* reset entire endpoint regardless of cache

* remove comment in dp.go

* fix windows build issues

* skip refreshing endpoints and address comments

* only sync empty ip if pod running. add tmp log

* undo special pod delete logic

* reference GH issue

* fix Windows UTs

* remove prometheus metrics and a log

---------

Co-authored-by: Vamsi Kalapala <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
npm Related to NPM.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants