Stale kernel pods #1027
-
We are using Enterprise Gateway with Kubernetes, and we are seeing stale kernel pods build up in our system.
However, as you can see, the EG itself was restarted just a few days back.
What's causing these stale kernel pods? When EG gets killed, doesn't it bring down the kernel pods it launched?
-
I think we need to figure out under what scenario these "leaks" are occurring.

Yes, when EG gets shut down (gracefully), it will attempt to cycle through the kernels it knows about and issue shutdown commands for those kernels. However, if the shutdown isn't long-lived enough, I suppose there could be some orphaned pods.

Another place to look for this is across kernel restarts, since a restart consists of shutting down the pod and starting a new one.

I suspect the kernel managers believe they are NOT tracking the kernels associated with these pods, and it's the pods themselves that are the issue.

Is culling enabled? If so, you should be able to monitor the EG logs for kernel activity, since each culling cycle (60 seconds by default) will produce a DEBUG entry for each kernel it knows about. If you see pods whose kernel ids are not in the logs for each culling poll cycle, that might help identify under what circumstances the leak was triggered.
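To make that cross-check concrete, here is a minimal sketch (an editorial addition, not part of the original reply) that compares the kernel ids labeled on running kernel pods against the kernel ids mentioned in an EG log file. The namespace name, the `component=kernel` and `kernel_id` labels, and the assumption that kernel ids are UUIDs are all placeholders; adjust them to your deployment.

```python
# Sketch: find kernel pods whose kernel id never appears in the EG log.
# Assumptions: kernel pods carry component=kernel and kernel_id labels,
# kernel ids are UUIDs, and the namespace below matches your deployment.
import re
import sys

from kubernetes import client, config  # pip install kubernetes


def kernel_ids_from_pods(namespace="enterprise-gateway"):
    """Collect kernel ids from pods labeled as EG kernel pods (assumed labels)."""
    config.load_kube_config()  # use load_incluster_config() when run inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector="component=kernel")
    return {p.metadata.labels.get("kernel_id") for p in pods.items}


def kernel_ids_from_log(log_path):
    """Collect every UUID-shaped kernel id mentioned anywhere in the EG log."""
    uuid_re = re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    )
    with open(log_path) as f:
        return set(uuid_re.findall(f.read()))


if __name__ == "__main__":
    pod_ids = kernel_ids_from_pods()
    log_ids = kernel_ids_from_log(sys.argv[1])
    orphans = pod_ids - log_ids
    print("Kernel pods unknown to EG (candidate orphans):", orphans or "none")
```

Running this against a log that spans at least one culling cycle should surface any pod whose kernel id EG is no longer tracking.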
-
Thanks a lot @kevin-bates, and sorry for the delay. Yes, we have culling enabled.
I think I understand how I can repro this issue consistently. We regularly do an EG helm upgrade. What I see is that the kernel pods which are still up, launched by the previous EG, don't go away, and the new EG obviously doesn't know about them to cull them. We are solving this by adding an extra script that kills all kernel namespaces after the helm upgrade; not a solution I am proud of, but it gets the job done.
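For reference, a post-upgrade cleanup along the lines described above might look like the sketch below, assuming a shared kernel namespace and a `component=kernel` label on kernel pods (both assumptions; match them to your helm values). If your deployment creates one namespace per kernel, the same idea applies using `list_namespace` / `delete_namespace` instead of the pod calls.

```python
# Sketch of a post-upgrade cleanup: delete any kernel pods left over from the
# previous EG release. The namespace and the component=kernel label are
# assumptions; adjust them to your deployment before using anything like this.
from kubernetes import client, config


def cleanup_orphaned_kernel_pods(namespace="enterprise-gateway"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector="component=kernel")
    for pod in pods.items:
        print(f"Deleting leftover kernel pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, namespace)


if __name__ == "__main__":
    cleanup_orphaned_kernel_pods()
```

A safer variant would first confirm the pod's kernel id is absent from the new EG's culling log (as in the earlier sketch) before deleting it.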