Pods stuck in terminating state #1357

@Setomidor

Description

This is probably not an issue for everyone, but I wanted to leave a note here in case other people are stuck with the same problem.

Pods with SGX support were stuck in the Terminating state for a long time. The problem was traced to the SGX webhook:

kubectl -n kube-sgx logs -f sgx-webhook-webhook-5444cff965-cn4hz
I0317 07:11:31.984493       1 server.go:149] controller-runtime/webhook "msg"="Registering webhook" "path"="/pods-sgx"
I0317 07:11:31.984684       1 main.go:60] setup "msg"="starting manager"
I0317 07:11:31.985081       1 server.go:217] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0317 07:11:31.985521       1 certwatcher.go:131] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0317 07:11:31.985724       1 certwatcher.go:85] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0317 07:11:31.986040       1 server.go:271] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=9443

2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:26292: EOF
2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:49324: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:59980: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:42953: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:34228: read tcp 10.233.190.251:9443->10.233.240.0:34228: read: connection reset by peer
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:38431: EOF
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:36956: EOF
2023/03/17 07:12:35 http: TLS handshake error from 10.233.240.0:11239: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:27806: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:3522: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:2116: EOF

It seems the webhook was blocking cleanup of the pods, and they would stay stuck for days. Undeploying the webhook released the stuck pods immediately (see the command below).
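
For reference, undeploying in our case meant deleting the MutatingWebhookConfiguration (resource name from our deployment, shown in full below; adjust to match yours):

kubectl delete mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration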

Changing the MutatingWebhookConfiguration to act only on CREATE and not UPDATE resolved the issue. The webhook configuration currently working for us is:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    cert-manager.io/inject-ca-from: kube-sgx/sgx-webhook-serving-cert
  name: sgx-webhook-mutating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: sgx-webhook-service
      namespace: kube-sgx
      path: /pods-sgx
  failurePolicy: Ignore
  name: sgx.mutator.webhooks.intel.com
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
  sideEffects: None
  timeoutSeconds: 10

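If you prefer to patch an existing configuration in place rather than redeploying it, a JSON patch along these lines should work (a sketch assuming the webhook is the first entry in the webhooks list, as in the configuration above):

kubectl patch mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/rules/0/operations", "value": ["CREATE"]}]'
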
Feel free to close this issue immediately. :)
