Pods stuck in terminating state #1357

@Setomidor

Description

This is probably not an issue for everyone, but I wanted to leave a note here in case other people are stuck with the same problem.

Pods with SGX support were stuck in the Terminating state for a long time. The problem was traced to the SGX webhook:

kubectl -n kube-sgx logs -f sgx-webhook-webhook-5444cff965-cn4hz
I0317 07:11:31.984493       1 server.go:149] controller-runtime/webhook "msg"="Registering webhook" "path"="/pods-sgx"
I0317 07:11:31.984684       1 main.go:60] setup "msg"="starting manager"
I0317 07:11:31.985081       1 server.go:217] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0317 07:11:31.985521       1 certwatcher.go:131] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0317 07:11:31.985724       1 certwatcher.go:85] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0317 07:11:31.986040       1 server.go:271] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=9443

2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:26292: EOF
2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:49324: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:59980: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:42953: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:34228: read tcp 10.233.190.251:9443->10.233.240.0:34228: read: connection reset by peer
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:38431: EOF
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:36956: EOF
2023/03/17 07:12:35 http: TLS handshake error from 10.233.240.0:11239: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:27806: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:3522: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:2116: EOF

It seems the webhook was blocking cleanup of the pods, and they would stay stuck for days. Undeploying the webhook released the stuck pods immediately (see the command below).
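
For reference, undeploying in our case meant deleting the MutatingWebhookConfiguration (resource name from our deployment, shown in full below; adjust to match yours):

kubectl delete mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration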

Changing the MutatingWebhookConfiguration to act only on CREATE and not UPDATE resolved the issue. The webhook configuration currently working for us is:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    cert-manager.io/inject-ca-from: kube-sgx/sgx-webhook-serving-cert
  name: sgx-webhook-mutating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: sgx-webhook-service
      namespace: kube-sgx
      path: /pods-sgx
  failurePolicy: Ignore
  name: sgx.mutator.webhooks.intel.com
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
  sideEffects: None
  timeoutSeconds: 10

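If you prefer to patch an existing configuration in place rather than redeploying it, a JSON patch along these lines should work (a sketch assuming the webhook is the first entry in the webhooks list, as in the configuration above):

kubectl patch mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/rules/0/operations", "value": ["CREATE"]}]'
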
Feel free to close this issue immediately. :)
