This is probably not an issue for everyone, but I wanted to leave a note here in case other people are stuck with the same problem.
The problem was that pods with SGX support were stuck in the Terminating state for a long time. I tracked it down to the SGX webhook:
kubectl -n kube-sgx logs -f sgx-webhook-webhook-5444cff965-cn4hz
I0317 07:11:31.984493 1 server.go:149] controller-runtime/webhook "msg"="Registering webhook" "path"="/pods-sgx"
I0317 07:11:31.984684 1 main.go:60] setup "msg"="starting manager"
I0317 07:11:31.985081 1 server.go:217] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0317 07:11:31.985521 1 certwatcher.go:131] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0317 07:11:31.985724 1 certwatcher.go:85] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0317 07:11:31.986040 1 server.go:271] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=9443
2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:26292: EOF
2023/03/17 07:12:32 http: TLS handshake error from 10.233.240.0:49324: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:59980: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:42953: EOF
2023/03/17 07:12:33 http: TLS handshake error from 10.233.240.0:34228: read tcp 10.233.190.251:9443->10.233.240.0:34228: read: connection reset by peer
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:38431: EOF
2023/03/17 07:12:34 http: TLS handshake error from 10.233.240.0:36956: EOF
2023/03/17 07:12:35 http: TLS handshake error from 10.233.240.0:11239: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:27806: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:3522: EOF
2023/03/17 07:12:37 http: TLS handshake error from 10.233.240.0:2116: EOF
It seems the webhook was blocking the cleanup of the pods, and they would stay stuck for days. Undeploying the webhook released the stuck pods immediately; a rough sketch of that workaround is shown below.
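One way to undeploy the webhook temporarily is to delete the MutatingWebhookConfiguration itself (this is only a sketch; the configuration name comes from the manifest below, and if the webhook is deployed via Helm or an operator you may prefer to uninstall it that way instead):
# List the registered mutating webhooks to find the SGX one
kubectl get mutatingwebhookconfigurations
# Deleting the configuration stops the API server from calling the webhook,
# which lets the stuck pod deletions go through
kubectl delete mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration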
Changing the MutatingWebhookConfiguration to only act on CREATE and not UPDATE resolved the issue. The current working configuration of the webhook for us is:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    cert-manager.io/inject-ca-from: kube-sgx/sgx-webhook-serving-cert
  name: sgx-webhook-mutating-webhook-configuration
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: sgx-webhook-service
      namespace: kube-sgx
      path: /pods-sgx
  failurePolicy: Ignore
  name: sgx.mutator.webhooks.intel.com
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
  sideEffects: None
  timeoutSeconds: 10
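If you do not want to re-apply the whole manifest, the same change can presumably be made in place with a JSON patch against the deployed configuration (untested sketch, using the configuration name from the manifest above):
# Restrict the first rule of the first webhook to CREATE only
kubectl patch mutatingwebhookconfiguration sgx-webhook-mutating-webhook-configuration \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/rules/0/operations", "value": ["CREATE"]}]'
Note that if the configuration is managed by a Helm chart or an operator, such a patch may be reverted on the next reconcile, so the change should also be made in whatever deploys the webhook.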
Feel free to close this issue immediately. :)