
CI: nightly az-ubuntu-2204 jobs are failing due to timeout #391

Closed

portersrc opened this issue Jun 26, 2024 · 16 comments

@portersrc
Member

portersrc commented Jun 26, 2024

This occurs after successful tests and during the uninstall step.

Example failure: https://github.com/confidential-containers/operator/actions/runs/9638492426/job/26579262744

@wainersm
Member

Hi @portersrc !

About logging the uninstall test better: that's a good idea. I already had suspicions about that area, since I created an issue about it some time ago: #181

@wainersm
Member

@ldoktor also worked on the uninstall test a while ago, if I'm not mistaken, so he may be able to give us some ideas too.

@ldoktor
Contributor

ldoktor commented Jun 27, 2024

In my attempts it hung completely from time to time (I let it install/uninstall in a loop overnight; I don't remember how many iterations it survived, but eventually it stalled indefinitely). So to me it looks like a real bug and not a CI timeout issue.

@portersrc
Member Author

portersrc commented Jun 27, 2024

I'm wondering if a kubectl delete -k is hanging in uninstall_ccruntime or uninstall_operator.

The latest operator nightly CI once again failed on az-ubuntu-2204 but passed on az-ubuntu-2004.

The az-ubuntu-2004 log has this (line 1362 in the "Run e2e tests" step):

...ccruntime.confidentialcontainers.org/ccruntime-sample created
::info:: No new restarts in 3x21s, proceeding...
::info:: Run tests
INFO: Running operator tests for kata-qemu
1..2
ok 1 [cc][operator] Test can uninstall the operator
ok 2 [cc][operator] Test can reinstall the operator
::info:: Uninstall the operator
ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
namespace "confidential-containers-system" deleted
customresourcedefinition.apiextensions.k8s.io "ccruntimes.confidentialcontainers.org" deleted
serviceaccount "cc-operator-controller-manager" deleted
role.rbac.authorization.k8s.io "cc-operator-leader-election-role" deleted
clusterrole.rbac.authorization.k8s.io "cc-operator-manager-role" deleted
clusterrole.rbac.authorization.k8s.io "cc-operator-metrics-reader" deleted
clusterrole.rbac.authorization.k8s.io "cc-operator-proxy-role" deleted
rolebinding.rbac.authorization.k8s.io "cc-operator-leader-election-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "cc-operator-manager-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "cc-operator-proxy-rolebinding" deleted
configmap "cc-operator-manager-config" deleted
service "cc-operator-controller-manager-metrics-service" deleted
deployment.apps "cc-operator-controller-manager" deleted
No resources found in confidential-containers-system namespace.
::info:: Shutdown the cluster
...

whereas the az-ubuntu-2204 log has this (line 2003 in the "Run e2e tests" step):

...
ccruntime.confidentialcontainers.org/ccruntime-sample created
::info:: No new restarts in 3x21s, proceeding...
::info:: Run tests
INFO: Running operator tests for kata-qemu
1..2
ok 1 [cc][operator] Test can uninstall the operator
ok 2 [cc][operator] Test can reinstall the operator
::info:: Uninstall the operator
ccruntime.confidentialcontainers.org "ccruntime-sample" deleted
Error: The action 'Run e2e tests' has timed out after 45 minutes.

In other words, after ccruntime.confidentialcontainers.org "ccruntime-sample" deleted, we never see namespace "confidential-containers-system" deleted in the az-ubuntu-2204 case. If a kubectl delete -k command is hanging, that could also explain why our logs don't show any output from wait_for_process (because control never reaches it).

Any thoughts are welcome. We could add logging based on this, or maybe someone will already spot the error. The kustomization.yaml files kubectl uses for deletion seem to be here (for uninstalling the ccruntime) and here (for uninstalling the operator).
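
For illustration, a minimal sketch of the kind of logging/timeout wrapper meant above; the helper name and the 600s value are made up here, and the real tests/e2e scripts may structure this differently:

# Hypothetical helper, not the actual operator.sh code: make a hanging delete
# fail loudly instead of silently consuming the 45-minute job budget.
uninstall_with_timeout() {
    local dir="$1"
    echo "::info:: kubectl delete -k $dir (600s timeout)"
    if ! timeout 600 kubectl delete -k "$dir"; then
        echo "::error:: kubectl delete -k $dir hung or failed; dumping leftovers"
        kubectl get all -n confidential-containers-system || true
        return 1
    fi
}

Calling something like this from uninstall_ccruntime / uninstall_operator would at least pinpoint which of the two deletions is stuck.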

@ldoktor
Contributor

ldoktor commented Aug 8, 2024

I'm hitting this hard when using GitHub runners:

(a modified version that waits for stabilization, then executes the removal with a timeout multiple times, with some extra debug output)

The kubectl delete hangs in finalizers. I tried reproducing it locally with:

kcli create vm -i ubuntu2204 -P memory=8G -P numcpus=4 -P disks=[50] e2e
kcli ssh e2e
apt-get install ansible git
git clone https://github.com/confidential-containers/operator.git
cd operator
patch -p1 << \EOF
diff --git a/tests/e2e/run-local.sh b/tests/e2e/run-local.sh
index fe407d4..e737e03 100755
--- a/tests/e2e/run-local.sh
+++ b/tests/e2e/run-local.sh
@@ -112,6 +112,7 @@ main() {
        echo "::info:: Build and install the operator"
        step_install_operator=1
        run 20m sudo -E PATH="$PATH" bash -c './operator.sh'
+       exit -12
 
        echo "::info:: Run tests"
        local cmd="run 20m sudo -E PATH=\"$PATH\" bash -c "
EOF
cd tests/e2e
bash -x run-local.sh
# This should exit with -12 meaning operator is installed and that's it

# Let's try removing it now
cd ../../config/samples/ccruntime/default
sudo -E KUBECONFIG=/etc/kubernetes/admin.conf kubectl -v=8  delete -k .

and at least today on my computer it hangs 100% of the time.

In logs I can see:

kubectl -n confidential-containers-system logs -f daemonset.apps/cc-operator-daemon-uninstall

Environment variables passed to this script
* NODE_NAME: e2e
* DEBUG: false
* SHIMS: clh qemu qemu-tdx qemu-sev qemu-snp
* DEFAULT_SHIM: qemu
* CREATE_RUNTIMECLASSES: true
* CREATE_DEFAULT_RUNTIMECLASS: true
* ALLOWED_HYPERVISOR_ANNOTATIONS: 
* SNAPSHOTTER_HANDLER_MAPPING: clh:nydus,qemu:nydus,qemu-tdx:nydus,qemu-sev:nydus,qemu-snp:nydus
* AGENT_HTTPS_PROXY: 
* AGENT_NO_PROXY: 
* PULL_TYPE_MAPPING: 
Validating the snapshotter-handler mapping: "clh:nydus,qemu:nydus,qemu-tdx:nydus,qemu-sev:nydus,qemu-snp:nydus"
Deleting the runtime classes
Deleting the kata-clh runtime class
runtimeclass.node.k8s.io "kata-clh" deleted
Deleting the kata-qemu runtime class
runtimeclass.node.k8s.io "kata-qemu" deleted
Deleting the kata-qemu-tdx runtime class
runtimeclass.node.k8s.io "kata-qemu-tdx" deleted
Deleting the kata-qemu-sev runtime class
runtimeclass.node.k8s.io "kata-qemu-sev" deleted
Deleting the kata-qemu-snp runtime class
runtimeclass.node.k8s.io "kata-qemu-snp" deleted
Deleting the kata runtime class for the default shim (an alias for kata-qemu)
runtimeclass.node.k8s.io "kata" deleted
node/e2e not labeled
deleting kata artifacts
node/e2e unlabeled

the post-uninstall also finishes and labels the node correctly:

INSTALL_COCO_CONTAINERD: false
INSTALL_OFFICIAL_CONTAINERD: false
INSTALL_VFIO_GPU_CONTAINERD: false
INSTALL_NYDUS_SNAPSHOTTER: true
Remove nydus snapshotter from containerd
Removed /etc/systemd/system/containerd.service.requires/nydus-snapshotter.service.
Restarting containerd
Removing nydus-snapshotter artifacts from host
node/e2e labeled

but the resources stay there:

NAME                                                 READY   STATUS    RESTARTS   AGE
pod/cc-operator-controller-manager-ccbbcfdf7-cgntl   2/2     Running   0          15m
pod/cc-operator-daemon-install-rbjmb                 1/1     Running   0          11m
pod/cc-operator-post-uninstall-daemon-2p4gl          1/1     Running   0          65s
pod/cc-operator-pre-install-daemon-dbm59             1/1     Running   0          11m

NAME                                                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/cc-operator-controller-manager-metrics-service   ClusterIP   10.96.21.127   <none>        8443/TCP   15m

NAME                                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
daemonset.apps/cc-operator-daemon-install          1         1         1       1            1           node.kubernetes.io/worker=               11m
daemonset.apps/cc-operator-daemon-uninstall        0         0         0       0            0           katacontainers.io/kata-runtime=cleanup   11m
daemonset.apps/cc-operator-post-uninstall-daemon   1         1         1       1            1           node.kubernetes.io/worker=               65s
daemonset.apps/cc-operator-pre-install-daemon      1         1         1       1            1           node.kubernetes.io/worker=               11m

NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cc-operator-controller-manager   1/1     1            1           15m

NAME                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/cc-operator-controller-manager-ccbbcfdf7   1         1         1       15m

node labels:

NAME   STATUS   ROLES           AGE   VERSION   LABELS
e2e    Ready    control-plane   15m   v1.24.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cc-postuninstall/done=true,cc-preinstall/done=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=e2e,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node.kubernetes.io/exclude-from-external-load-balancers=,node.kubernetes.io/worker=
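
For anyone poking at the same state, a quick way to see what is actually blocking the deletion (standard kubectl; the resource name comes from the output above, and you may need the same sudo -E KUBECONFIG=/etc/kubernetes/admin.conf prefix as in the repro):

# Which finalizers are still attached to the CR that kubectl delete is waiting on
kubectl get ccruntime ccruntime-sample -o jsonpath='{.metadata.finalizers}{"\n"}'

# Whether the deletion has at least been requested (deletionTimestamp set)
kubectl get ccruntime ccruntime-sample -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'

# Last-resort workaround that masks the bug (useful only to unblock a local box):
# kubectl patch ccruntime ccruntime-sample --type merge -p '{"metadata":{"finalizers":null}}'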

@ldoktor
Contributor

ldoktor commented Aug 8, 2024

@bpradipt, @fidencio, would you have some time to look at this, please? It used to be rare, but currently it's 100% reproducible on my kcli setup and it blocks the move from Azure runners to GitHub runners.

@ldoktor
Contributor

ldoktor commented Aug 9, 2024

Actually, I just tried quay.io/kata-containers/kata-deploy:latest and that one passed. So it seems to be a recent change on the kata-containers side...

@ldoktor
Contributor

ldoktor commented Aug 9, 2024

Yes, it works well with kata-containers-43dca8deb4891678f8a62c112749ac0938b373c6-amd64 but fails with kata-containers-63802ecdd9d608ae361be90c72749f7a1e9d5c3e-amd64. @zvonkok, it looks like one of your changes between 43dca8deb4891678f8a62c112749ac0938b373c6..63802ecdd9d608ae361be90c72749f7a1e9d5c3e (kata-containers) broke the deployment (which seems likely given the commit messages in the git log). Would you have any idea why it hangs indefinitely in the finalizers?
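
For completeness, the suspect range can be listed on the kata-containers side with plain git (the CI tags above embed the commit SHAs); limiting it to the kata-deploy packaging path is just a guess about where the change lives:

git clone https://github.com/kata-containers/kata-containers.git
cd kata-containers
git log --oneline 43dca8deb4891678f8a62c112749ac0938b373c6..63802ecdd9d608ae361be90c72749f7a1e9d5c3e -- tools/packaging/kata-deploy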

@wainersm
Member

wainersm commented Aug 9, 2024

Hi @ldoktor ! Amazing debug!

Yes, it works well with kata-containers-43dca8deb4891678f8a62c112749ac0938b373c6-amd64 but fails with kata-containers-63802ecdd9d608ae361be90c72749f7a1e9d5c3e-amd64. @zvonkok, it looks like one of your changes between 43dca8deb4891678f8a62c112749ac0938b373c6..63802ecdd9d608ae361be90c72749f7a1e9d5c3e (kata-containers) broke the deployment (which seems likely given the commit messages in the git log). Would you have any idea why it hangs indefinitely in the finalizers?

I suspect it's caused by kata-containers/kata-containers@8d9bec2, which introduced a call to reset_runtime in the kata-deploy.sh cleanup command. reset_runtime calls wait_till_node_is_ready, which in turn enters an infinite loop if the node never becomes Ready again.
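
If that is what happens, bounding the wait would at least turn the hang into a visible failure. A rough sketch of such a bounded loop (illustrative only; the real wait_till_node_is_ready in kata-deploy.sh looks different):

# Sketch: give up after a deadline instead of looping forever
wait_till_node_is_ready() {
    local node="$1"
    local deadline=$(( $(date +%s) + ${2:-600} ))
    while true; do
        ready=$(kubectl get node "$node" \
            -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
        [ "$ready" = "True" ] && return 0
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "Node $node not Ready within the deadline" >&2
            return 1
        fi
        sleep 5
    done
}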

@fidencio
Member

So, I took a quick look at this on my side. The interesting part is that the uninstall step, which is the part that Zvonko touched, finishes without issues, and the pod finishes.

@ldoktor, @wainersm, would you be able to add some debug output to the latest script on the kata-containers side and also on the pre-reqs side, and then debug it using those scripts?

I've noticed at least one part of the operator that will need changes, but my blind shot at it didn't help with the situation.

@fidencio
Member

which in turn enters an infinite loop if the node never becomes Ready again.

I challenge that; I'm seeing uninstall issues happening with the node set as Ready. :-/

@fidencio
Member

ubuntu@operator:~$ kubectl get pods -A --watch
NAMESPACE                        NAME                                             READY   STATUS              RESTARTS        AGE
confidential-containers-system   cc-operator-controller-manager-ccbbcfdf7-bq8ch   2/2     Running             0               24s
confidential-containers-system   cc-operator-daemon-install-h5cmc                 0/1     ContainerCreating   0               0s
confidential-containers-system   cc-operator-pre-install-daemon-mnsv8             1/1     Running             0               3s
kube-flannel                     kube-flannel-ds-8rh8h                            1/1     Running             0               41s
kube-system                      coredns-6d4b75cb6d-92bhj                         1/1     Running             0               41s
kube-system                      coredns-6d4b75cb6d-p6kbd                         1/1     Running             0               41s
kube-system                      etcd-operator                                    1/1     Running             2 (4m25s ago)   55s
kube-system                      kube-apiserver-operator                          1/1     Running             2 (4m3s ago)    55s
kube-system                      kube-controller-manager-operator                 1/1     Running             2 (4m3s ago)    55s
kube-system                      kube-proxy-9jv97                                 1/1     Running             0               41s
kube-system                      kube-scheduler-operator                          1/1     Running             2 (4m3s ago)    58s
confidential-containers-system   cc-operator-daemon-install-h5cmc                 1/1     Running             0               1s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               0/1     Pending             0               0s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               0/1     Pending             0               0s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               0/1     ContainerCreating   0               0s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               1/1     Running             0               2s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               1/1     Terminating         0               2s
kube-system                      kube-scheduler-operator                          0/1     Running             2 (8m21s ago)   5m16s
kube-system                      kube-controller-manager-operator                 0/1     Running             2 (8m21s ago)   5m13s
kube-system                      kube-apiserver-operator                          0/1     Running             2 (8m22s ago)   5m14s
kube-system                      etcd-operator                                    0/1     Running             2 (8m44s ago)   5m14s
kube-system                      coredns-6d4b75cb6d-92bhj                         0/1     Running             0               5m
kube-system                      coredns-6d4b75cb6d-p6kbd                         0/1     Running             0               5m1s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               0/1     Terminating         0               11s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               0/1     Terminating         0               11s
confidential-containers-system   cc-operator-daemon-uninstall-bvpsw               0/1     Terminating         0               11s
confidential-containers-system   cc-operator-controller-manager-ccbbcfdf7-bq8ch   1/2     Running             0               4m45s
kube-system                      kube-scheduler-operator                          0/1     Running             2 (8m25s ago)   5m20s
kube-system                      coredns-6d4b75cb6d-92bhj                         1/1     Running             0               5m3s
kube-system                      kube-scheduler-operator                          1/1     Running             2 (8m26s ago)   5m21s
kube-system                      coredns-6d4b75cb6d-p6kbd                         1/1     Running             0               5m4s
confidential-containers-system   cc-operator-controller-manager-ccbbcfdf7-bq8ch   2/2     Running             0               4m48s
kube-system                      kube-controller-manager-operator                 1/1     Running             2 (8m27s ago)   5m19s
kube-system                      kube-apiserver-operator                          1/1     Running             2 (8m28s ago)   5m20s
kube-system                      etcd-operator                                    1/1     Running             2 (8m50s ago)   5m20s

As you can see, the uninstall pod comes up and does the right thing.
And then checking /opt/kata...

ubuntu@operator:~$ ls /opt/kata/
ubuntu@operator:~$

So, it makes me think that there's something wrong with the operator logic, which ended up being exposed by that commit.

@wainersm
Member

Hi @fidencio! Thanks for promptly looking into this issue!

In #412 I added some debug messages. Looking at job https://github.com/confidential-containers/operator/actions/runs/10357622816/job/28670039449?pr=412 , it is getting stuck at https://github.com/confidential-containers/operator/blob/main/tests/e2e/operator.sh#L166 . It halts in kubectl delete -k . until it hits the test timeout of 600s.

One thing that I didn't understand from the logs you sent in #391 (comment) is whether cc-operator-daemon-uninstall-bvpsw hangs in Terminating or really reaches Terminated.
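
(One way to tell the two apart, assuming the pod object is still visible at that point; the pod name is taken from the watch output above:)

# deletionTimestamp set + containers still running == stuck Terminating;
# a terminated state block means the containers actually exited
kubectl -n confidential-containers-system get pod cc-operator-daemon-uninstall-bvpsw \
    -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.status.containerStatuses[*].state}{"\n"}'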

ldoktor added a commit to ldoktor/coco-operator that referenced this issue Aug 13, 2024
the CI is currently blocked due to recent change in kata-containers as
described here:

    confidential-containers#391

let's restore the testing with latest-known-to-work kata-containers to
allow the CI to run

Signed-off-by: Lukáš Doktor <[email protected]>
@ldoktor
Contributor

ldoktor commented Aug 14, 2024

Hello @fidencio, I have one new observation. I tried to reproduce it with kata-deploy; at first I wasn't successful, but on a repeated attempt it broke the kcli node.

What I used is:

pushd kata-containers/tests/integration/kubernetes
CLUSTER_DISK_SIZE=40 ./gha-run.sh create-cluster-kcli
pushd kata-containers/ci/openshift-ci
KATA_RUNTIME=kata-qemu KATA_DEPLOY_IMAGE=quay.io/kata-containers/kata-deploy-ci:kata-containers-63802ecdd9d608ae361be90c72749f7a1e9d5c3e-amd64 ./test.sh
# Then I manually modified the
#    tools/packaging/kata-deploy/kata-cleanup/base/kata-cleanup.yaml
# to use:
#    quay.io/kata-containers/kata-deploy-ci:kata-containers-63802ecdd9d608ae361be90c72749f7a1e9d5c3e-amd64
./cleanup

# Then I repeated the same steps a second time:
KATA_RUNTIME=kata-qemu KATA_DEPLOY_IMAGE=quay.io/kata-containers/kata-deploy-ci:kata-containers-63802ecdd9d608ae361be90c72749f7a1e9d5c3e-amd64 ./test.sh

The pod never started and the worker node went NotReady:

  Warning  CgroupV1                 2m27s              kubelet          Cgroup v1 support is in maintenance mode, please migrate to Cgroup v2.
  Warning  InvalidDiskCapacity      2m27s              kubelet          invalid capacity 0 on image filesystem
  Normal   NodeAllocatableEnforced  2m27s              kubelet          Updated Node Allocatable limit across pods
  Normal   Starting                 2m27s              kubelet          Starting kubelet.
  Normal   NodeHasNoDiskPressure    2m27s              kubelet          Node kata-k8s-worker-0 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m27s              kubelet          Node kata-k8s-worker-0 status is now: NodeHasSufficientPID
  Normal   Starting                 2m25s              kubelet          Starting kubelet.
  Warning  CgroupV1                 2m25s              kubelet          Cgroup v1 support is in maintenance mode, please migrate to Cgroup v2.
  Normal   NodeAllocatableEnforced  2m25s              kubelet          Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  2m25s              kubelet          Node kata-k8s-worker-0 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    2m25s              kubelet          Node kata-k8s-worker-0 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m25s              kubelet          Node kata-k8s-worker-0 status is now: NodeHasSufficientPID
  Normal   NodeNotReady             83s                kubelet          Node kata-k8s-worker-0 status is now: NodeNotReady
  Warning  ContainerGCFailed        25s (x2 over 85s)  kubelet          rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService

When using kata-containers-43dca8deb4891678f8a62c112749ac0938b373c6-amd64, things work repeatedly. Do you want me to create a kata-containers issue about that, or is the message here sufficient?

beraldoleal added a commit to beraldoleal/operator that referenced this issue Aug 14, 2024
We are currently experiencing issues with finalizers hanging when
deleting ccruntime. Initial debugging has pinpointed the problem to the
processCcRuntimeDeleteRequest method.

This method is large and has a cyclomatic complexity of 21. We should
take this opportunity to use early returns to reduce nesting and
simplify control flow. Additionally, this is moving the finalizer
handling logic to its own method.

This refactor should not change the current logic, only improve
readability and maintainability.

Related to confidential-containers#391.

Signed-off-by: Beraldo Leal <[email protected]>
@github-project-automation github-project-automation bot moved this to 🆕 New in CoCo Releases Aug 15, 2024
@ldoktor
Contributor

ldoktor commented Aug 16, 2024

kata-containers/kata-containers#10169 seems to solve this issue (one iteration so far). Let me still run a loop for a while to double-check whether it's really stable now or just failing less frequently.
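
A sketch of the kind of loop used for that double-check, reusing the sample directory from the reproducer earlier in this thread (timeouts, sleep and iteration count are arbitrary):

cd operator/config/samples/ccruntime/default
for i in $(seq 1 20); do
    echo "=== iteration $i ==="
    sudo -E KUBECONFIG=/etc/kubernetes/admin.conf timeout 10m kubectl apply -k . || break
    sleep 120   # give the install daemon time to settle
    sudo -E KUBECONFIG=/etc/kubernetes/admin.conf timeout 10m kubectl delete -k . \
        || { echo "hung/failed at iteration $i"; break; }
done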

@ldoktor
Contributor

ldoktor commented Aug 16, 2024

20 iterations in a row are passing; let's close it for now, and we can re-open it if it starts failing in the CI. Big thanks to everyone involved.

@ldoktor ldoktor closed this as completed Aug 16, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in CoCo Releases Aug 16, 2024
beraldoleal added a commit to beraldoleal/operator that referenced this issue Aug 20, 2024