CI: nightly az-ubuntu-2204 jobs are failing due to timeout #391
Hi @portersrc! About logging the uninstall test better, that's a good idea. I'm suspicious, because I created an issue about it some time ago: #181
@ldoktor also worked on the uninstall test a while ago, if I'm not mistaken, so he can give us some ideas too.
In my attempts it hung completely from time to time (I let it install/uninstall in a loop overnight; I don't remember how many iterations it survived, but eventually it stalled indefinitely). So to me it looks like a real bug and not a CI timeout issue.
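For anyone who wants to reproduce it the same way, a minimal sketch of such an overnight install/uninstall loop is below. The kustomize path, namespace, and timeouts are assumptions, not the operator's actual test code; adjust them to your checkout.

```bash
#!/usr/bin/env bash
# Sketch: repeatedly create and delete the CcRuntime CR, bounding the delete
# with a timeout so a stuck finalizer surfaces as a failure instead of an
# infinite hang. Paths and names below are assumptions.
set -euo pipefail

CCRUNTIME_DIR="config/samples/ccruntime/default"   # hypothetical sample path
NS="confidential-containers-system"

for i in $(seq 1 "${ITERATIONS:-50}"); do
    echo "=== iteration $i: create CcRuntime ==="
    kubectl apply -k "$CCRUNTIME_DIR"
    # Give the install daemonset time to converge before tearing it down again.
    sleep 120

    echo "=== iteration $i: delete CcRuntime ==="
    if ! timeout 600 kubectl delete -k "$CCRUNTIME_DIR"; then
        echo "iteration $i: delete timed out; dumping remaining state" >&2
        kubectl get ccruntime -o yaml || true
        kubectl get pods -n "$NS" || true
        exit 1
    fi
done
```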
I'm wondering if a […]

The latest operator nightly CI once again failed on az-ubuntu-2204 but passed on az-ubuntu-2004. The az-ubuntu-2004 log has this (line 1362 in the "Run e2e tests" step):

whereas the az-ubuntu-2204 log has this (line 2003 in the "Run e2e tests" step):

In other words, after […]. Any thoughts welcome. We could add logging based on this, or maybe someone has already spotted the error. The kustomization.yaml files […]
I'm hitting this hard when using GitHub runners (a modified version that waits for stabilization, then executes the removal with a timeout multiple times, with some extra debug output). The […] and at least today on my computer it hangs 100% of the time. In the logs I can see:
the post-install also finishes and labels correctly:
but the resources stay there:
node labels:
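The actual label and resource listings are not preserved above; when the uninstall wedges like this, the leftover state can be inspected with something like the sketch below. The namespace and the node-label grep pattern are assumptions about the deployment.

```bash
#!/usr/bin/env bash
# Snapshot of what is left behind when the CcRuntime delete hangs.
NS="confidential-containers-system"

# Is the CR still there, and which finalizers are keeping it alive?
kubectl get ccruntime -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.finalizers}{"\n"}{end}'

# Daemonsets/pods that should have been cleaned up.
kubectl get ds,pods -n "$NS" -o wide

# Node labels applied by the pre-/post-install hooks.
kubectl get nodes --show-labels | grep -i katacontainers || true
```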
Actually I just tried quay.io/kata-containers/kata-deploy:latest and that one passed. So it seems to be a new thing on the kata-containers side...
Yes, it works well with kata-containers-43dca8deb4891678f8a62c112749ac0938b373c6-amd64 but fails with kata-containers-63802ecdd9d608ae361be90c72749f7a1e9d5c3e-amd64. @zvonkok it looks like one of your changes between 43dca8deb4891678f8a62c112749ac0938b373c6..63802ecdd9d608ae361be90c72749f7a1e9d5c3e (kata-containers) broke the deployment (which seems likely given the commit messages in that range). Would you have any idea why it hangs indefinitely in the finalizers?
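To narrow that range down further, one option is to list or bisect the commits between the known-good and known-bad payloads. This is only a sketch: it assumes a local kata-containers checkout, and reproduce.sh is a hypothetical wrapper that deploys the matching payload and runs the install/uninstall test.

```bash
# List the candidate commits between the known-good and known-bad payloads.
git -C kata-containers log --oneline \
    43dca8deb4891678f8a62c112749ac0938b373c6..63802ecdd9d608ae361be90c72749f7a1e9d5c3e

# Or bisect with a test script (bad commit first, then good).
git -C kata-containers bisect start \
    63802ecdd9d608ae361be90c72749f7a1e9d5c3e \
    43dca8deb4891678f8a62c112749ac0938b373c6
git -C kata-containers bisect run ./reproduce.sh   # hypothetical reproducer
```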
Hi @ldoktor! Amazing debugging!
I suspect it's caused by kata-containers/kata-containers@8d9bec2, which introduced a call to […]
So, I took a quick look at this on my side; the interesting part is that the […]

@ldoktor, @wainersm, would you be able to add some debug output to the latest script on the kata-containers side and also on the pre-reqs side, and then debug it using those scripts? I've noticed at least one part of the operator that will need changes, but my blind shot didn't help with the situation.
I challenge that; I'm seeing uninstall issues happening with the node set as […]

As you can see, the uninstall pod comes up and does the right thing.
So it makes me think that there's something wrong with the operator logic, which ended up being exposed by that commit.
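If the operator logic is the suspect, one way to see where the delete gets wedged is to watch the CR's finalizers and the controller logs while the uninstall runs. A sketch, where the CR name, namespace, and deployment name are assumptions about the installation rather than values taken from this thread:

```bash
#!/usr/bin/env bash
# Watch whether the CcRuntime finalizer is ever removed, and what the
# controller-manager is doing in the meantime. Names are assumptions.
NS="confidential-containers-system"

# Deletion timestamp and finalizers on the CR being deleted.
kubectl get ccruntime ccruntime-sample \
    -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'

# Controller-manager logs around the delete request.
kubectl logs -n "$NS" deployment/cc-operator-controller-manager --tail=200 -f
```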
Hi @fidencio! Thanks for promptly looking into this issue! In #412 I added some debug messages; looking at job https://github.com/confidential-containers/operator/actions/runs/10357622816/job/28670039449?pr=412, it is getting stuck in https://github.com/confidential-containers/operator/blob/main/tests/e2e/operator.sh#L166 . It halts in […]

One thing that I didn't understand in the logs you sent in #391 (comment) is whether […]
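For context, a generic way to make that kind of wait loop fail loudly instead of hanging is sketched below. This is illustrative only and is not the actual code in tests/e2e/operator.sh; the CR and namespace names are assumptions.

```bash
# Sketch: bound a "wait until the CR is gone" loop and dump state on timeout,
# so a stuck finalizer turns into a diagnosable failure instead of a silent hang.
wait_for_ccruntime_deletion() {
    local deadline=$((SECONDS + 600))
    while kubectl get ccruntime ccruntime-sample >/dev/null 2>&1; do
        if ((SECONDS >= deadline)); then
            echo "CcRuntime still present after 600s; dumping state" >&2
            kubectl get ccruntime ccruntime-sample -o yaml >&2
            kubectl get pods -n confidential-containers-system >&2
            return 1
        fi
        echo "still waiting for CcRuntime deletion... ($((deadline - SECONDS))s left)"
        sleep 10
    done
}
```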
The CI is currently blocked due to a recent change in kata-containers, as described here: confidential-containers#391. Let's restore the testing with the latest-known-to-work kata-containers to allow the CI to run.

Signed-off-by: Lukáš Doktor <[email protected]>
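The exact pinning mechanism lives in the operator's e2e setup; purely as an illustration (not the actual change), overriding the payload image to a known-good build with kustomize could look like the following, where the directory and image names are assumptions.

```bash
# Sketch: pin the payload image used by the CcRuntime sample to a known-good build.
cd config/samples/ccruntime/default   # hypothetical sample directory
kustomize edit set image \
    quay.io/kata-containers/kata-deploy=quay.io/kata-containers/kata-deploy-ci:kata-containers-43dca8deb4891678f8a62c112749ac0938b373c6-amd64
```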
Hello @fidencio, I have one new observation. I tried to reproduce it with kata-deploy; at first I wasn't successful, but on a repeated attempt it broke the kcli node. What I used is:
The pod never started and the worker node went NotReady:
When using kata-containers-43dca8deb4891678f8a62c112749ac0938b373c6-amd64 things work repeatedly. Do you want me to create a kata-containers issue about that, or is the message here sufficient?
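For what it's worth, when a worker goes NotReady like that, the usual first-pass checks on the node look something like the sketch below; the node name is a placeholder and the service names assume a containerd-based kcli node.

```bash
# Why does the kubelet report NotReady?
kubectl describe node <worker-node> | sed -n '/Conditions:/,/Addresses:/p'

# On the node itself: kubelet and container runtime health around the failure.
ssh <worker-node> 'sudo journalctl -u kubelet -u containerd --since "30 min ago" --no-pager | tail -n 200'
ssh <worker-node> 'sudo crictl ps -a | head'
```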
We are currently experiencing issues with finalizers hanging when deleting ccruntime. Initial debugging has pinpointed the problem to the processCcRuntimeDeleteRequest method. This method is large and has a cyclomatic complexity of 21. We should take this opportunity to use early returns to reduce nesting and simplify control flow. Additionally, this moves the finalizer-handling logic into its own method. This refactor should not change the current logic, only improve readability and maintainability.

Related to confidential-containers#391.

Signed-off-by: Beraldo Leal <[email protected]>
kata-containers/kata-containers#10169 seems to solve this issue (one passing iteration so far). Let me still run a loop for a while to double-check whether it's really stable now or just failing less frequently.
20 iterations in a row are passing, so let's close this for now; we can re-open it if it starts failing in the CI again. Big thanks to everyone involved.
This occurs after successful tests and during the uninstall step.
Example failure: https://github.com/confidential-containers/operator/actions/runs/9638492426/job/26579262744