Cluster API periodic test jobs are getting stuck in pending #8560
cc @kubernetes-sigs/cluster-api-release-team
I played around a bit and found:
I see it in other release branches as well (1.2, 1.3 & 1.4)
Actually, I do not think this is necessary; as long as you are authorized with Prow via your GH account, anyone can cancel and re-run the job (tried it myself).
Since we do not have direct access to the Prow cluster, I can't think of a way we could check the Prow components.
@furkatgofurov7 were you testing rerun on PR jobs or periodics? I wasn't able to rerun periodics until we added the cluster-api-maintainers group to the rerun auth config in our ProwJob config a few weeks back.
I meant looking into the source code plus the data we see under artifacts. Some components like the UI (Deck) can also be run locally without access to the cluster.
I believe periodic, and it was the one from the filter you provided: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.2#capi-e2e-mink8s-release-1-2
👍🏼
That's strange, this shouldn't be possible (xref: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes-sigs/cluster-api/cluster-api-periodics-release-1-2.yaml#L213-L216). (Btw, on PRs it's different; I'm not sure exactly, but either the PR author, org members, or everyone can restart there.) Independent of all of that, we can definitely also start with the following:
/triage accepted
Two new ones are stuck in Pending. Both are marked as failed in Spyglass:
Both show the following at the end:
Note: I don't have permissions to stop or restart them.
I wonder what happened when CAPD tried to remove this container (I assume it tried)? Do we know if we see the same error in jobs which then do not get stuck in pending? Not sure if there is a way to force-remove the container beyond what the job already tries. Maybe there's something in the logs of this container which explains why it's not shutting down.
I was not able to figure anything out from the CAPD or other logs for either job. For both occurrences, CAPD logged the DockerMachine deletion as successful.
I did not find any via k8s-triage.
According to this, it is already a
We don't fetch them currently. Docker logs and containerd logs did not help in this case.
I meant the logs of the container of the workload cluster, e.g. https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1653406428393639936/artifacts/clusters/quick-start-s3y53t/machines/ (that's the one which doesn't shut down, right?). Maybe we can also extend CAPD to actually check that the container is gone (?)
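For context, a minimal manual check along these lines would confirm whether the container still exists on the node (a sketch; the container name is hypothetical and just follows the usual CAPD pattern of cluster name plus node suffix):

```bash
# Hypothetical container name of the workload-cluster node; adjust to the real one.
CONTAINER="quick-start-s3y53t-control-plane-xyz"

if docker inspect --format '{{.State.Status}}' "${CONTAINER}" >/dev/null 2>&1; then
  echo "${CONTAINER} still exists:"
  docker inspect --format '{{.State.Status}} (pid {{.State.Pid}})' "${CONTAINER}"
else
  echo "${CONTAINER} is gone"
fi
```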
Nope, the logs or data of the node/machine relevant in this case are not there (where they should be). However, CAPD logs the container's stdout before removing it, and there is nothing suspicious in it 🤷♂️ (source: search for the machine name).
Yeah, that'd maybe be a good idea to get closer to the issue.
Detecting that case in CAPD would make it possible to dump more data (like all the files we usually have in artifacts).
One idea to proceed: Check if this is caused by a zombie process.
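A quick way to check for that on the node could look like this (a sketch; zombie processes show up with state Z in ps):

```bash
# List zombie (defunct) processes, if any, together with their parent PID.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

# If zombies show up, their parent is what failed to reap them:
# ps -o pid,comm -p <PPID>
```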
Two more pending. There is no https://prow.k8s.io/?repo=kubernetes-sigs%2Fcluster-api&state=pending
Restarted + cherry-picked your PR
I'm not sure what we already have. Do you think it makes sense to run docker inspect + docker logs on all leftover containers at this point? I wonder if that container is even in a "shutting down" state.
Good question. It would also be good to take a look at the PIDs of the leftover ones, e.g. via the commands sketched below.
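A rough sketch of what such a debug pass could look like (assuming the leftover containers can be identified via docker ps; the name filter and container names are guesses):

```bash
# Inspect and dump logs of all leftover workload-cluster containers.
# The name filter is hypothetical; adjust it to the actual leftover cluster.
for c in $(docker ps -a --filter "name=quick-start-" --format '{{.Names}}'); do
  echo "=== ${c} ==="
  docker inspect "${c}"            # full state, incl. .State.Status and .State.Pid
  docker logs --tail 100 "${c}"    # last log lines of the container
done

# Then look at the main process of a stuck container via /proc:
PID=$(docker inspect --format '{{.State.Pid}}' quick-start-xyz-control-plane)  # hypothetical name
grep State "/proc/${PID}/status"   # R/S/D/Z state of the process
```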
Sounds good
cc @sbueringer: a new stuck-in-pending job, ready to get deleted :-)

Some analysis from the newest pending-but-finished job: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-dualstack-ipv6-main/1663826741237387264

Continuing on the new data and the stuck runc init process:

root 113973 97202 0 09:01 ? 00:00:00 runc init

We got the following outputs for the stuck process:
So we can see that the reason the process does not finish is that it is in an uninterruptible sleep (D) state. From issues researched at runc (e.g. opencontainers/runc#3663, opencontainers/runc#2753), there have been similar issues in the past, and the source of the issue could also be in the kernel. We could try to get more output via:
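For example, something along these lines (a sketch, assuming root access on the node; the PID is the stuck runc init from the ps output above):

```bash
PID=113973   # the stuck "runc init" process from the ps output above

cat /proc/${PID}/wchan            # kernel function the task is blocked in
cat /proc/${PID}/stack            # kernel stack trace of the task (needs root)
grep State /proc/${PID}/status    # should show "D (disk sleep)"

# Dump all blocked (D-state) tasks on the node into the kernel log:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
```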
But I don't know if that information would help us further; this is just some random Google find which could help with digging into it (I would first have to learn what that output means).
Maybe a good next step is to first move some jobs to the community cluster, and then maybe we can do some debugging with @ameukam (when Christian is back from PTO). P.S. Restarted the pending job.
Another job in pending, but it has been failed for two days:
Thx. Restarted it so we have test coverage for the release today.
@sbueringer: another one has been pending for two days:
Note: @sbueringer, another three pending today.
/priority important-longterm
Haven't seen this in a while. @adilGhaffarDev Are you aware of any recent occurrences?
cc @killianmuldoon @chrischdi (in case you unblocked any recently)
Same here, I haven't seen any for a while.
Let's close for now and reopen if it occurs again.

/close
@sbueringer: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What steps did you take and what happened?
From time to time some of our periodic test jobs are getting stuck in pending. I.e. a run of the test job is shown as Pending and no new runs are scheduled.
Example:
(URL: https://prow.k8s.io/?repo=kubernetes-sigs%2Fcluster-api&job=periodic*&state=pending)
In this example, periodic-cluster-api-e2e-main is stuck in pending.
What did you expect to happen?
Jobs should just be scheduled continuously and not get stuck in pending, blocking further runs.
Cluster API version
main (probably also on other branches)
Kubernetes version
No response
Anything else you would like to add?
Notes:
Mitigation:
Impact:
Debug ideas:
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.
/area testing