
Java jobs using ubuntu-latest get cancelled/skipped midway with no logs available #1491

Closed
jaikiran opened this issue Aug 26, 2020 · 9 comments
Assignees
maxim-lobanov
Labels
Area: Java · investigate (Collect additional information, like space on disk, other tool incompatibilities etc.) · OS: Ubuntu

Comments

@jaikiran

jaikiran commented Aug 26, 2020

The past few weeks, one of the projects that I contribute to (https://github.com/quarkusio/quarkus/) has been running into odd issues with GitHub Actions. The project uses GitHub Actions to trigger a workflow on PR creation, which internally triggers around 31 jobs. Most of these jobs succeed. However, 2-3 jobs have been exhibiting odd failures for the past few weeks. All our investigation so far has been almost completely blocked by the lack of logs from these jobs when they fail. The symptoms we have seen so far are the following:

  • The job starts running (it's expected to finish in around 2 hours or more and has been configured with a timeout of 4 hours)
  • After roughly an hour or more, the GitHub Actions UI in the PR shows the job as cancelled/skipped. At this point, if we go and look at the job, its logs are available (from the GitHub UI), but they don't contain any specific reason explaining why the job was skipped or failed (while it was still running tests). The message we have seen so far for such cancelled jobs is:
....
2020-08-24T10:35:46.2127997Z ##[error]The operation was canceled.
2020-08-24T10:35:47.0063961Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
  • The above happens for 2-3 jobs, and they show their status as cancelled/skipped
  • Once all the jobs in that workflow have completed, the status of the 2-3 jobs which were previously showing as cancelled/skipped ends up being shown as failed. At this point, if we go back to these jobs and check their logs, the logs have all disappeared and the GitHub UI for these jobs shows the following minimal log:
2020-08-24T16:45:37.8609235Z ##[section]Starting: Request a runner to run this job
2020-08-24T16:45:38.9997701Z Can't find any online and idle self-hosted runner in current repository that matches the required labels: 'ubuntu-latest'
2020-08-24T16:45:38.9997791Z Can't find any online and idle self-hosted runner in current repository's account/organization that matches the required labels: 'ubuntu-latest'
2020-08-24T16:45:38.9997999Z Found online and idle hosted runner in current repository's account/organization that matches the required labels: 'ubuntu-latest'
2020-08-24T16:45:39.3330338Z ##[section]Finishing: Request a runner to run this job

This issue has made it almost impossible for us to validate PRs for the past few weeks.

All these jobs are Java jobs (i.e. they run Java applications and tests) and all of them use "ubuntu-latest".
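For context, the failing jobs are declared roughly like the following (a minimal sketch with an illustrative job name and versions, not a copy of our actual workflow file):

  jvm-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 240        # the 4-hour timeout mentioned above
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v1
        with:
          java-version: 8
      - name: Run the JVM tests
        run: ./mvnw -B verify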

@maxim-lobanov, I see a similar issue reported for macOS runners here: #736, and I see that you asked users to report if they still see this issue. I didn't want to add these details there since this issue is about the ubuntu-latest images, so I created this one.

If you want to see a sample PR which exhibited this issue (one we went ahead and merged anyway), here's one: quarkusio/quarkus#11512 (I'm sure you know this, but if not: to check the jobs that ran and failed, click "View Details" at the end of the PR and select the "JDK Java8 JVM Tests" job for details and logs). If you look at the recent PRs in that repo, almost all are affected by this. We also have a discussion going on in our mailing list, if it helps set some context: https://groups.google.com/d/msg/quarkus-dev/yfbiBPjm6cM/G1Na5I0TBgAJ

@maxim-lobanov
Contributor

@jaikiran, thank you for the report. It is definitely something different from the macOS issue; I have never seen similar reports for Ubuntu before, and we have not changed the infrastructure in any way. We will take a look.

A few questions:

  1. One possible reason for such an issue is that the VM gets exhausted in terms of resources or internet connectivity. How big is this project, and could it cause something like that?
  2. How many days ago did you start to experience this issue? For debugging purposes, we can create a test branch from a commit where your builds were green and try to run builds on those changes, just to rule out code changes as a factor.

@jaikiran
Author

Hello @maxim-lobanov,

One possible reason for such an issue is that the VM gets exhausted in terms of resources or internet connectivity. How big is this project, and could it cause something like that?

It's (relatively) big and it does things like the following:

  • One job within the workflow does an initial build which generates Maven artifacts (mostly jar files) locally and uses the upload-artifact action to upload a tar file of these artifacts. The file size can be large (in the GBs), given how Maven (the Java build tool) works.

  • This uploaded artifact is then downloaded, using the download-artifact action, by each of these jobs as one of its initial steps (see the sketch after this list). This step takes very little time (a few minutes) and has never failed (nor been cancelled/skipped).

  • Once that download step is done, each of these jobs does various things which can be classified as follows:

    • Run Java applications as part of the tests
    • Start "services", which can be databases and various other such processes
    • Use Docker to pull images and start Docker containers
    • Spawn multiple (mostly Java) processes in some of these tests

  The jobs which are running into this issue can/will do all of these steps during their lifetime. So yes, they are resource intensive.
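To make the hand-off concrete, here is a rough sketch of the upload/download-artifact pattern described above (job names, file names and Maven flags are illustrative, not copied from our actual workflow; the setup-java step is omitted for brevity):

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Initial build that populates the local Maven repository
        run: ./mvnw -B install -DskipTests
      - name: Package the local Maven repository as a single tar file
        run: tar -czf maven-repo.tgz -C ~ .m2/repository
      - uses: actions/upload-artifact@v2
        with:
          name: maven-repo
          path: maven-repo.tgz

  jvm-tests:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/download-artifact@v2
        with:
          name: maven-repo
      - name: Unpack the Maven repository produced by the build job
        run: tar -xzf maven-repo.tgz -C ~
      - name: Run the tests
        run: ./mvnw -B verify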

How many days ago did you start to experience this issue? For debugging purposes, we can create a test branch from a commit where your builds were green and try to run builds on those changes, just to rule out code changes as a factor.

We started observing this at least 3 weeks back (around 22 to 23 days ago, somewhere around the 2nd of August), I think. @gsmet, who keeps a more detailed eye on these PRs in that project, can confirm this or correct me if that isn't an accurate timeline.
I don't want to jump to conclusions, but my very basic investigation over the past few days has given me the impression that this started happening very regularly with the 202008xx.x series of the images:

2020-08-06T14:50:50.9772796Z ##[endgroup]
2020-08-06T14:50:50.9773000Z ##[group]Virtual Environment
2020-08-06T14:50:50.9773294Z Environment: ubuntu-18.04
2020-08-06T14:50:50.9773582Z Version: 20200802.1
2020-08-06T14:50:50.9773951Z Included Software: https://github.com/actions/virtual-environments/blob/ubuntu18/20200802.1/images/linux/Ubuntu1804-README.md
2020-08-06T14:50:50.9774273Z ##[endgroup]

From what I can gather, we rarely saw this with the 202007xx.x series. Of course, there were a few instances where I can find this issue in that series too, so please don't consider this a definitive hint.

@jaikiran
Author

By the way, this is what the workflow file looks like: https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml (the one which triggers those 31-odd jobs), and the jobs that fail are defined here: https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml#L100 (that file has seen some minor changes over the past few days, but it runs into the issue in its current form too).

@andy-mishechkin added the investigate (Collect additional information, like space on disk, other tool incompatibilities etc.), OS: Ubuntu and Area: Java labels and removed the needs triage label on Aug 26, 2020
@jaikiran
Author

One of the project members just replied on the Quarkus dev mailing list [1]:

apparently we get out of both memory and swap. And that's our main issue.

So it looks like these jobs run into resource issues (in our case, memory and swap), get initially marked as cancelled/skipped, and finally error out. I think this now comes down to:

  • Would it be possible to report the underlying cause/resource-usage problems in the logs of these failed jobs? (An interim diagnostics step we could add ourselves is sketched below.)
  • Can the logs generated during the lifetime of the job stay intact instead of going missing?

[1] https://groups.google.com/d/msg/quarkus-dev/yfbiBPjm6cM/JJEMSX9pAAAJ
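Until better reporting exists on the runner side, one interim option we could try ourselves is a diagnostics step at the end of each job that runs even when an earlier step fails. A minimal sketch (the step name and the exact commands are just an assumption about what would be useful, not something GitHub provides out of the box):

      - name: Report resource usage
        if: always()                        # run even if an earlier step failed or was cancelled
        run: |
          free -h                           # memory and swap usage
          df -h                             # disk usage
          sudo dmesg | tail -n 100 || true  # recent kernel messages, e.g. OOM-killer activity

Of course, if the VM itself stops responding, even an always() step may never get a chance to run, which is why better reporting from the runner/backend side would still be valuable.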

@maxim-lobanov
Contributor

Such issues can't be reported any better for now. From the backend side, this case looks like the VM has stopped its heartbeat and stopped receiving any messages.

We are doing some internal work to improve the handling of such issues (freezing such machines, collecting all the necessary logs, etc.), but unfortunately I can't provide an ETA since this work is still in progress.

@maxim-lobanov self-assigned this on Aug 26, 2020
@jaikiran
Author

Thank you for looking into this, @maxim-lobanov.

We are doing some internal work to improve the handling of such issues (freezing such machines, collecting all the necessary logs, etc.), but unfortunately I can't provide an ETA since this work is still in progress.

That's alright, I'll keep an eye on this issue for progress. Given that we have narrowed this down to the root cause, we are taking some measures in the Quarkus project to try to prevent this in the meantime.

@maxim-lobanov
Contributor

I am closing this issue for now; we will announce separately when the work to improve the handling of exhausted machines is done.
Please let us know if you need any additional help with the investigation of the current issue.

@jaikiran
Author

jaikiran commented Sep 4, 2020

Please let us know if you need any additional help with the investigation of the current issue.

We managed to solve this issue by making changes to our workflow jobs and fixing a few other things, so we are no longer running into resource exhaustion on these VMs. Thank you for your help with this issue.
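For anyone else who lands here: the general shape of the fix is to keep the build's memory use within what the hosted runner provides (roughly 7 GB of RAM on the Linux runners at the time). A purely illustrative sketch of that kind of change (the variable and value below are an example of the approach, not necessarily the exact change we made):

    env:
      # Hypothetical cap on the Maven JVM heap so that the build, the forked test
      # JVMs and any containers together stay below the runner's available memory.
      MAVEN_OPTS: "-Xmx2g"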

@projectsal

projectsal commented Jun 10, 2022

I have run into the same problem. When I tried to build an app written in Kotlin, it also threw the error "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.", and it did not give a detailed log.
The job is expected to complete in about an hour and I set the timeout to 120 minutes, but after two minutes the Gradle build failed. The log just shows "The job is expected to complete in an hour and I set the timeout to 120 minutes".
Here is the link.
Could you help me?
