
Java jobs using ubuntu-latest get cancelled/skipped midway with no logs available #1491

Closed
jaikiran opened this issue Aug 26, 2020 · 9 comments
Assignees
maxim-lobanov
Labels
Area: Java · investigate (Collect additional information, like space on disk, other tool incompatibilities etc.) · OS: Ubuntu

Comments

@jaikiran

jaikiran commented Aug 26, 2020

The past few weeks, one of the projects that I contribute to (https://github.com/quarkusio/quarkus/) has been running into odd issues with GitHub Actions. The project uses GitHub Actions to trigger a workflow on PR creation, which internally triggers around 31 jobs. Most of these jobs succeed. However, 2-3 jobs have been exhibiting odd failures for the past few weeks. All our investigation so far has been almost completely blocked by the lack of logs from these jobs when they fail. The symptoms we have seen so far are the following:

  • The job starts running (it's expected to finish in around 2 hours or more and has been configured with a timeout of 4 hours)
  • After roughly an hour or more, the GitHub Actions UI in the PR shows the job as cancelled/skipped. At this point, if we go and look at the job, its logs are available (from the GitHub UI), but they don't contain any specific reason explaining why the job was skipped or failed (while it was still running tests). The message we have seen so far for such cancelled jobs is:
....
2020-08-24T10:35:46.2127997Z ##[error]The operation was canceled.
2020-08-24T10:35:47.0063961Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
  • The above happens for 2-3 jobs, and they show their status as cancelled/skipped
  • Once all the jobs in that workflow have completed, the status of the 2-3 jobs which were previously showing as cancelled/skipped ends up being shown as failed. At this point, if we go back to these jobs and check their logs, the logs have all disappeared and the GitHub UI for these jobs shows the following minimal log:
2020-08-24T16:45:37.8609235Z ##[section]Starting: Request a runner to run this job
2020-08-24T16:45:38.9997701Z Can't find any online and idle self-hosted runner in current repository that matches the required labels: 'ubuntu-latest'
2020-08-24T16:45:38.9997791Z Can't find any online and idle self-hosted runner in current repository's account/organization that matches the required labels: 'ubuntu-latest'
2020-08-24T16:45:38.9997999Z Found online and idle hosted runner in current repository's account/organization that matches the required labels: 'ubuntu-latest'
2020-08-24T16:45:39.3330338Z ##[section]Finishing: Request a runner to run this job

This issue has made it almost impossible for us to validate PRs for the past few weeks.

All these jobs are Java jobs (i.e. they run Java applications and tests) and all of them use "ubuntu-latest".
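For context, the failing jobs are declared roughly like the following (a minimal sketch with an illustrative job name and versions, not a copy of our actual workflow file):

  jvm-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 240        # the 4-hour timeout mentioned above
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v1
        with:
          java-version: 8
      - name: Run the JVM tests
        run: ./mvnw -B verify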

@maxim-lobanov, I see a similar issue reported for macOS runners here: #736, and I see that you asked users to report if they still see this issue. I didn't want to add these details there since this issue is about the ubuntu-latest images, so I created this one.

If you want to see a sample PR which exhibited this issue (one we went ahead and merged anyway), here's one: quarkusio/quarkus#11512 (I'm sure you know this, but if not: to check the jobs that ran and failed, click "View Details" at the end of the PR and select the "JDK Java8 JVM Tests" job for details and logs). If you look at the recent PRs in that repo, almost all are affected by this. We also have a discussion going on in our mailing list, if it helps set some context: https://groups.google.com/d/msg/quarkus-dev/yfbiBPjm6cM/G1Na5I0TBgAJ

@maxim-lobanov
Contributor

@jaikiran, thank you for the report. It is definitely something different from the macOS issue; I have never seen similar reports for Ubuntu before, and we have not changed the infrastructure in any way. We will take a look.

A few questions:

  1. One possible reason for such an issue is that the VM gets exhausted in terms of resources or internet connectivity. How big is this project, and could it cause something like that?
  2. How many days ago did you start to experience this issue? For debugging purposes, we can create a test branch from a commit where your builds were green and try to run builds on those changes, just to rule out code changes as a factor.

@jaikiran
Author

Hello @maxim-lobanov,

One possible reason for such an issue is that the VM gets exhausted in terms of resources or internet connectivity. How big is this project, and could it cause something like that?

It's (relatively) big and it does things like the following:

  • One job within the workflow does an initial build which generates Maven artifacts (mostly jar files) locally and uses the upload-artifact action to upload a tar file of these artifacts. The file size can be large (in the GBs), given how Maven (the Java build tool) works.

  • This uploaded artifact is then downloaded, using the download-artifact action, by each of these jobs as one of its initial steps (see the sketch after this list). This step takes very little time (a few minutes) and has never failed (nor been cancelled/skipped).

  • Once that download step is done, each of these jobs does various things which can be classified as follows:

    • Run Java applications as part of the tests
    • Start "services", which can be databases and various other such processes
    • Use Docker to pull images and start Docker containers
    • Spawn multiple (mostly Java) processes in some of these tests

  The jobs which are running into this issue can/will do all of these steps during their lifetime. So yes, they are resource intensive.
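To make the hand-off concrete, here is a rough sketch of the upload/download-artifact pattern described above (job names, file names and Maven flags are illustrative, not copied from our actual workflow; the setup-java step is omitted for brevity):

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Initial build that populates the local Maven repository
        run: ./mvnw -B install -DskipTests
      - name: Package the local Maven repository as a single tar file
        run: tar -czf maven-repo.tgz -C ~ .m2/repository
      - uses: actions/upload-artifact@v2
        with:
          name: maven-repo
          path: maven-repo.tgz

  jvm-tests:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/download-artifact@v2
        with:
          name: maven-repo
      - name: Unpack the Maven repository produced by the build job
        run: tar -xzf maven-repo.tgz -C ~
      - name: Run the tests
        run: ./mvnw -B verify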

How many days ago did you start to experience this issue? For debugging purposes, we can create a test branch from a commit where your builds were green and try to run builds on those changes, just to rule out code changes as a factor.

We started observing this at least 3 weeks back (around 22 to 23 days ago, somewhere around the 2nd of August), I think. @gsmet, who keeps a more detailed eye on these PRs in that project, can confirm this or correct me if that isn't an accurate timeline.
I don't want to jump to conclusions, but my very basic investigation over the past few days has given me the impression that this started happening very regularly with the 202008xx.x series of the images:

2020-08-06T14:50:50.9772796Z ##[endgroup]
2020-08-06T14:50:50.9773000Z ##[group]Virtual Environment
2020-08-06T14:50:50.9773294Z Environment: ubuntu-18.04
2020-08-06T14:50:50.9773582Z Version: 20200802.1
2020-08-06T14:50:50.9773951Z Included Software: https://github.com/actions/virtual-environments/blob/ubuntu18/20200802.1/images/linux/Ubuntu1804-README.md
2020-08-06T14:50:50.9774273Z ##[endgroup]

From what I can gather, we rarely saw this with the 202007xx.x series. Of course, there were a few instances where I can find this issue in that series too, so please don't consider this a definitive hint.

@jaikiran
Author

By the way, this is what the workflow file looks like: https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml (the one which triggers those 31-odd jobs), and the jobs that fail are defined here: https://github.com/quarkusio/quarkus/blob/master/.github/workflows/ci-actions.yml#L100 (that file has seen some minor changes over the past few days, but it runs into the issue in its current form too).

@andy-mishechkin added the investigate (Collect additional information, like space on disk, other tool incompatibilities etc.), OS: Ubuntu and Area: Java labels and removed the needs triage label on Aug 26, 2020
@jaikiran
Author

One of the project members just replied on the Quarkus dev mailing list [1]:

apparently we get out of both memory and swap. And that's our main issue.

So it looks like these jobs run into resource issues (in our case, memory and swap), get initially marked as cancelled/skipped, and finally error out. I think this now comes down to:

  • Would it be possible to report the underlying cause/resource-usage problems in the logs of these failed jobs? (An interim diagnostics step we could add ourselves is sketched below.)
  • Can the logs generated during the lifetime of the job stay intact instead of going missing?

[1] https://groups.google.com/d/msg/quarkus-dev/yfbiBPjm6cM/JJEMSX9pAAAJ
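Until better reporting exists on the runner side, one interim option we could try ourselves is a diagnostics step at the end of each job that runs even when an earlier step fails. A minimal sketch (the step name and the exact commands are just an assumption about what would be useful, not something GitHub provides out of the box):

      - name: Report resource usage
        if: always()                        # run even if an earlier step failed or was cancelled
        run: |
          free -h                           # memory and swap usage
          df -h                             # disk usage
          sudo dmesg | tail -n 100 || true  # recent kernel messages, e.g. OOM-killer activity

Of course, if the VM itself stops responding, even an always() step may never get a chance to run, which is why better reporting from the runner/backend side would still be valuable.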

@maxim-lobanov
Contributor

Such issues can't be reported any better for now. From the backend side, this case looks like the VM has stopped its heartbeat and stopped receiving any messages.

We are doing some internal work to improve the handling of such issues (freezing such machines, collecting all the necessary logs, etc.), but unfortunately I can't provide an ETA since this work is still in progress.

@maxim-lobanov self-assigned this on Aug 26, 2020
@jaikiran
Author

Thank you for looking into this, @maxim-lobanov.

We are doing some internal work to improve the handling of such issues (freezing such machines, collecting all the necessary logs, etc.), but unfortunately I can't provide an ETA since this work is still in progress.

That's alright, I'll keep an eye on this issue for progress. Given that we have narrowed this down to the root cause, we are taking some measures in the Quarkus project to try to prevent this in the meantime.

@maxim-lobanov
Contributor

I am closing this issue for now; we will announce separately when the work to improve the handling of exhausted machines is done.
Please let us know if you need any additional help with the investigation of the current issue.

@jaikiran
Author

jaikiran commented Sep 4, 2020

Please let us know if you need any additional help with the investigation of the current issue.

We managed to solve this issue by making changes to our workflow jobs and fixing a few other things, so we are no longer running into resource exhaustion on these VMs. Thank you for your help with this issue.
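For anyone else who lands here: the general shape of the fix is to keep the build's memory use within what the hosted runner provides (roughly 7 GB of RAM on the Linux runners at the time). A purely illustrative sketch of that kind of change (the variable and value below are an example of the approach, not necessarily the exact change we made):

    env:
      # Hypothetical cap on the Maven JVM heap so that the build, the forked test
      # JVMs and any containers together stay below the runner's available memory.
      MAVEN_OPTS: "-Xmx2g"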

@projectsal

projectsal commented Jun 10, 2022

I have run into the same problem. When I tried to build an app written in Kotlin, it also threw the error "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.", and it did not give a detailed log.
The job is expected to complete in about an hour and I set the timeout to 120 minutes, but after two minutes the Gradle build failed. The log just shows "The job is expected to complete in an hour and I set the timeout to 120 minutes".
Here is the link.
Could you help me?
