CI timing out during puppeteer browser download #1843

Closed
jameshadfield opened this issue Aug 29, 2024 · 7 comments · Fixed by #1884

@jameshadfield (Member) commented Aug 29, 2024

A typical CI run calls npm ci 18 times. While this mostly just works, npm ci stochastically takes a very long time (a couple of examples from the past week hit the 6-hour time limit for an action), which means stochastic CI failures. Reducing the number of npm ci calls is an obvious way to make this less likely to affect a run.

Instead of running independent jobs for build, unit-test, smoke-test, type-check, bundlesize, and check-lockfile, we could collect them into a single job. Across a matrix of 4 node versions this would reduce the npm ci calls from 18 to 4.

This has the downside of not showing each step in the LHS sidebar's "Jobs summary"; we'd just see 4 jobs named "test (node version)" or similar. Using if: always() for steps should allow them to run even if a preceding step failed whilst still giving them the red check mark; continue-on-error: true for the job may do the same.
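For illustration, a minimal sketch of what such a combined job could look like (the job, step, and script names here are hypothetical, not taken from the actual workflow):

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node: [16, 18, 20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci             # one install per node version instead of one per job
      - run: npm run build
      - run: npm run lint       # hypothetical script name
        if: always()            # run even if an earlier step failed; a failing step still marks the job red
      - run: npm test
        if: always()

Each failing step would still be visible inside the job's log, just not as a separate entry in the jobs sidebar.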

@joverlee521 (Contributor) commented:

Instead of restructuring the GH Action workflow, could setup-node's cache help reduce the time?
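(For reference, a minimal sketch of that option, assuming the workflow already uses actions/setup-node; the node-version expression is illustrative:)

- uses: actions/setup-node@v4
  with:
    node-version: ${{ matrix.node }}
    cache: npm                  # caches ~/.npm keyed on the lockfile, so package downloads are reused across runs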

@victorlin (Member) commented:

> Instead of restructuring the GH Action workflow, could setup-node's cache help reduce the time?

+1 for not restructuring the GH Action workflow yet (it is parallelized to reduce overall CI run time), but I'm not sure that adding a cache will fix the issue, which seems more like a timeout than just slowness. We should first investigate the root cause.

Here are some questions that came to mind; the following query and some internet searching provided some insight.

gh run list --workflow ci --status failure

STATUS  TITLE                   WORKFLOW  BRANCH             EVENT         ID           ELAPSED  AGE              
X       Merge pull request ...  ci        master             push          10623059679  6h0m26s  about 20 hours...
X       Remove custom LBI c...  ci        remove-lbi-col...  pull_request  10622362430  6h0m26s  about 21 hours...
X       Bump webpack from 5...  ci        dependabot/npm...  pull_request  10605318313  6h0m27s  about 1 day ago
X       CI: Fail on ESLint ...  ci        victorlin/ci-f...  pull_request  10604572248  6m54s    about 1 day ago
X       CI: Fail on ESLint ...  ci        victorlin/ci-f...  pull_request  10604224195  6m58s    about 1 day ago
X       Scatterplots work w...  ci        james/yyyy-mm-...  pull_request  10588576994  6h0m26s  about 2 days ago
X       update release docs     ci        james/release-...  pull_request  10517503671  6h0m29s  about 7 days ago
X       Limit x-axis by fil...  ci        measurements-x...  pull_request  10427440548  6h0m33s  about 13 days ago
X       Limit x-axis by fil...  ci        measurements-x...  pull_request  10427069236  6h0m27s  about 13 days ago
X       Limit x-axis by fil...  ci        measurements-x...  pull_request  10425983824  6h0m37s  about 13 days ago
X       Merge pull request ...  ci        master             push          10407900455  6h0m25s  about 15 days ago
X       Fix broken links [#...  ci        fix-broken-lin...  pull_request  10378824462  6h0m23s  about 16 days ago
X       Only support downlo...  ci        download-error     pull_request  10273500575  6h0m27s  about 23 days ago
X       Add support for Nod...  ci        node-22            pull_request  9385918467   3m10s    about 2 months...
X       Add support for Nod...  ci        node-22            pull_request  9385864734   2m18s    about 2 months...
X       Upgrade to React 18     ci        react-18           pull_request  7823841070   2m34s    about 6 months...
X       Upgrade to React 18     ci        react-18           pull_request  7823773392   2m26s    about 6 months...
X       Enable filtering fo...  ci        james/search-all   pull_request  7717066781   2m22s    about 7 months...
X       Improvements to upd...  ci        victorlin/upda...  pull_request  7186505267   3m13s    about 8 months...
X       Improvements to upd...  ci        victorlin/upda...  pull_request  7175624884   2m37s    about 8 months...

Is this specific to node=16? The two examples linked in the issue description both failed on node=16.

No. 10427440548 failed on node=18, 10605318313 failed on node=20, and 10517503671 failed on node=22.

When did this start happening?

The first occurrence was 23 days ago with 10273500575.

Can we figure out why npm ci is hanging stochastically?

I looked for existing issues and found these:

The current output is not very informative. I think the first thing to do is to run with npm ci --loglevel verbose. Hopefully with more output we can get a better sense of where it is hanging.
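(In workflow terms this would be a one-line change per install step; the step below is illustrative:)

- run: npm ci --loglevel verbose   # same install, with per-package fetch logging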

@victorlin (Member) commented:

I hadn't seen this in a while, but just noticed it again in https://github.com/nextstrain/auspice/actions/runs/11469100955/job/31915629485

@victorlin (Member) commented:

npm ci is hanging on these two jobs while releasing 2.59.1:

Seems more reproducible today, so I'll start investigating.

victorlin added a commit that referenced this issue Oct 23, 2024
Trying to catch and dissect the elusive fish that is npm ci's stochastic
timeout.¹

¹ #1843
@victorlin (Member) commented:

I've re-run this one several times with no luck. --loglevel verbose doesn't seem to add any significant time delay to the runs, so I'm thinking we can just add it everywhere in the meantime. If others are on board with this, I'll merge #1876.

victorlin self-assigned this Oct 24, 2024
@victorlin (Member) commented:

npm ci timed out again yesterday, this time with added verbose logs.

Here's the log:

npm http fetch GET 200 https://registry.npmjs.org/puppeteer 85ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/jest-puppeteer 54ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/pac-proxy-agent 34ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/proxy-agent 39ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/@puppeteer%2fbrowsers 60ms (cache miss)
<6 hours later>
##[error]The operation was canceled.

While that was running, another npm ci call on the same workflow run was successful in a parallel job. This makes for a good comparison. Here's the log:

npm http fetch GET 200 https://registry.npmjs.org/puppeteer 76ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/jest-puppeteer 43ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/pac-proxy-agent 25ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/proxy-agent 37ms (cache miss)
npm http fetch GET 200 https://registry.npmjs.org/@puppeteer%2fbrowsers 48ms (cache miss)
<5 minutes later>
npm info run [email protected] postinstall { code: 0, signal: null }

> [email protected] prepare
> npm run build

…

This looks similar to #1709 (comment), where the puppeteer browser download was timing out on Heroku.

It seems like we aren't actively using puppeteer: usage was removed in 2047696. I think properly getting rid of all references to it, including the dev dependency, will solve this issue. That shouldn't discourage us from bringing it back in the future, though. We're downloading a deprecated version of the package, so my wishful thinking is that newer versions won't come with this issue.
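(As an aside, if puppeteer stayed as a dev dependency, its browser download can usually be skipped in CI with an environment variable on the install step. This is only a sketch of a possible interim workaround, not what was adopted here, and whether the variable is honored depends on the installed puppeteer version; older releases used PUPPETEER_SKIP_CHROMIUM_DOWNLOAD instead.)

- run: npm ci --loglevel verbose
  env:
    PUPPETEER_SKIP_DOWNLOAD: "true"   # older puppeteer versions check PUPPETEER_SKIP_CHROMIUM_DOWNLOAD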

@victorlin (Member) commented:

I misspoke above. We are using jest-puppeteer for tests such as:

const resp = await page.goto(url, { waitUntil: 'networkidle2' });

I'll look into either replacing that or upgrading puppeteer.

victorlin changed the title from "Reduce npm ci calls in CI to reduce stochastic failures" to "CI timing out during puppeteer browser download" on Nov 6, 2024