Split slow ecosystems across multiple CI jobs by jurre · Pull Request #5568 · dependabot/dependabot-core

jurre · 2022-08-22T12:32:10Z

The python tests are too damn slow. This is something that we should
fix at the core, by making them faster, but until we do that, let's not
waste precious time waiting on those tests to pass, let's throw more
computers at the problem.

This splits the tests up across multiple CI jobs using the parallel_test gem.

You'll notice the builds are still all fairly slow in this PR, but that's because the docker build step isn't being cached here, because we've passed some extra arguments to that step. It will be once this PR is merged.

You'll also see a few required checks now no longer pass, because their names have been changed. I'll update those once this PR is accepted.

There's a few other suites that take more than 5 minutes as well, so I've opted to also split those up.

The python tests are too damn slow. This is something that we _should_ fix at the core, by making them faster, but until we do that, let's not waste precious time waiting on those tests to pass, let's throw more computers at the problem. This is a relatively crude approach, but it should result in stable reruns. There's probably a way to make this cleaner by moving it into a rake task or something, so we can reuse it for more ecosystems. I'll leave that as an exercise for someone else that feels strongly about this.

There's probably a better way to exclude ruby scripts, or maybe move it to a rake task/different directory

jurre · 2022-08-22T13:22:00Z

Based on just glancing at the test speeds in the actions outputs of a few recent builds, I would recommend we do this for a few more suites:

At that point it might be worth looking if we can somehow reuse the docker images across all those suites instead of building from scratch for each one, but previous experimentation has shown that building for every job was slightly faster because copying the image from storage was relatively slow. It's a lot of wasted compute though, and might have gotten faster?

~~Not going to do it in this PR though, as this seems like a good improvement on it's own and I don't have more spare cycles to spend on this right now.~~

python/script/parallel-test

mattt · 2022-08-22T13:35:29Z

Nice work, @jurre! Adding some links for posterity:

Python: Move some more tests to the slow suite #4564
Attempt to run tests in parallel #4627
SharedHelpers#with_git_configured is non-thread safe #4909: Why we can't run tests in parallel automatically with parallel_rspec (if we fix this, we might be able to keep 1:1 between ecosystems and CI matrix)

landongrindheim · 2022-08-22T13:38:17Z

I'm seeing some pretty big variation between what's being run. I wonder if we could sort the tests being run into buckets before shuffling (though I understand that introduces some brittleness), so that we could make sure they're somewhat evenly distributed?

Suite 0

Tests run: 7

Suite 1

Tests run: 0

Suite 2

Tests run: 4

Suite 3

Tests run: 12

Suite 4

Tests run: 0

jurre · 2022-08-22T13:43:20Z

@landongrindheim I can't think of a way to do it easily, we'd have to keep track of every files average runtime somewhere or manually tag the files which is annoying and very easy to forget when adding new files.

In some runs I've seen the distribution be better, but it is indeed unpredictable with this approach.

The python_slow suite should already pull out some excessively slow tests into it's own suite, we can re-evaluate that and see if we can move things there.

But with all those approaches you always keep running into the fact that some files are just much slower than others, and you could keep track of every test individually and try to match them up as best as possible etc, but it seems very involved.

There is a service that offers something like this: https://knapsackpro.com/, but I'm not sure if we can get it approved or if it's even worth it.

If you have an idea on how to do it, I'd suggest we put it on a backlog somewhere to pick up relatively soon but move forward with this approach for now. Let's not make perfect the enemy of better here

deivid-rodriguez · 2022-08-22T13:52:53Z

The parallel_tests gem also has that feature of balancing test times. But there's the blockers to use this gem linked above, and also as @jurre mentioned it's tricky to implement even if the balancing itself is built-in, since it indeed requires keeping the file of recorded runtimes in sync. I guess it'd need to be kept as a build artifact or something, upload it after runs on pushes to main, and download it before running tests on PRs.

deivid-rodriguez · 2022-08-22T13:57:35Z

Something else I'm thinking, related to @landongrindheim's comment, is: do we need the shuffling at all? We cannot yet balance this properly, but maybe we can at least make test times stable by always running the same files on each runner?

landongrindheim

Nice work on this @jurre!

I think the conversation that's coming out of this change is definitely worth having, but I'd be disappointed if we hesitated on shipping this improvement because we hoped we could do better.

Let's keep talking about further improvements while benefiting from this one

jurre · 2022-08-22T14:13:46Z

The parallel_tests gem also has that feature of balancing test times. But there's the blockers to use this gem linked above, and also as @jurre mentioned it's tricky to implement even if the balancing itself is built-in, since it indeed requires keeping the file of recorded runtimes in sync. I guess it'd need to be kept as a build artifact or something, upload it after runs on pushes to main, and download it before running tests on PRs.

Yeah, and with parallel_tests we're also limited by the number of cores available in the actions VM's (which is 2 I think), so there are limits to what you can achieve with that approach.

Another thing I hope to look at soon is figuring out how much time in those tests is spent installing a new version of python, listing those versions out somewhere and preinstalling them in the CI image. That way we can cache them in between many runs. There must be more things like that where we can improve the actual runtime of the tests themselves, which seems like that would be the ultimate fix, but it's much more time consuming.

deivid-rodriguez · 2022-08-22T14:16:24Z

Let's keep talking about further improvements while benefiting from this one

Absolutely, my suggestions are definitely not blockers for shipping this, it's an awesome improvement already!

jurre · 2022-08-22T14:20:21Z

Something else I'm thinking, related to @landongrindheim's comment, is: do we need the shuffling at all? We cannot yet balance this properly, but maybe we can at least make test times stable by always running the same files on each runner?

Yeah we can try it and see if it happens to be a good distribution :D

deivid-rodriguez · 2022-08-22T14:27:17Z

It's going to change over time anyways, but I think it's better to have something stable (even if there's an unlucky machine that doesn't run anything like the example copied above) than something unstable.

deivid-rodriguez · 2022-08-22T15:58:54Z

Just for future reference, apparently parallel_tests also supports this exact use case: https://github.com/grosser/parallel_tests/wiki/Distributed-Parallel-Tests-on-CI-systems. By default it distributes tests by file size.

jurre · 2022-08-22T16:00:55Z

Just for future reference, apparently parallel_tests also supports this exact use case: https://github.com/grosser/parallel_tests/wiki/Distributed-Parallel-Tests-on-CI-systems. By default it distributes tests by file size.

Ah that's an interesting option, we should try it, I think it should work out of the box

deivid-rodriguez

I noticed that required jobs are pending because all required job names are now changed. I left an idea that I'm not sure it works, but would leave us with more readable job names, and without needing to change all required jobs in UI configuration manually.

.github/workflows/ci.yml

andrcuns · 2022-08-22T16:35:23Z

At that point it might be worth looking if we can somehow reuse the docker images across all those suites instead of building from scratch for each one, but previous experimentation has shown that building for every job was slightly faster because copying the image from storage was relatively slow. It's a lot of wasted compute though, and might have gotten faster?

@jurre just an FYI, when I was experimenting with arm image builds, I did just that, build image only once and transfer via github artifacts. The image is pretty beefy, it takes roughly 6 minutes to upload the image tar to gh artifacts and something like 30 seconds to download it in the next job.

It is also possible to simply publish dev docker images in to the registry, it will definitely be faster since it will allow reusing existing layers but will create a fair amount of dev tags bloat. Some kind of cleanup setup will be required.

python/script/ci-test

jurre · 2022-08-22T18:44:50Z

It seems a bit slower using parallel_test, but maybe that's a coincidence.

Now that it's this easy, should I split up a couple other ecosystems as well?

landongrindheim

⚡

landongrindheim · 2022-08-22T20:08:48Z

.github/workflows/ci.yml

+          - { path: python, name: python_slow, ci_node_total: 5, ci_node_index: 3 }
+          - { path: python, name: python_slow, ci_node_total: 5, ci_node_index: 4 }
+          - { path: pub, name: pub, ci_node_total: 2, ci_node_index: 0 }
+          - { path: pub, name: pub, ci_node_total: 2, ci_node_index: 1 }
+          - { path: terraform, name: terraform, ci_node_total: 1, ci_node_index: 0 }


Please forgive my YAML syntax ignorance 😇

Is it possible to use space inside these curly braces to align these arguments? Ex:

- { path: pub, name: pub, ci_node_total: 2, ci_node_index: 1 } - { path: terraform, name: terraform, ci_node_total: 1, ci_node_index: 0 }

It'd make it easier for me to read.

The spaces should be perfectly fine IIRC. However, in VS Code, I tend to just right-click in YAML files and select "format document" and it auto-spaces things... whatever it spits out might be a sensible default?

Also, I filed:

Run yamllint on PR's #5572

As I've been bit before by unexpected yaml issues.

jeffwidman

Very clever!

jeffwidman · 2022-08-23T03:35:58Z

bundler/script/ci-test

 # NOTE: Don't use `if` branches without `else` part, since the code in some of
 # then seems to not abort the script regardless of `set -e`

+# Should we only run these on one of the CI_NODE_INDEX's?


Is this a future TODO that should be merged in or does it need addressing/removing before merge?

I was treating it as a future TODO. Should be easy to tackle, but I don't really have more spare cycles to burn on this

ilyazub · 2023-04-24T16:37:53Z

It seems a bit slower using parallel_test, but maybe that's a coincidence.

CI specs run using a single core even when ci_node_total is greater than 1.

Examples:

1 process for 8 specs, ~ 8 specs per process: https://github.com/dependabot/dependabot-core/actions/runs/4661988395/jobs/8251872425
1 process for 16 specs, ~ 16 specs per process: https://github.com/dependabot/dependabot-core/actions/runs/4751708654/jobs/8441161997
1 process for 16 specs, ~ 16 specs per process: https://github.com/dependabot/dependabot-core/actions/runs/4669826594/jobs/8268798949#step:5:168

jurre requested a review from a team as a code owner August 22, 2022 12:32

Only ShellCheck ci-test script

dcc484c

There's probably a better way to exclude ruby scripts, or maybe move it to a rake task/different directory

jurre force-pushed the jurre/parallel-python-tests branch from 0e93dc3 to dcc484c Compare August 22, 2022 12:56

jurre commented Aug 22, 2022

View reviewed changes

python/script/parallel-test Outdated Show resolved Hide resolved

landongrindheim approved these changes Aug 22, 2022

View reviewed changes

Try not shuffling tests for consistent distribution

e8771de

Profile 10 slowest tests of each python run

aab5cc9

jurre force-pushed the jurre/parallel-python-tests branch from 1e00893 to 081b854 Compare August 22, 2022 16:13

deivid-rodriguez reviewed Aug 22, 2022

View reviewed changes

.github/workflows/ci.yml Outdated Show resolved Hide resolved

.github/workflows/ci.yml Outdated Show resolved Hide resolved

jurre force-pushed the jurre/parallel-python-tests branch 3 times, most recently from d94d9b6 to 8eb45b6 Compare August 22, 2022 17:41

Use parallel_tests gem instead of hacky script

f13ba46

deivid-rodriguez reviewed Aug 22, 2022

View reviewed changes

python/script/ci-test Outdated Show resolved Hide resolved

jurre force-pushed the jurre/parallel-python-tests branch from 8eb45b6 to f13ba46 Compare August 22, 2022 18:15

jurre changed the title ~~Split python tests across 5 workflows~~ Split slow ecosystems across multiple CI jobs Aug 22, 2022

Parallelize more things

09e997e

jurre force-pushed the jurre/parallel-python-tests branch from 048562f to 09e997e Compare August 22, 2022 19:40

landongrindheim approved these changes Aug 22, 2022

View reviewed changes

jeffwidman approved these changes Aug 23, 2022

View reviewed changes

jurre merged commit 6243dce into main Aug 23, 2022

jurre deleted the jurre/parallel-python-tests branch August 23, 2022 06:51

pavera mentioned this pull request Aug 23, 2022

v0.210.0 #5574

Merged

Conversation

jurre commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jurre commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

mattt commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

landongrindheim commented Aug 22, 2022

Suite 0

Suite 1

Suite 2

Suite 3

Suite 4

Uh oh!

jurre commented Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

deivid-rodriguez commented Aug 22, 2022

Uh oh!

deivid-rodriguez commented Aug 22, 2022

Uh oh!

landongrindheim left a comment

Choose a reason for hiding this comment

Uh oh!

jurre commented Aug 22, 2022

Uh oh!

deivid-rodriguez commented Aug 22, 2022

Uh oh!

jurre commented Aug 22, 2022

Uh oh!

deivid-rodriguez commented Aug 22, 2022

Uh oh!

deivid-rodriguez commented Aug 22, 2022

Uh oh!

jurre commented Aug 22, 2022

Uh oh!

deivid-rodriguez left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

andrcuns commented Aug 22, 2022

Uh oh!

Uh oh!

jurre commented Aug 22, 2022

Uh oh!

landongrindheim left a comment

Choose a reason for hiding this comment

Uh oh!

landongrindheim Aug 22, 2022

Choose a reason for hiding this comment

Uh oh!

jeffwidman Aug 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeffwidman left a comment

Choose a reason for hiding this comment

Uh oh!

jeffwidman Aug 23, 2022

Choose a reason for hiding this comment

Uh oh!

jurre Aug 23, 2022

Choose a reason for hiding this comment

Uh oh!

ilyazub commented Apr 24, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

jurre commented Aug 22, 2022 •

edited

Loading

jurre commented Aug 22, 2022 •

edited

Loading

mattt commented Aug 22, 2022 •

edited

Loading

jurre commented Aug 22, 2022 •

edited

Loading

jeffwidman Aug 22, 2022 •

edited

Loading