
Split slow ecosystems across multiple CI jobs#5568

Merged
jurre merged 6 commits into main from jurre/parallel-python-tests
Aug 23, 2022

Conversation

@jurre (Member) commented Aug 22, 2022

The python tests are too damn slow. This is something that we should
fix at the core, by making them faster, but until we do that, let's not
waste precious time waiting on those tests to pass, let's throw more
computers at the problem.

This splits the tests up across multiple CI jobs using the parallel_tests gem.

You'll notice the builds in this PR are still all fairly slow, but that's because the docker build step isn't being cached here: we've passed some extra arguments to that step. It will be cached once this PR is merged.

You'll also see that a few required checks no longer pass, because their names have changed. I'll update those once this PR is accepted.

There are a few other suites that take more than 5 minutes as well, so I've opted to split those up too.


This is a relatively crude approach, but it should result in stable
reruns. There's probably a way to make this cleaner by moving it into a
rake task or something, so we can reuse it for more ecosystems. I'll
leave that as an exercise for someone else who feels strongly about
this.
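The crude slicing described above can be sketched like this (hypothetical Ruby, not the actual PR diff; names are assumed): sort the file list so every node sees the same order, then keep every CI_NODE_TOTAL-th file starting at the node's own index, which makes reruns of a given node stable.

```ruby
# Hypothetical sketch of deterministic spec slicing across CI nodes.
# Sorting first guarantees every node computes the same ordering, so a
# rerun of node N always executes the same files.
def specs_for_node(files, node_total:, node_index:)
  files.sort.select.with_index { |_file, i| i % node_total == node_index }
end

files = %w[a_spec.rb b_spec.rb c_spec.rb d_spec.rb e_spec.rb]
p specs_for_node(files, node_total: 2, node_index: 0)
# => ["a_spec.rb", "c_spec.rb", "e_spec.rb"]
```

As the comments in the thread note, this gives stable reruns but makes no attempt to balance runtimes across nodes.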
@jurre jurre requested a review from a team as a code owner August 22, 2022 12:32
There's probably a better way to exclude ruby scripts, or maybe move it
to a rake task/different directory
@jurre jurre force-pushed the jurre/parallel-python-tests branch from 0e93dc3 to dcc484c on August 22, 2022 12:56
@jurre (Member, Author) commented Aug 22, 2022

Based on just glancing at the test speeds in the actions outputs of a few recent builds, I would recommend we do this for a few more suites:

  • bundler1: split into 2 jobs
  • bundler2: split into 2 jobs
  • composer: split into 2 jobs
  • go_modules: split into 3 jobs
  • hex: split into 3 jobs
  • npm_and_yarn: split into 5 jobs
  • pub: split into 2 jobs
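
Expressed in the workflow's matrix shape (a hypothetical sketch; suite names and paths are assumed to match the existing config), that recommendation would look roughly like:

```yaml
# Hypothetical sketch of the proposed splits (not the actual workflow file)
strategy:
  matrix:
    suite:
      - { path: go_modules,   name: go_modules,   ci_node_total: 3, ci_node_index: 0 }
      - { path: go_modules,   name: go_modules,   ci_node_total: 3, ci_node_index: 1 }
      - { path: go_modules,   name: go_modules,   ci_node_total: 3, ci_node_index: 2 }
      - { path: npm_and_yarn, name: npm_and_yarn, ci_node_total: 5, ci_node_index: 0 }
      # ...and similar entries for bundler1, bundler2, composer, hex, and pub
```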

At that point it might be worth looking into whether we can somehow reuse the docker images across all those suites instead of building from scratch for each one. Previous experimentation showed that building for every job was slightly faster, because copying the image from storage was relatively slow, but that's a lot of wasted compute, and it might have gotten faster since.

Not going to do it in this PR though, as this seems like a good improvement on its own and I don't have more spare cycles to spend on this right now.

@mattt (Contributor) commented Aug 22, 2022

Nice work, @jurre! Adding some links for posterity:

@landongrindheim (Contributor)

I'm seeing some pretty big variation between what's being run. I wonder if we could sort the tests being run into buckets before shuffling (though I understand that introduces some brittleness), so that we could make sure they're somewhat evenly distributed?

Test counts per suite from one run (screenshots omitted):

  • Suite 0: 7 tests run
  • Suite 1: 0 tests run
  • Suite 2: 4 tests run
  • Suite 3: 12 tests run
  • Suite 4: 0 tests run

@jurre (Member, Author) commented Aug 22, 2022

@landongrindheim I can't think of a way to do it easily; we'd have to keep track of every file's average runtime somewhere, or manually tag the files, which is annoying and very easy to forget when adding new files.

In some runs I've seen the distribution be better, but it is indeed unpredictable with this approach.

The python_slow suite should already pull some excessively slow tests out into its own suite; we can re-evaluate that and see if we can move more things there.

But with all those approaches you keep running into the fact that some files are just much slower than others. You could track every test individually and try to match them up as well as possible, but that seems very involved.

There is a service that offers something like this: https://knapsackpro.com/, but I'm not sure if we can get it approved or if it's even worth it.

If you have an idea on how to do it, I'd suggest we put it on a backlog somewhere to pick up relatively soon, but move forward with this approach for now. Let's not make perfect the enemy of better here.

@deivid-rodriguez (Contributor)

The parallel_tests gem also has that feature of balancing test times. But there are the blockers to using this gem linked above, and, as @jurre mentioned, it's tricky to implement even though the balancing itself is built in, since it requires keeping the file of recorded runtimes in sync. I guess it'd need to be kept as a build artifact or something: uploaded after runs on pushes to main, and downloaded before running tests on PRs.
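One way to approximate that wiring is with `actions/cache` instead of artifacts (a hypothetical sketch; the cache key scheme is an assumption, while `tmp/parallel_runtime_rspec.log` is the gem's default runtime log path):

```yaml
# Hypothetical sketch: persist parallel_tests' recorded runtimes between runs
- name: Restore recorded test runtimes
  uses: actions/cache@v3
  with:
    path: tmp/parallel_runtime_rspec.log
    key: parallel-runtime-${{ github.sha }}
    restore-keys: |
      parallel-runtime-
```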

@deivid-rodriguez (Contributor)

Something else I'm thinking, related to @landongrindheim's comment, is: do we need the shuffling at all? We cannot yet balance this properly, but maybe we can at least make test times stable by always running the same files on each runner?

@landongrindheim (Contributor) left a comment

Nice work on this @jurre!

I think the conversation that's coming out of this change is definitely worth having, but I'd be disappointed if we hesitated on shipping this improvement because we hoped we could do better.

Let's keep talking about further improvements while benefiting from this one :shipit:

@jurre (Member, Author) commented Aug 22, 2022

The parallel_tests gem also has that feature of balancing test times. But there are the blockers to using this gem linked above, and, as @jurre mentioned, it's tricky to implement even though the balancing itself is built in, since it requires keeping the file of recorded runtimes in sync. I guess it'd need to be kept as a build artifact or something: uploaded after runs on pushes to main, and downloaded before running tests on PRs.

Yeah, and with parallel_tests we're also limited by the number of cores available in the Actions VMs (which is 2, I think), so there are limits to what you can achieve with that approach.

Another thing I hope to look at soon is figuring out how much time in those tests is spent installing a new version of python, listing those versions out somewhere, and preinstalling them in the CI image. That way we can cache them across many runs. There must be more things like that where we can improve the actual runtime of the tests themselves, which seems like the ultimate fix, but it's much more time consuming.
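As a sketch of that preinstallation idea (a hypothetical Dockerfile fragment; the version numbers and the use of pyenv are assumptions, not taken from the repo):

```dockerfile
# Hypothetical: bake the Python versions the specs commonly install into the
# CI image, so test runs don't have to build them from source every time
FROM dependabot/dependabot-core
RUN pyenv install 3.10.6 \
 && pyenv install 3.9.13
```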

@deivid-rodriguez (Contributor)

Let's keep talking about further improvements while benefiting from this one :shipit:

Absolutely, my suggestions are definitely not blockers for shipping this; it's an awesome improvement already!

@jurre (Member, Author) commented Aug 22, 2022

Something else I'm thinking, related to @landongrindheim's comment, is: do we need the shuffling at all? We cannot yet balance this properly, but maybe we can at least make test times stable by always running the same files on each runner?

Yeah we can try it and see if it happens to be a good distribution :D

@deivid-rodriguez (Contributor)

It's going to change over time anyway, but I think it's better to have something stable (even if there's an unlucky machine that doesn't run anything, like the example copied above) than something unstable.

@deivid-rodriguez (Contributor)

Just for future reference, apparently parallel_tests also supports this exact use case: https://github.com/grosser/parallel_tests/wiki/Distributed-Parallel-Tests-on-CI-systems. By default it distributes tests by file size.
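The "distribute by file size" default mentioned above amounts to a greedy balancing pass. A rough sketch of that idea (hypothetical code, not the gem's actual implementation): assign the biggest files first, each to whichever group currently has the smallest total.

```ruby
# Hypothetical sketch of greedy group-by-size distribution: largest files
# are placed first, always into the currently lightest group, so group
# totals end up roughly balanced.
def group_by_size(file_sizes, group_count)
  groups = Array.new(group_count) { { total: 0, files: [] } }
  file_sizes.sort_by { |_file, size| -size }.each do |file, size|
    smallest = groups.min_by { |g| g[:total] }
    smallest[:files] << file
    smallest[:total] += size
  end
  groups.map { |g| g[:files] }
end

sizes = { "npm_spec.rb" => 90, "yarn_spec.rb" => 50, "lockfile_spec.rb" => 40 }
p group_by_size(sizes, 2)
# => [["npm_spec.rb"], ["yarn_spec.rb", "lockfile_spec.rb"]]
```

File size is only a proxy for runtime, which is why the runtime-log balancing discussed earlier would still be the better signal if the sync problem were solved.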

@jurre (Member, Author) commented Aug 22, 2022

Just for future reference, apparently parallel_tests also supports this exact use case: https://github.com/grosser/parallel_tests/wiki/Distributed-Parallel-Tests-on-CI-systems. By default it distributes tests by file size.

Ah, that's an interesting option; we should try it. I think it would work out of the box.

@jurre jurre force-pushed the jurre/parallel-python-tests branch from 1e00893 to 081b854 on August 22, 2022 16:13
@deivid-rodriguez (Contributor) left a comment

I noticed that the required jobs are pending because all the required job names have now changed. I left an idea that I'm not sure works, but it would leave us with more readable job names and no need to change all the required jobs manually in the UI configuration.

@andrcuns (Contributor)

At that point it might be worth looking into whether we can somehow reuse the docker images across all those suites instead of building from scratch for each one. Previous experimentation showed that building for every job was slightly faster, because copying the image from storage was relatively slow, but that's a lot of wasted compute, and it might have gotten faster since.

@jurre just an FYI: when I was experimenting with arm image builds, I did just that, built the image only once and transferred it via GitHub artifacts. The image is pretty beefy; it takes roughly 6 minutes to upload the image tar to GH artifacts and something like 30 seconds to download it in the next job.

It is also possible to simply publish dev docker images to the registry. That will definitely be faster, since it allows reusing existing layers, but it will create a fair amount of dev-tag bloat, so some kind of cleanup will be required.
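For comparison, layer reuse without publishing dev tags could be sketched with buildx's GitHub Actions cache backend (a hypothetical workflow fragment; the tag name is an assumption):

```yaml
# Hypothetical sketch: reuse image layers across jobs via the GHA build cache,
# avoiding both tar-artifact transfers and registry dev-tag bloat
- uses: docker/setup-buildx-action@v2
- uses: docker/build-push-action@v3
  with:
    context: .
    push: false
    tags: dependabot/dependabot-core:dev
    cache-from: type=gha
    cache-to: type=gha,mode=max
```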

@jurre jurre force-pushed the jurre/parallel-python-tests branch 3 times, most recently from d94d9b6 to 8eb45b6 on August 22, 2022 17:41
@jurre jurre force-pushed the jurre/parallel-python-tests branch from 8eb45b6 to f13ba46 on August 22, 2022 18:15
@jurre (Member, Author) commented Aug 22, 2022

It seems a bit slower using parallel_tests, but maybe that's a coincidence.

Now that it's this easy, should I split up a couple other ecosystems as well?

@jurre jurre changed the title Split python tests across 5 workflows Split slow ecosystems across multiple CI jobs Aug 22, 2022
@jurre jurre force-pushed the jurre/parallel-python-tests branch from 048562f to 09e997e on August 22, 2022 19:40
@landongrindheim (Contributor) left a comment

Comment on lines +54 to +58
```yaml
- { path: python, name: python_slow, ci_node_total: 5, ci_node_index: 3 }
- { path: python, name: python_slow, ci_node_total: 5, ci_node_index: 4 }
- { path: pub, name: pub, ci_node_total: 2, ci_node_index: 0 }
- { path: pub, name: pub, ci_node_total: 2, ci_node_index: 1 }
- { path: terraform, name: terraform, ci_node_total: 1, ci_node_index: 0 }
```

Please forgive my YAML syntax ignorance 😇

Is it possible to use spaces inside these curly braces to align these arguments? Ex:

```yaml
- { path: pub,       name: pub,       ci_node_total: 2, ci_node_index: 1 }
- { path: terraform, name: terraform, ci_node_total: 1, ci_node_index: 0 }
```

It'd make it easier for me to read.

@jeffwidman (Member) commented Aug 22, 2022

The spaces should be perfectly fine, IIRC. However, in VS Code I tend to just right-click in YAML files and select "Format Document", and it auto-spaces things... whatever it spits out might be a sensible default?

Also, I filed:

As I've been bitten before by unexpected YAML issues.

@jeffwidman (Member) left a comment

Very clever!

```shell
# NOTE: Don't use `if` branches without an `else` part, since the code in
# some of them seems to not abort the script regardless of `set -e`

# Should we only run these on one of the CI_NODE_INDEX's?
```
A reviewer (Member):

Is this a future TODO that should be merged in, or does it need addressing/removing before merge?

@jurre (Member, Author):

I was treating it as a future TODO. It should be easy to tackle, but I don't really have more spare cycles to burn on this.
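As an aside, the `set -e` caveat quoted in the diff above can be reproduced with a minimal script (hypothetical repro, not from the PR): when a function runs as an `if` condition, bash suspends `errexit` inside it, so a failing command there does not abort.

```shell
#!/usr/bin/env bash
# Hypothetical repro: `set -e` is suspended inside a function used as an
# `if` condition, so the failing `false` does not stop the script.
set -e

check() {
  false                               # fails, but execution continues here
  echo "still ran despite set -e"
}

if check; then
  echo "check passed"                 # runs: check's exit status is the echo's 0
fi
echo "script reached the end"
```

This is why the script's comment warns about `if` branches: failures inside the condition path are silently swallowed rather than aborting under `set -e`.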

@jurre jurre merged commit 6243dce into main Aug 23, 2022
@jurre jurre deleted the jurre/parallel-python-tests branch August 23, 2022 06:51
@pavera pavera mentioned this pull request Aug 23, 2022
@ilyazub (Contributor) commented Apr 24, 2023

It seems a bit slower using parallel_tests, but maybe that's a coincidence.

CI specs run using a single core even when ci_node_total is greater than 1.

Examples:
