@mhucka mhucka commented Dec 8, 2025

Thanks to the help of Google's ML Velocity team, the TensorFlow Quantum project has access to larger job runners. We can use them for the really time-consuming jobs in the CI build checks workflow, which are the library build and the tutorial tests.

In addition, this PR makes a few other small adjustments:

  • The tutorial-tests job does not need to wait for wheel-build to finish. (Maybe it did in the past?) Consequently, we can remove the dependency and let all 3 jobs run in parallel, which speeds up the overall workflow.

  • In two places where ./configure.sh was invoked, that step was immediately followed by a step to run ./scripts/build_pip_package_test.sh, which also runs ./configure.sh. We can take out the redundant invocations of configure.sh in this workflow.
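
To illustrate why dropping the dependency helps, here is a minimal Python sketch that models the workflow's wall-clock time as the longest chain of dependent jobs. It is only a model, not part of the workflow itself. The job name `build-tests` and all durations are placeholders assumed for illustration; they are not measurements from this PR.

```python
from functools import lru_cache


def workflow_wall_time(durations, needs):
    """Return the finish time of the slowest job, honoring `needs` edges.

    durations: job name -> run time in minutes
    needs:     job name -> tuple of jobs that must finish first
    """
    @lru_cache(maxsize=None)
    def finish(job):
        # A job starts once every job it depends on has finished.
        start = max((finish(dep) for dep in needs.get(job, ())), default=0)
        return start + durations[job]

    return max(finish(job) for job in durations)


# Placeholder durations in minutes (assumed, NOT taken from this PR).
durations = {"build-tests": 7, "wheel-build": 5, "tutorial-tests": 15}

# With the old dependency, tutorial-tests waits for wheel-build.
with_needs = workflow_wall_time(durations, {"tutorial-tests": ("wheel-build",)})

# Without it, all 3 jobs start together and the slowest job sets the total.
without_needs = workflow_wall_time(durations, {})

print(with_needs, without_needs)
```

In this model, removing a `needs:` edge can only shorten or preserve the total time, and the saving is largest when the dependent job is also the slowest one.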

Note: I didn't change the wheel-build job to use the new runners because it is not the bottleneck; the most time-consuming job in this workflow is the tutorial tests. The larger runners are more expensive to run (in $ per minute), so if we can't benefit from them, it doesn't make sense to use them.

Here is an example of the change in workflow run times. First, a sample of what they were before the changes:

image

And now with the workflow changes:

image

A typical run of the build tests has gone from ~22 min to ~7 min (approximately 1/3 of what it used to be); that speedup is due to the use of the new ML team runners. The overall time has gone down from ~24 min to ~16 min, or about 2/3 of what it used to be. The bottleneck is the tutorial tests, whose run time has barely improved because the tutorial test script does not take advantage of parallelism. (Something to be improved in the future.)
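
As a quick check of the fractions quoted above, the following small Python sketch recomputes them from the approximate timings given in this description. The function name `ratio` is just an illustrative helper.

```python
def ratio(before_min, after_min):
    """Fraction of the old run time that the new run time represents."""
    return after_min / before_min


# Approximate timings (minutes) quoted in the description above.
build_ratio = ratio(22, 7)     # build tests: ~22 min -> ~7 min
overall_ratio = ratio(24, 16)  # whole workflow: ~24 min -> ~16 min

print(f"build tests: {build_ratio:.2f} of the old time")
print(f"overall:     {overall_ratio:.2f} of the old time")
```

The results, roughly 0.32 and 0.67, match the "about 1/3" and "about 2/3" figures.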

In two places where `./configure.sh` was invoked, the step was
immediately followed by running `./scripts/build_pip_package_test.sh`,
which also runs `./configure.sh`. The redundant invocations have been
removed.

Maybe there was a dependency between `tutorial-tests` and `wheel-build`
in the past, but there isn't one now. Removing the `needs:` property
lets all 3 jobs run in parallel, speeding up the whole CI checks
workflow.

Thanks to the help of Google's ML Velocity team, the TensorFlow Quantum
project has access to larger job runners. We can use them for the really
time-consuming jobs in our workflows.

It took a fair amount of trial-and-error testing to resolve some odd
differences in the runner environments, but eventually I got it down to
just a couple of additional commands.

Note: I didn't change the `wheel-build` job to use the new runners
because the most time-consuming job in this workflow is the tutorial
tests, so the `wheel-build` job is not the bottleneck. The larger
runners are more expensive, so if we can't benefit from them, it
doesn't make sense to use them.
@mhucka mhucka added the area/devops Involves build systems, Make files, Bazel files, continuous integration, and/or other DevOps topics label Dec 8, 2025
@mhucka mhucka marked this pull request as ready for review December 8, 2025 05:36
@mhucka mhucka enabled auto-merge (squash) December 8, 2025 18:00
@mhucka mhucka merged commit ca6e113 into tensorflow:master Dec 8, 2025
10 checks passed
@mhucka mhucka deleted the mh-use-ml-runners branch December 8, 2025 18:29