ci: reduce integration workflow run time #1313
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff
| | main | #1313 | +/- |
|---|---|---|---|
| Coverage | 19.37% | 43.51% | +24.13% |
| Files | 240 | 305 | +65 |
| Lines | 28045 | 32864 | +4819 |
| Hits | 5434 | 14301 | +8867 |
| Misses | 21969 | 17643 | -4326 |
| Partials | 642 | 920 | +278 |
Pull request overview
This PR optimizes CI workflow execution time by implementing Go build caching and intelligent test distribution. The changes enable caching for integration test compilation (reducing re-run times by several minutes), add concurrency controls to cancel stale workflow runs, and introduce a weighted bin-packing algorithm to better balance test shards based on actual test execution times.
Changes:
- Enabled Go module and build caching for integration tests with security annotations
- Implemented LPT bin-packing algorithm for test shard distribution using historical test durations
- Added concurrency controls across all workflows to cancel outdated runs during PR iterations
- Removed path filters from push events on main/release branches to ensure cache is always built
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/update-test-weights.sh | New script to extract test durations from CI logs and generate weights file |
| scripts/integration-test-weights.generated.json | Generated JSON file containing test duration weights for 84 integration tests |
| scripts/generate-integration-matrix.sh | Updated matrix generation to use weighted bin-packing instead of random distribution |
| .github/workflows/pull_request_integration_tests.yml | Enabled Go module cache and added explicit build cache with security annotations |
| .github/workflows/workflow_integration_tests_vm.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_oats_test.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_k8s_integration_tests.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_integration_tests_arm.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_docker_build_test.yml | Added concurrency control |
| .github/workflows/pull_request.yml | Added concurrency control |
| .github/workflows/markdown-fail-fast.yml | Added branches filter to push events, removed paths filter, added concurrency control |
| .github/workflows/lint_darwin.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/java-agent.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/clang-tidy-check.yml | Added concurrency control |
| .github/workflows/clang-format-check.yml | Added concurrency control |
| .github/workflows/check_gh_actions_security.yml | Added concurrency control |
| .gitattributes | Marked integration-test-weights.generated.json as linguist-generated |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I've been looking at improving the wall clock time of CI (and easing the annoyance of having to re-run failed shards). Compilation seems to be a common bottleneck (~5 mins, based on the gap between downloads completing and tests starting). I initially considered adding a sequential pre-compilation of the test binary, but this would not actually reduce wall clock time, because every shard would have to wait for that compilation, which the shards currently do in parallel.
This PR enables Go build caching for integration test shards to meaningfully reduce CI wall clock time by speeding up compilation on cache-hit runs:
- [...] `main`, PRs can inherit the `main` cache, either completely or incrementally depending on the PR changes

Note
I removed the `paths` trigger restrictions from the `main` and `release-*` branches, so that relevant workflows ALWAYS run on these branches. I also added `concurrency` to cancel stale workflow runs when new commits are pushed to a branch, to help reduce the total number of runners during PR iterations (tested during multiple commits on this PR, you can see some cancelled workflows on previous commits).

The cache was previously disabled, presumably due to security concerns, which I've addressed below:
Security summary
- [...] `refs/pull/.../merge` and cannot be read by `main` or other PRs (GitHub docs)
- [...] `~/.cache/go-build` via `actions/cache`)
- [...] `go help cache`)
- `~/.cache/go-build` contains only compiled `.a` files; `~/go/pkg/mod` contains only public module source code. No secrets, credentials, or tokens are present in either path
- [...] `runner.os` and `runner.arch` to prevent x86_64/ARM64 mix-ups

Performance impact
Wall clock impact
I wasn't happy with the above: the slowest randomly-assigned shard can still have a huge impact on the wall clock time. To address this, I added a new commit, df37be7, which uses the individual test run time estimates to better distribute the tests amongst the shards.
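To illustrate the weighted distribution, here is a minimal Go sketch of the LPT (Longest Processing Time) heuristic named in the review overview: sort tests by estimated duration, longest first, and always hand the next test to the shard with the smallest running total. This is not the code in `scripts/generate-integration-matrix.sh` (which is a shell script); the `Shard` type, the `AssignLPT` function, the test names, and the durations are all hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Shard is a hypothetical container for the tests assigned to one CI runner.
type Shard struct {
	Tests []string
	Total float64 // sum of estimated durations, in seconds
}

// AssignLPT distributes tests across n shards with the LPT heuristic:
// visit tests from longest to shortest and place each one on the shard
// whose current total is smallest.
func AssignLPT(weights map[string]float64, n int) []Shard {
	if n < 1 {
		return nil
	}
	names := make([]string, 0, len(weights))
	for name := range weights {
		names = append(names, name)
	}
	// Longest first; break ties by name so the result is deterministic.
	sort.Slice(names, func(i, j int) bool {
		wi, wj := weights[names[i]], weights[names[j]]
		if wi != wj {
			return wi > wj
		}
		return names[i] < names[j]
	})

	shards := make([]Shard, n)
	for _, name := range names {
		lightest := 0
		for i := 1; i < n; i++ {
			if shards[i].Total < shards[lightest].Total {
				lightest = i
			}
		}
		shards[lightest].Tests = append(shards[lightest].Tests, name)
		shards[lightest].Total += weights[name]
	}
	return shards
}

func main() {
	// Made-up durations in seconds, not taken from the generated weights file.
	weights := map[string]float64{
		"TestA": 300, "TestB": 200, "TestC": 120, "TestD": 100, "TestE": 80,
	}
	for i, s := range AssignLPT(weights, 2) {
		fmt.Printf("shard %d: %v (%.0fs)\n", i, s.Tests, s.Total)
	}
}
```

With these made-up numbers, both shards end up at 400 seconds. LPT is a greedy heuristic, so it does not guarantee an optimal split in general, but it only needs rough estimates to keep the slowest tests apart, which matches the goal stated above.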
We can't use the `go test -list` that other actions use to keep the test runtime history (because that executes `init` and `TestMain` functions that spin up containers etc), so I added a new script, `./scripts/update-test-weights.sh`, which can read the test run times from a manually downloaded CI log archive. We don't need exact times of every test; we just need to know which are the slowest tests and ensure they're distributed appropriately. So this only needs to be manually updated every so often (when shards appear to be unbalanced).

Example:
./scripts/update-test-weights.sh ~/Downloads/logs_57563497044

Incremental comparison with table above:
Bottom line: ~8 CI minutes saved and ~5 minutes faster by the wall clock overall.
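For readers curious how test durations can be recovered from logs, the following Go sketch extracts per-test times from `go test -v` style result lines such as `--- PASS: TestFoo (12.34s)`. It is an illustration only: the real `scripts/update-test-weights.sh` is a shell script whose parsing logic is not shown here, and the assumption that the CI logs contain these standard result lines (possibly behind a timestamp prefix) is mine. `ParseDurations` and the sample log text are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// resultLine matches `go test -v` result lines, e.g. "--- PASS: TestFoo (12.34s)".
// It is not anchored, so a leading CI timestamp or indentation is tolerated.
var resultLine = regexp.MustCompile(`--- (?:PASS|FAIL): (\S+) \(([0-9.]+)s\)`)

// ParseDurations returns the duration in seconds of each top-level test found
// in the log text. Subtests (names containing "/") are skipped, and if a test
// appears more than once the longest duration is kept.
func ParseDurations(log string) map[string]float64 {
	durations := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := resultLine.FindStringSubmatch(sc.Text())
		if m == nil || strings.Contains(m[1], "/") {
			continue // not a result line, or a subtest
		}
		secs, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue // malformed duration; ignore the line
		}
		if secs > durations[m[1]] {
			durations[m[1]] = secs
		}
	}
	return durations
}

func main() {
	// Hypothetical log excerpt; test names and timings are invented.
	log := "2024-01-01T00:00:00Z --- PASS: TestSlow (93.50s)\n" +
		"2024-01-01T00:00:01Z     --- PASS: TestSlow/sub (1.20s)\n" +
		"2024-01-01T00:00:02Z --- FAIL: TestFast (2.25s)\n"
	fmt.Println(ParseDurations(log))
}
```

Keeping the longest observed duration is a conservative choice for balancing, since the aim is to identify the slowest tests rather than to record exact timings.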