ci: reduce integration workflow run time #1313

Merged: MrAlias merged 12 commits into open-telemetry:main from skl:skl/ci-updates on Feb 17, 2026

skl (Member) commented Feb 16, 2026

I've been looking at improving the wall clock time of CI (and easing the annoyance of having to re-run failed shards). Compilation seems to be a common bottleneck (~5 mins, based on the gap between downloads completing and tests starting). I initially considered adding a sequential pre-compilation step for the test binary, but this would not actually reduce wall clock time, because every shard would have to wait for that compilation, which the shards currently do in parallel.

This PR enables Go build caching for integration test shards to meaningfully reduce CI wall clock time by speeding up compilation on cache-hit runs:

  • first run when any Go file or dep is changed will take as long as it takes now
  • re-runs, when no file or dep has changed, should be significantly faster (by several minutes)
  • once this lands in main, PRs can inherit the main cache, either completely or incrementally depending on the PR changes
  • if all looks good, can apply same approach to other workflows

Note

I removed the paths trigger restrictions from main and release-* branches, so that relevant workflows ALWAYS run on these branches. I also added concurrency to cancel stale workflow runs when new commits are pushed to a branch, to help reduce the total number of runners during PR iterations (tested during multiple commits on this PR, you can see some cancelled workflows on previous commits).

The cache was previously disabled, presumably due to security concerns, which I've addressed below:

Security summary

| Concern | Mitigation |
| --- | --- |
| Cache poisoning from fork PRs | Fork PR cache writes are scoped to refs/pull/.../merge and cannot be read by main or other PRs (GitHub docs) |
| Tampered build cache entries (~/.cache/go-build via actions/cache) | Go detects out-of-date packages purely based on the content of source files, specified build flags, and metadata stored in the compiled packages; tampered cache entries simply cause a rebuild from source (go help cache) |
| Sensitive data in cache paths | ~/.cache/go-build contains only compiled .a files; ~/go/pkg/mod contains only public module source code. No secrets, credentials, or tokens are present in either path |
| Cross-architecture cache collisions | Cache key includes runner.os and runner.arch to prevent x86_64/ARM64 mix-ups |
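
To illustrate the last row, here is a minimal Go sketch of how a cache key that embeds the OS and architecture keeps x86_64 and ARM64 caches apart. It is not the workflow's actual key expression, which is written in GitHub Actions syntax and is not shown on this page. The function name `cacheKey`, the `go-build-` prefix, and hashing `go.sum` are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey is a hypothetical helper: it combines the runner OS, the runner
// architecture, and a short digest of the dependency lock file into one key.
// Two runners share a cache entry only if all three parts match.
func cacheKey(osName, arch string, goSum []byte) string {
	sum := sha256.Sum256(goSum)
	// Only the first 8 bytes of the digest are used, to keep the key short.
	return fmt.Sprintf("go-build-%s-%s-%s", osName, arch, hex.EncodeToString(sum[:8]))
}

func main() {
	goSum := []byte("example.com/mod v1.0.0 h1:abc=\n") // placeholder content
	fmt.Println(cacheKey("Linux", "X64", goSum))
	fmt.Println(cacheKey("Linux", "ARM64", goSum))
}
```

Because the architecture is part of the key, an ARM64 runner never restores an entry written by an x86_64 runner, even when the dependency digest is identical.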

Performance impact

| Workflow | Step | Before | After |
| --- | --- | --- | --- |
| integration | Total CI minutes | ~3h 0m | ~2h 51m |
| integration | Slowest shard | ~26m | ~26m |
| integration | Install gotestsum | ~1m | ~6s |

Wall clock impact

I wasn't happy with the above: the slowest randomly assigned shard can still have a huge impact on the wall clock time. To address this, I added a new commit, df37be7, which uses individual test run time estimates to distribute the tests more evenly among the shards.
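
The review summary further down this page describes the distribution as an LPT (Longest Processing Time) bin-packing algorithm. The following Go sketch shows that heuristic in isolation. It is not the code from scripts/generate-integration-matrix.sh, which is a shell script whose contents are not shown here. The `shard` type, the `assignLPT` function, and the sample durations are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// shard holds the tests assigned to one CI shard and their summed weight.
type shard struct {
	Tests []string
	Total float64 // estimated run time in seconds
}

// assignLPT distributes tests across n shards with the Longest Processing
// Time heuristic: sort tests by descending weight, then repeatedly give the
// next test to whichever shard currently has the smallest total.
func assignLPT(weights map[string]float64, n int) []shard {
	if n < 1 {
		return nil
	}
	names := make([]string, 0, len(weights))
	for name := range weights {
		names = append(names, name)
	}
	// Heaviest first; ties are broken by name so the result is deterministic.
	sort.Slice(names, func(i, j int) bool {
		wi, wj := weights[names[i]], weights[names[j]]
		if wi != wj {
			return wi > wj
		}
		return names[i] < names[j]
	})

	shards := make([]shard, n)
	for _, name := range names {
		lightest := 0
		for i := 1; i < n; i++ {
			if shards[i].Total < shards[lightest].Total {
				lightest = i
			}
		}
		shards[lightest].Tests = append(shards[lightest].Tests, name)
		shards[lightest].Total += weights[name]
	}
	return shards
}

func main() {
	// Hypothetical durations in seconds, not taken from the PR.
	weights := map[string]float64{
		"TestA": 300, "TestB": 240, "TestC": 120,
		"TestD": 90, "TestE": 60, "TestF": 30,
	}
	for i, s := range assignLPT(weights, 3) {
		fmt.Printf("shard %d: %v (%.0fs)\n", i, s.Tests, s.Total)
	}
}
```

Placing the heaviest tests first means the slow tests are spread out before the short ones fill the remaining gaps, which is why only the ordering of the slowest tests needs to be roughly right.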

We can't use the go test -list approach that other actions use to keep a test runtime history (because that executes init and TestMain functions, which spin up containers etc.), so I added a new script, ./scripts/update-test-weights.sh, which reads the test run times from a manually downloaded CI log archive. We don't need exact times for every test; we just need to know which tests are the slowest and ensure they're distributed appropriately. So the weights only need to be updated manually every so often (when shards appear to be unbalanced).

Example:

./scripts/update-test-weights.sh ~/Downloads/logs_57563497044
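
As a rough illustration of the idea behind that script, the Go sketch below extracts per-test durations from log text. It assumes the logs contain standard `go test -v` result lines such as `--- PASS: TestFoo (12.34s)`. The real script is written in shell and the exact log format it reads is not shown on this page, so `parseDurations` and the sample log are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// resultLine matches top-level result lines like "--- PASS: TestFoo (12.34s)".
// It is unanchored so that a CI timestamp prefix is tolerated. Subtest lines
// ("TestFoo/bar") do not match because the name may not contain a slash.
var resultLine = regexp.MustCompile(`--- (?:PASS|FAIL): (Test[^\s/]+) \((\d+(?:\.\d+)?)s\)`)

// parseDurations returns the longest observed duration, in seconds, for each
// top-level test found in the log text.
func parseDurations(log string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := resultLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		secs, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue
		}
		// Keep the slowest run if a test appears more than once.
		if secs > out[m[1]] {
			out[m[1]] = secs
		}
	}
	return out
}

func main() {
	// Hypothetical log excerpt, not taken from a real CI run.
	log := "2026-02-16T14:00:00Z --- PASS: TestAlpha (12.50s)\n" +
		"    --- PASS: TestAlpha/sub (0.10s)\n" +
		"--- FAIL: TestBeta (3.00s)\n"
	for name, secs := range parseDurations(log) {
		fmt.Printf("%s: %.2fs\n", name, secs)
	}
}
```

Keeping the maximum per test errs on the side of overestimating slow tests, which suits the goal of spreading the slowest tests across shards.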

Incremental comparison with table above:

| Workflow | Step | Before | After |
| --- | --- | --- | --- |
| integration | Total CI minutes | ~2h 51m | ~2h 52m |
| integration | Slowest shard | ~26m | ~21m |
| integration | Install gotestsum | ~1m | ~6s |

Bottom line: ~8 CI minutes saved and ~5 minutes faster by the wall clock overall.

@skl skl requested a review from a team as a code owner February 16, 2026 14:45
codecov Bot commented Feb 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.51%. Comparing base (19826d4) to head (d24de73).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1313       +/-   ##
===========================================
+ Coverage   19.37%   43.51%   +24.13%     
===========================================
  Files         240      305       +65     
  Lines       28045    32864     +4819     
===========================================
+ Hits         5434    14301     +8867     
+ Misses      21969    17643     -4326     
- Partials      642      920      +278     
| Flag | Coverage Δ |
| --- | --- |
| integration-test | 21.49% <ø> (ø) |
| integration-test-arm | 0.00% <ø> (ø) |
| integration-test-vm-x86_64-5.15.152 | 0.00% <ø> (ø) |
| integration-test-vm-x86_64-6.10.6 | 0.00% <ø> (ø) |
| k8s-integration-test | 2.36% <ø> (ø) |
| oats-test | 0.00% <ø> (ø) |
| unittests | 44.29% <ø> (?) |

Copilot AI (Contributor) left a comment

Pull request overview

This PR optimizes CI workflow execution time by implementing Go build caching and intelligent test distribution. The changes enable caching for integration test compilation (reducing re-run times by several minutes), add concurrency controls to cancel stale workflow runs, and introduce a weighted bin-packing algorithm to better balance test shards based on actual test execution times.

Changes:

  • Enabled Go module and build caching for integration tests with security annotations
  • Implemented LPT bin-packing algorithm for test shard distribution using historical test durations
  • Added concurrency controls across all workflows to cancel outdated runs during PR iterations
  • Removed path filters from push events on main/release branches to ensure cache is always built

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/update-test-weights.sh | New script to extract test durations from CI logs and generate weights file |
| scripts/integration-test-weights.generated.json | Generated JSON file containing test duration weights for 84 integration tests |
| scripts/generate-integration-matrix.sh | Updated matrix generation to use weighted bin-packing instead of random distribution |
| .github/workflows/pull_request_integration_tests.yml | Enabled Go module cache and added explicit build cache with security annotations |
| .github/workflows/workflow_integration_tests_vm.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_oats_test.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_k8s_integration_tests.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_integration_tests_arm.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/pull_request_docker_build_test.yml | Added concurrency control |
| .github/workflows/pull_request.yml | Added concurrency control |
| .github/workflows/markdown-fail-fast.yml | Added branches filter to push events, removed paths filter, added concurrency control |
| .github/workflows/lint_darwin.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/java-agent.yml | Removed path filters from push events, added concurrency control |
| .github/workflows/clang-tidy-check.yml | Added concurrency control |
| .github/workflows/clang-format-check.yml | Added concurrency control |
| .github/workflows/check_gh_actions_security.yml | Added concurrency control |
| .gitattributes | Marked integration-test-weights.generated.json as linguist-generated |

Comment thread scripts/update-test-weights.sh Outdated
Comment thread scripts/update-test-weights.sh Outdated
Comment thread scripts/update-test-weights.sh Outdated
Comment thread .github/workflows/markdown-fail-fast.yml Outdated
skl and others added 4 commits February 16, 2026 19:01
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@skl skl mentioned this pull request Feb 17, 2026
MrAlias (Contributor) left a comment

🚀

@MrAlias MrAlias merged commit 7475bff into open-telemetry:main Feb 17, 2026
80 of 82 checks passed
@skl skl deleted the skl/ci-updates branch February 17, 2026 17:13
@MrAlias MrAlias added this to the v0.6.0 milestone Feb 23, 2026
@MrAlias MrAlias mentioned this pull request Mar 5, 2026