ci: bump linux link box to 2xlarge (64 GiB) for full-LTO#29650
Conversation
Since 02fbd62 (enable LTO for bun-zig.o), the linux-aarch64 link step OOMs on r8g.xlarge: lld's full-LTO peak crosses the 31.5 GiB available (no swap) ~18 min in and the kernel kills it. x64 is also at the edge (18.98 min on r7i.xlarge). Bump both to 2xlarge (8 vCPU / 64 GiB). The extra cores also speed up lld's parallel LTO codegen phase.
|
Updated 4:46 PM PT - Apr 23rd, 2026
❌ Your commit
🧪 To try this PR locally: bunx bun-pr 29650That installs a local version of the PR into your bun-29650 --bun |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
Disabled knowledge base sources:
WalkthroughUpdates EC2 instance sizing in the link-bun build step by increasing instance types from xlarge to 2xlarge variants based on architecture, with an added explanatory comment. Changes
Comment |
There was a problem hiding this comment.
LGTM — straightforward CI instance-size bump to fix LTO link OOMs.
Extended reasoning...
Overview
This PR changes a single line in .buildkite/ci.mjs (getLinkBunAgent), bumping the Linux link-step EC2 instance type from r8g.xlarge/r7i.xlarge (32 GiB) to r8g.2xlarge/r7i.2xlarge (64 GiB), plus an explanatory comment. No logic, control flow, or other agents are touched.
Security risks
None. This is a CI infrastructure config tweak — it only selects a larger AWS instance type for the existing build pipeline. No auth, crypto, permissions, or production code paths are involved.
Level of scrutiny
Low. The change is mechanical and self-contained: a string literal swap for an instance size. The PR description provides clear, well-measured justification (OOM-killed lld during full-LTO link on aarch64 since #29618), and proactively bumps x64 too since it's near the same threshold. The instance family stays the same (r8g/r7i), so AMI/arch compatibility is preserved.
Other factors
No CODEOWNERS entry covers .buildkite/. No prior reviewer comments to address. The bug-hunting system found no issues. The only tradeoff is CI cost, which is an operational decision the author has clearly weighed against the alternative (broken main CI).
…mory' (#29663) `test/js/bun/net/socket.test.ts` "should not leak memory" has been hitting `TCPSocket`/`TLSSocket` count `4` on the Windows 2019 agents, one past the previous `isWindows ? 3` bound. Same flake shipped with merged #29631 today and has been blocking #29593 through ~15 retriggers. Non-Windows bound stays at `2` — that's where real retention regressions are caught. The Windows bound is already loosened for its shared prototype/structure retention; this just accommodates the one additional retained instance the Win2019 lane is observing. ## Observed - Build [#47600 Win2019 x64](https://buildkite.com/bun/bun/builds/47600#019dbc49-ff0c-49c2-afaa-134711815456) — `Received: 4, Expected: <= 3` - Build [#47616 Win2019 x64](https://buildkite.com/bun/bun/builds/47616#019dbc99-fc61-4111-8170-080d9729866a) — same - Multiple prior PRs: [#29631](#29631), [#29651](#29651), [#29649](#29649), [#29650](#29650), [#29645](#29645), [#29639](#29639) all hit the same pattern ## Scope 3-line diff + comment. If the Win2019 retention count IS a regression worth investigating, that's a separate issue; this change is strictly about not holding other PRs hostage to the shared flake. Co-authored-by: robobun <robobun@bun.sh>
Since 02fbd62 (oven-sh#29618, enable LTO for bun-zig.o), `linux aarch64 - build-bun` OOMs on `r8g.xlarge`: lld's full-LTO link now ingests ~1.39 GB of bitcode (794 MB JSC + 347 MB libbun-profile + 247 MB bun-zig.o), peak memory crosses the 31.5 GiB available (no swap) about 18 min in, and the kernel OOM-kills lld — sometimes taking the BuildKite agent with it (`exit:-1`). This was already failing on the PR's own CI (build 47390) before merge. | target | instance | bitcode in | result | |---|---|---|---| | linux x64-glibc | r7i.xlarge (4c/32G) | 1392 MB | pass, 18.98 min | | linux aarch64-musl | r8g.xlarge (4c/32G) | 1310 MB | pass, 18.15 min | | **linux aarch64-glibc** | r8g.xlarge (4c/32G) | 1388 MB | **Killed, 18.75 min** | x64 has slightly *more* bitcode and survives, so the tipping factor is LLVM's AArch64 backend holding more state during codegen than X86 — not a clang bug, just expected full-LTO scaling. x64 is one bitcode-adding commit away from the same fate. Bump both linux link boxes to `2xlarge` (8 vCPU / 64 GiB). The extra cores also help lld's parallel LTO codegen phase (the single-threaded merge/opt half won't benefit, so expect ~14–15 min rather than half of 19). Co-authored-by: root <root@ip-10-0-2-234.us-west-2.compute.internal>
…mory' (oven-sh#29663) `test/js/bun/net/socket.test.ts` "should not leak memory" has been hitting `TCPSocket`/`TLSSocket` count `4` on the Windows 2019 agents, one past the previous `isWindows ? 3` bound. Same flake shipped with merged oven-sh#29631 today and has been blocking oven-sh#29593 through ~15 retriggers. Non-Windows bound stays at `2` — that's where real retention regressions are caught. The Windows bound is already loosened for its shared prototype/structure retention; this just accommodates the one additional retained instance the Win2019 lane is observing. ## Observed - Build [#47600 Win2019 x64](https://buildkite.com/bun/bun/builds/47600#019dbc49-ff0c-49c2-afaa-134711815456) — `Received: 4, Expected: <= 3` - Build [#47616 Win2019 x64](https://buildkite.com/bun/bun/builds/47616#019dbc99-fc61-4111-8170-080d9729866a) — same - Multiple prior PRs: [oven-sh#29631](oven-sh#29631), [oven-sh#29651](oven-sh#29651), [oven-sh#29649](oven-sh#29649), [oven-sh#29650](oven-sh#29650), [oven-sh#29645](oven-sh#29645), [oven-sh#29639](oven-sh#29639) all hit the same pattern ## Scope 3-line diff + comment. If the Win2019 retention count IS a regression worth investigating, that's a separate issue; this change is strictly about not holding other PRs hostage to the shared flake. Co-authored-by: robobun <robobun@bun.sh>
Since 02fbd62 (oven-sh#29618, enable LTO for bun-zig.o), `linux aarch64 - build-bun` OOMs on `r8g.xlarge`: lld's full-LTO link now ingests ~1.39 GB of bitcode (794 MB JSC + 347 MB libbun-profile + 247 MB bun-zig.o), peak memory crosses the 31.5 GiB available (no swap) about 18 min in, and the kernel OOM-kills lld — sometimes taking the BuildKite agent with it (`exit:-1`). This was already failing on the PR's own CI (build 47390) before merge. | target | instance | bitcode in | result | |---|---|---|---| | linux x64-glibc | r7i.xlarge (4c/32G) | 1392 MB | pass, 18.98 min | | linux aarch64-musl | r8g.xlarge (4c/32G) | 1310 MB | pass, 18.15 min | | **linux aarch64-glibc** | r8g.xlarge (4c/32G) | 1388 MB | **Killed, 18.75 min** | x64 has slightly *more* bitcode and survives, so the tipping factor is LLVM's AArch64 backend holding more state during codegen than X86 — not a clang bug, just expected full-LTO scaling. x64 is one bitcode-adding commit away from the same fate. Bump both linux link boxes to `2xlarge` (8 vCPU / 64 GiB). The extra cores also help lld's parallel LTO codegen phase (the single-threaded merge/opt half won't benefit, so expect ~14–15 min rather than half of 19). Co-authored-by: root <root@ip-10-0-2-234.us-west-2.compute.internal>
…mory' (oven-sh#29663) `test/js/bun/net/socket.test.ts` "should not leak memory" has been hitting `TCPSocket`/`TLSSocket` count `4` on the Windows 2019 agents, one past the previous `isWindows ? 3` bound. Same flake shipped with merged oven-sh#29631 today and has been blocking oven-sh#29593 through ~15 retriggers. Non-Windows bound stays at `2` — that's where real retention regressions are caught. The Windows bound is already loosened for its shared prototype/structure retention; this just accommodates the one additional retained instance the Win2019 lane is observing. ## Observed - Build [#47600 Win2019 x64](https://buildkite.com/bun/bun/builds/47600#019dbc49-ff0c-49c2-afaa-134711815456) — `Received: 4, Expected: <= 3` - Build [#47616 Win2019 x64](https://buildkite.com/bun/bun/builds/47616#019dbc99-fc61-4111-8170-080d9729866a) — same - Multiple prior PRs: [oven-sh#29631](oven-sh#29631), [oven-sh#29651](oven-sh#29651), [oven-sh#29649](oven-sh#29649), [oven-sh#29650](oven-sh#29650), [oven-sh#29645](oven-sh#29645), [oven-sh#29639](oven-sh#29639) all hit the same pattern ## Scope 3-line diff + comment. If the Win2019 retention count IS a regression worth investigating, that's a separate issue; this change is strictly about not holding other PRs hostage to the shared flake. Co-authored-by: robobun <robobun@bun.sh>
Since 02fbd62 (#29618, enable LTO for bun-zig.o),
linux aarch64 - build-bunOOMs onr8g.xlarge: lld's full-LTO link now ingests ~1.39 GB of bitcode (794 MB JSC + 347 MB libbun-profile + 247 MB bun-zig.o), peak memory crosses the 31.5 GiB available (no swap) about 18 min in, and the kernel OOM-kills lld — sometimes taking the BuildKite agent with it (exit:-1).This was already failing on the PR's own CI (build 47390) before merge.
x64 has slightly more bitcode and survives, so the tipping factor is LLVM's AArch64 backend holding more state during codegen than X86 — not a clang bug, just expected full-LTO scaling. x64 is one bitcode-adding commit away from the same fate.
Bump both linux link boxes to
2xlarge(8 vCPU / 64 GiB). The extra cores also help lld's parallel LTO codegen phase (the single-threaded merge/opt half won't benefit, so expect ~14–15 min rather than half of 19).