ci: bump linux link box to 2xlarge (64 GiB) for full-LTO #29650

Merged: Jarred-Sumner merged 1 commit into main from claude/link-box-2xlarge on Apr 23, 2026

@Jarred-Sumner

Collaborator

Since 02fbd62 (#29618, enable LTO for bun-zig.o), `linux aarch64 - build-bun` OOMs on `r8g.xlarge`. lld's full-LTO link now ingests ~1.39 GB of bitcode (794 MB JSC + 347 MB libbun-profile + 247 MB bun-zig.o), peak memory crosses the 31.5 GiB available (no swap) about 18 min in, and the kernel OOM-kills lld, sometimes taking the Buildkite agent with it (`exit:-1`).
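As a quick sanity check on those figures, the sketch below sums the three bitcode inputs and compares the total with the memory available on the box. It assumes the quoted "MB" values are decimal megabytes and that the link needed at least the full 31.5 GiB when it was killed; the variable names are illustrative only.

```javascript
// Bitcode inputs to the aarch64-glibc link, in MB, as quoted in the description.
const bitcodeMB = { jsc: 794, libbunProfile: 347, bunZig: 247 };
const totalMB = Object.values(bitcodeMB).reduce((sum, mb) => sum + mb, 0);

// Memory available on the xlarge box per the description (GiB, no swap).
const availableGiB = 31.5;
// Assumption: "MB" above means decimal megabytes, so convert GiB -> bytes -> MB.
const availableMB = (availableGiB * 1024 ** 3) / 1e6;

// Lower bound on the peak-memory to bitcode-size ratio at the point of the OOM kill.
const amplification = availableMB / totalMB;

console.log(`total bitcode: ${totalMB} MB`); // total bitcode: 1388 MB
console.log(`peak memory >= ${amplification.toFixed(1)}x bitcode size`);
```

Under those assumptions the total matches the 1388 MB reported for aarch64-glibc, and the link's peak memory is at least roughly 24 times the size of its bitcode input.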

This was already failing on the PR's own CI (build 47390) before merge.

| target | instance | bitcode in | result |
|---|---|---|---|
| linux x64-glibc | r7i.xlarge (4c/32G) | 1392 MB | pass, 18.98 min |
| linux aarch64-musl | r8g.xlarge (4c/32G) | 1310 MB | pass, 18.15 min |
| **linux aarch64-glibc** | r8g.xlarge (4c/32G) | 1388 MB | **Killed, 18.75 min** |

x64 has slightly more bitcode and survives, so the tipping factor is LLVM's AArch64 backend holding more state during codegen than X86 — not a clang bug, just expected full-LTO scaling. x64 is one bitcode-adding commit away from the same fate.

Bump both linux link boxes to 2xlarge (8 vCPU / 64 GiB). The extra cores also help lld's parallel LTO codegen phase (the single-threaded merge/opt half won't benefit, so expect ~14–15 min rather than half of 19).
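The ~14–15 min expectation is consistent with a simple Amdahl-style estimate. The sketch below is not a measurement: it assumes exactly half of the ~19 min baseline is the single-threaded merge/opt phase and that the codegen half scales perfectly with core count. The function name is hypothetical.

```javascript
// Estimate wall time when only part of the link scales with core count.
// serialFraction is the share of the baseline run that stays single-threaded.
function estimateLinkMinutes(baselineMinutes, baselineCores, newCores, serialFraction) {
  const serial = baselineMinutes * serialFraction;
  const parallel = baselineMinutes - serial;
  // Assumption: the parallel part scales linearly with the number of cores.
  return serial + parallel * (baselineCores / newCores);
}

// Assumption: "half" of the ~19 min link on 4 cores is single-threaded.
const estimate = estimateLinkMinutes(19, 4, 8, 0.5);
console.log(estimate.toFixed(2)); // 14.25
```

Real codegen rarely scales perfectly, so the observed time would likely land somewhat above this figure.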

Since 02fbd62 (enable LTO for bun-zig.o), the linux-aarch64 link step
OOMs on r8g.xlarge: lld's full-LTO peak crosses the 31.5 GiB available
(no swap) ~18 min in and the kernel kills it. x64 is also at the edge
(18.98 min on r7i.xlarge).

Bump both to 2xlarge (8 vCPU / 64 GiB). The extra cores also speed up
lld's parallel LTO codegen phase.
@robobun

robobun commented Apr 23, 2026

Collaborator
Updated 4:46 PM PT - Apr 23rd, 2026

❌ Your commit 9d436d59 has 2 failures in Build #47592 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 29650

That installs a local version of the PR into your bun-29650 executable, so you can run:

bun-29650 --bun

@Jarred-Sumner Jarred-Sumner merged commit c077b02 into main Apr 23, 2026
6 of 8 checks passed
@Jarred-Sumner Jarred-Sumner deleted the claude/link-box-2xlarge branch April 23, 2026 21:03
@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Contributor

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d8243859-a9c1-41f8-9959-bfab4bd34deb

📥 Commits

Reviewing files that changed from the base of the PR and between bdae27c and 9d436d5.

📒 Files selected for processing (1)
  • .buildkite/ci.mjs



Walkthrough

Updates EC2 instance sizing in the link-bun build step by increasing instance types from xlarge to 2xlarge variants based on architecture, with an added explanatory comment.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Build Configuration**<br>`.buildkite/ci.mjs` | Updates the `getLinkBunAgent` function to increase Linux EC2 instance types from `r8g.xlarge`/`r7i.xlarge` to `r8g.2xlarge`/`r7i.2xlarge` based on architecture, with an added inline comment. |


@claude claude Bot left a comment

Contributor


LGTM — straightforward CI instance-size bump to fix LTO link OOMs.

Extended reasoning...

Overview

This PR changes a single line in .buildkite/ci.mjs (getLinkBunAgent), bumping the Linux link-step EC2 instance type from r8g.xlarge/r7i.xlarge (32 GiB) to r8g.2xlarge/r7i.2xlarge (64 GiB), plus an explanatory comment. No logic, control flow, or other agents are touched.
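For readers without the diff at hand, the change amounts to an architecture-to-instance-type mapping along the following lines. This is a hypothetical sketch, not the actual body of `getLinkBunAgent`; only the instance type strings come from this PR, and the helper name and the `"aarch64"`/`"x64"` keys are assumptions.

```javascript
// Hypothetical sketch; the real getLinkBunAgent in .buildkite/ci.mjs is not shown on this page.
// 2xlarge = 8 vCPU / 64 GiB. The full-LTO link peaked past 31.5 GiB on xlarge.
const LINK_INSTANCE_TYPES = new Map([
  ["aarch64", "r8g.2xlarge"], // was r8g.xlarge
  ["x64", "r7i.2xlarge"], // was r7i.xlarge
]);

function linkInstanceType(arch) {
  const instanceType = LINK_INSTANCE_TYPES.get(arch);
  if (instanceType === undefined) {
    throw new Error(`no link instance type configured for arch: ${arch}`);
  }
  return instanceType;
}

console.log(linkInstanceType("aarch64")); // r8g.2xlarge
```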

Security risks

None. This is a CI infrastructure config tweak — it only selects a larger AWS instance type for the existing build pipeline. No auth, crypto, permissions, or production code paths are involved.

Level of scrutiny

Low. The change is mechanical and self-contained: a string literal swap for an instance size. The PR description provides clear, well-measured justification (OOM-killed lld during full-LTO link on aarch64 since #29618), and proactively bumps x64 too since it's near the same threshold. The instance family stays the same (r8g/r7i), so AMI/arch compatibility is preserved.

Other factors

No CODEOWNERS entry covers .buildkite/. No prior reviewer comments to address. The bug-hunting system found no issues. The only tradeoff is CI cost, which is an operational decision the author has clearly weighed against the alternative (broken main CI).

Jarred-Sumner pushed a commit that referenced this pull request Apr 24, 2026
…mory' (#29663)

`test/js/bun/net/socket.test.ts` "should not leak memory" has been
hitting `TCPSocket`/`TLSSocket` count `4` on the Windows 2019 agents,
one past the previous `isWindows ? 3` bound. Same flake shipped with
merged #29631 today and has been blocking #29593 through ~15 retriggers.

Non-Windows bound stays at `2` — that's where real retention regressions
are caught. The Windows bound is already loosened for its shared
prototype/structure retention; this just accommodates the one additional
retained instance the Win2019 lane is observing.

## Observed

- Build [#47600 Win2019 x64](https://buildkite.com/bun/bun/builds/47600#019dbc49-ff0c-49c2-afaa-134711815456): `Received: 4, Expected: <= 3`
- Build [#47616 Win2019 x64](https://buildkite.com/bun/bun/builds/47616#019dbc99-fc61-4111-8170-080d9729866a): same
- Multiple prior PRs: #29631, #29651, #29649, #29650, #29645, #29639 all hit the same pattern

## Scope

3-line diff + comment. If the Win2019 retention count IS a regression
worth investigating, that's a separate issue; this change is strictly
about not holding other PRs hostage to the shared flake.

Co-authored-by: robobun <robobun@bun.sh>
structwafel pushed a commit to structwafel/bun that referenced this pull request Apr 25, 2026
structwafel pushed a commit to structwafel/bun that referenced this pull request Apr 25, 2026
…mory' (oven-sh#29663)

xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…mory' (oven-sh#29663)
