install: fix isolated installer hang on tarball integrity failure#29649
Conversation
When a tarball extract task failed (integrity mismatch, libarchive error, etc.) during the resolve phase, `runTasks` fell through the void-callback else-branch without removing the task from `network_dedupe_map` or `task_queue`. The isolated installer's later `enqueuePackageForDownload` then hit `task_queue.found_existing` and returned early without scheduling a fresh network task — the installer waited forever for a callback that was never dispatched. Mirror the HTTP 4xx/5xx branch (#29611): drop the dedupe-map entry before the callback dispatch so the Store.Installer path is covered, and drain the `task_queue` entry in the void-callback fallback.
|
Updated 4:11 PM PT - Apr 23rd, 2026
❌ @robobun, your commit 6728b9f has 1 failures in
🧪 To try this PR locally: bunx bun-pr 29649That installs a local version of the PR into your bun-29649 --bun |
|
Found 3 issues this PR may fix:
🤖 Generated with Claude Code |
WalkthroughAligns tarball extraction failure handling with HTTP error handling by clearing network dedupe state, invoking download-error callbacks when present, and cleaning up task queue entries to prevent deadlocks. Adds a regression test that reproduces and verifies behavior for integrity-mismatch tarballs. Changes
Possibly related PRs
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Thanks. Looking at each:
Staying conservative with just |
As flagged in review: the original assertions (`Integrity check failed`
in stderr, no `1 package installed`, non-zero exit) all passed whether
the fix was in place or not. The hang path still logs the error before
wedging, and when Bun.spawn's timeout (15s) kills the child the exit
code is 128+SIGTERM = 143, which is non-zero.
The load-bearing check is `proc.signalCode === null`. With the fix, bun
exits on its own. On hang, Bun.spawn kills it with SIGTERM.
Also set an explicit `{ timeout: 60_000 }` on the test so bun:test's
default doesn't fire before Bun.spawn's timeout — otherwise we can't
observe `signalCode`.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/cli/install/bun-install-tarball-integrity.test.ts`:
- Around line 495-507: Change the test matrix runner from
describe.concurrent.each to describe.each for the "tarball integrity mismatch
(%s)" block (the top-level describe that wraps the it("should fail (not hang)
when tarball bytes don't match manifest SHA-512") test) so the hoisted/isolated
variants run sequentially; keep the rest of the test (including the timeout:
60_000 and test body) unchanged to avoid CPU/IO contention and timing flakiness.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 33ced23d-df21-4154-b1d4-9c86be76618e
📒 Files selected for processing (1)
test/cli/install/bun-install-tarball-integrity.test.ts
There was a problem hiding this comment.
LGTM — extract-failure cleanup now mirrors the existing HTTP-error branches, and the test was strengthened per earlier feedback.
Extended reasoning...
Overview
This PR fixes an isolated-installer hang when tarball extraction fails (e.g., integrity check mismatch). It touches two files: src/install/PackageManager/runTasks.zig (the extract-task failure branch in the resolve-task loop) and test/cli/install/bun-install-tarball-integrity.test.ts (a new parameterized regression test for hoisted/isolated linkers).
The Zig change is a direct mirror of the cleanup pattern already established by #29611 for the two adjacent failure branches in the same function (metadata == null and status_code > 399): remove the network_dedupe_map entry before dispatching the error, and in the void-callback fallback also fetchRemove + deinit the task_queue entry. The if/else → if-continue-fallthrough refactor is mechanical and preserves the original addErrorFmt behavior on the fallback path.
Security risks
None. This is error-path cleanup in the package manager's task loop — no auth, crypto, permissions, or untrusted-input parsing changes. The integrity check itself is unchanged; only the post-failure bookkeeping is fixed.
Level of scrutiny
Moderate. runTasks.zig is a core install loop, but the change is small (~20 LOC of logic), copies an existing pattern verbatim from two sibling branches in the same function, and is well-explained in the PR description. The network_dedupe_map.remove is documented as a no-op for local_tarball tasks, and the task_queue drain matches the other branches exactly, so there's no double-free or ordering concern.
Other factors
I previously left an inline nit that the regression test wouldn't catch the hang (all assertions passed even when the spawn timeout killed the process). The author addressed this in commit 6728b9f by adding expect(proc.signalCode).toBeNull() as the load-bearing assertion, with a clear comment explaining why. No CODEOWNERS apply to src/install/. The bug-hunting system found no issues on the current revision.
…mory' (#29663) `test/js/bun/net/socket.test.ts` "should not leak memory" has been hitting `TCPSocket`/`TLSSocket` count `4` on the Windows 2019 agents, one past the previous `isWindows ? 3` bound. Same flake shipped with merged #29631 today and has been blocking #29593 through ~15 retriggers. Non-Windows bound stays at `2` — that's where real retention regressions are caught. The Windows bound is already loosened for its shared prototype/structure retention; this just accommodates the one additional retained instance the Win2019 lane is observing. ## Observed - Build [#47600 Win2019 x64](https://buildkite.com/bun/bun/builds/47600#019dbc49-ff0c-49c2-afaa-134711815456) — `Received: 4, Expected: <= 3` - Build [#47616 Win2019 x64](https://buildkite.com/bun/bun/builds/47616#019dbc99-fc61-4111-8170-080d9729866a) — same - Multiple prior PRs: [#29631](#29631), [#29651](#29651), [#29649](#29649), [#29650](#29650), [#29645](#29645), [#29639](#29639) all hit the same pattern ## Scope 3-line diff + comment. If the Win2019 retention count IS a regression worth investigating, that's a separate issue; this change is strictly about not holding other PRs hostage to the shared flake. Co-authored-by: robobun <robobun@bun.sh>
…en-sh#29649) Closes oven-sh#29646. ## Repro Advertise one tarball's SHA-512 in the registry manifest while serving a different tarball's bytes. The isolated-linker install hangs indefinitely (no output past "Created store"); hoisted errors out. ``` error: Integrity check failed for tarball: pkg Resolved, downloaded and extracted [4] error: IntegrityCheckFailed extracting tarball from pkg Clean lockfile: 2 packages - 2 packages in 281.7us Resolved peers [842.2us] Created store [85.1us] <hang> ``` ## Cause Partially fixed by oven-sh#29611 for HTTP 4xx/5xx responses. The extract-task failure branch in `src/install/PackageManager/runTasks.zig` was missed. When the streaming extractor sets `task.err = error.IntegrityCheckFailed` and `task.status = .fail`, the resolve-phase `runTasks` (which passes `onPackageDownloadError = {}`) fell into the void-callback `else` that only logged the error and continued — never removing the entry from `network_dedupe_map` or draining its `task_queue` list. Install-phase `Store.Installer` then called `enqueuePackageForDownload` for the failed package, hit `task_queue.found_existing == true` at `PackageManagerEnqueue.zig:258`, returned early without scheduling a new network task, and the isolated installer waited forever on `pendingTaskCount() == 0`. ## Fix Mirror the 4xx/5xx cleanup symmetrically for extract failures in `runTasks.zig`: - Drop the `network_dedupe_map` entry up front, before the callback branch, so the `Store.Installer` path (which `continue`s from the callback) is also covered. `Store.Installer.onPackageDownloadError` drains `task_queue` itself but doesn't touch the dedupe map. - In the void-callback fallback, also drain the `task_queue` entry, matching the existing HTTP-error branch. `network_dedupe_map.remove` is a no-op for `local_tarball` tasks (they never populate the map). ## Verification Extended `test/cli/install/bun-install-tarball-integrity.test.ts` with a new `describe.each(["hoisted", "isolated"])` block that builds a tarball whose advertised SHA-512 doesn't match the served bytes. Without the fix, `isolated` times out at 5s; with the fix, both linkers surface `Integrity check failed` and exit non-zero. --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
…mory' (oven-sh#29663) `test/js/bun/net/socket.test.ts` "should not leak memory" has been hitting `TCPSocket`/`TLSSocket` count `4` on the Windows 2019 agents, one past the previous `isWindows ? 3` bound. Same flake shipped with merged oven-sh#29631 today and has been blocking oven-sh#29593 through ~15 retriggers. Non-Windows bound stays at `2` — that's where real retention regressions are caught. The Windows bound is already loosened for its shared prototype/structure retention; this just accommodates the one additional retained instance the Win2019 lane is observing. ## Observed - Build [#47600 Win2019 x64](https://buildkite.com/bun/bun/builds/47600#019dbc49-ff0c-49c2-afaa-134711815456) — `Received: 4, Expected: <= 3` - Build [#47616 Win2019 x64](https://buildkite.com/bun/bun/builds/47616#019dbc99-fc61-4111-8170-080d9729866a) — same - Multiple prior PRs: [oven-sh#29631](oven-sh#29631), [oven-sh#29651](oven-sh#29651), [oven-sh#29649](oven-sh#29649), [oven-sh#29650](oven-sh#29650), [oven-sh#29645](oven-sh#29645), [oven-sh#29639](oven-sh#29639) all hit the same pattern ## Scope 3-line diff + comment. If the Win2019 retention count IS a regression worth investigating, that's a separate issue; this change is strictly about not holding other PRs hostage to the shared flake. Co-authored-by: robobun <robobun@bun.sh>
…0376) Fixes the `bun install` hang reported in #30325 (latest comment — still reproducing on 1.3.13). ## Repro Point `bun install` at an HTTPS registry that accepts TCP but never answers the TLS ClientHello: ```ts // raw TCP server that swallows the ClientHello and never replies net.createServer(s => s.on("data", () => {})).listen(0); ``` ```toml [install] registry = "https://127.0.0.1:<port>/" ``` `bun install` connects, the socket goes ESTABLISHED, and the process blocks in `epoll_wait` forever with no timer armed. This is the state the reporter captured in their Gitea/Kubernetes CI: three ESTABLISHED sockets to the npm CDN, zero rx/tx, 14+ minutes and counting. ## Root cause `HTTPClient.onOpen()` starts the TLS handshake but does not arm the socket's idle timer — the first `setTimeout(socket, 5)` call is in `onWritable()`, which only runs *after* the handshake completes. Freshly-connected sockets inherit `long_timeout = 255` (disabled) from the connecting socket, so a stall anywhere between TCP-connect and handshake-done has no timer at all. The `bun install` main loop then waits forever on `pendingTaskCount() == 0` because the `NetworkTask` callback never fires. The earlier fixes in #29611 / #29649 covered a different hang (4xx/5xx tarball responses not releasing the task slot); they didn't touch this path. ## Fix - Arm the idle timer in `onOpen()` so it covers the TLS handshake. - Wire the short-tick `onTimeout` handler in `HTTPContext.Handler` alongside the existing `onLongTimeout` — `socket.setTimeout(seconds)` picks whichever timer fits the duration, so both must dispatch. - Read the idle-timeout duration from a new `BUN_CONFIG_HTTP_IDLE_TIMEOUT` env var (seconds). Default is 300 — the previous hard-coded 5 minutes — so nothing changes for unconfigured environments except that the handshake phase is now covered. `0` disables the timer (same as `disable_timeout = true`). - Route the experimental h2 client session's `rearmTimeout` through the same value for consistency. ## Verification New test `test/cli/install/bun-install-stalled-tls.test.ts` starts a raw TCP server that accepts connections and never replies, points `bun install` at it over `https://`, sets `BUN_CONFIG_HTTP_IDLE_TIMEOUT=3` / `BUN_CONFIG_HTTP_RETRY_COUNT=0`, and asserts the install fails with a timeout error. ``` # without this change (fail) bun install times out when the registry accepts TCP but never completes the TLS handshake [60004.48ms] ^ this test timed out after 60000ms. # with this change (pass) bun install times out when the registry accepts TCP but never completes the TLS handshake [4483.87ms] ``` `fetch-http2-client.test.ts` (58 tests) and `bun-install-retry.test.ts` still pass. Fixes #30325 --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
…en-sh#30376) Fixes the `bun install` hang reported in oven-sh#30325 (latest comment — still reproducing on 1.3.13). ## Repro Point `bun install` at an HTTPS registry that accepts TCP but never answers the TLS ClientHello: ```ts // raw TCP server that swallows the ClientHello and never replies net.createServer(s => s.on("data", () => {})).listen(0); ``` ```toml [install] registry = "https://127.0.0.1:<port>/" ``` `bun install` connects, the socket goes ESTABLISHED, and the process blocks in `epoll_wait` forever with no timer armed. This is the state the reporter captured in their Gitea/Kubernetes CI: three ESTABLISHED sockets to the npm CDN, zero rx/tx, 14+ minutes and counting. ## Root cause `HTTPClient.onOpen()` starts the TLS handshake but does not arm the socket's idle timer — the first `setTimeout(socket, 5)` call is in `onWritable()`, which only runs *after* the handshake completes. Freshly-connected sockets inherit `long_timeout = 255` (disabled) from the connecting socket, so a stall anywhere between TCP-connect and handshake-done has no timer at all. The `bun install` main loop then waits forever on `pendingTaskCount() == 0` because the `NetworkTask` callback never fires. The earlier fixes in oven-sh#29611 / oven-sh#29649 covered a different hang (4xx/5xx tarball responses not releasing the task slot); they didn't touch this path. ## Fix - Arm the idle timer in `onOpen()` so it covers the TLS handshake. - Wire the short-tick `onTimeout` handler in `HTTPContext.Handler` alongside the existing `onLongTimeout` — `socket.setTimeout(seconds)` picks whichever timer fits the duration, so both must dispatch. - Read the idle-timeout duration from a new `BUN_CONFIG_HTTP_IDLE_TIMEOUT` env var (seconds). Default is 300 — the previous hard-coded 5 minutes — so nothing changes for unconfigured environments except that the handshake phase is now covered. `0` disables the timer (same as `disable_timeout = true`). - Route the experimental h2 client session's `rearmTimeout` through the same value for consistency. ## Verification New test `test/cli/install/bun-install-stalled-tls.test.ts` starts a raw TCP server that accepts connections and never replies, points `bun install` at it over `https://`, sets `BUN_CONFIG_HTTP_IDLE_TIMEOUT=3` / `BUN_CONFIG_HTTP_RETRY_COUNT=0`, and asserts the install fails with a timeout error. ``` # without this change (fail) bun install times out when the registry accepts TCP but never completes the TLS handshake [60004.48ms] ^ this test timed out after 60000ms. # with this change (pass) bun install times out when the registry accepts TCP but never completes the TLS handshake [4483.87ms] ``` `fetch-http2-client.test.ts` (58 tests) and `bun-install-retry.test.ts` still pass. Fixes oven-sh#30325 --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Signed-off-by: Sisyphus <sisyphus@ohos-bun.dev>
…en-sh#29649) Closes oven-sh#29646. ## Repro Advertise one tarball's SHA-512 in the registry manifest while serving a different tarball's bytes. The isolated-linker install hangs indefinitely (no output past "Created store"); hoisted errors out. ``` error: Integrity check failed for tarball: pkg Resolved, downloaded and extracted [4] error: IntegrityCheckFailed extracting tarball from pkg Clean lockfile: 2 packages - 2 packages in 281.7us Resolved peers [842.2us] Created store [85.1us] <hang> ``` ## Cause Partially fixed by oven-sh#29611 for HTTP 4xx/5xx responses. The extract-task failure branch in `src/install/PackageManager/runTasks.zig` was missed. When the streaming extractor sets `task.err = error.IntegrityCheckFailed` and `task.status = .fail`, the resolve-phase `runTasks` (which passes `onPackageDownloadError = {}`) fell into the void-callback `else` that only logged the error and continued — never removing the entry from `network_dedupe_map` or draining its `task_queue` list. Install-phase `Store.Installer` then called `enqueuePackageForDownload` for the failed package, hit `task_queue.found_existing == true` at `PackageManagerEnqueue.zig:258`, returned early without scheduling a new network task, and the isolated installer waited forever on `pendingTaskCount() == 0`. ## Fix Mirror the 4xx/5xx cleanup symmetrically for extract failures in `runTasks.zig`: - Drop the `network_dedupe_map` entry up front, before the callback branch, so the `Store.Installer` path (which `continue`s from the callback) is also covered. `Store.Installer.onPackageDownloadError` drains `task_queue` itself but doesn't touch the dedupe map. - In the void-callback fallback, also drain the `task_queue` entry, matching the existing HTTP-error branch. `network_dedupe_map.remove` is a no-op for `local_tarball` tasks (they never populate the map). ## Verification Extended `test/cli/install/bun-install-tarball-integrity.test.ts` with a new `describe.each(["hoisted", "isolated"])` block that builds a tarball whose advertised SHA-512 doesn't match the served bytes. Without the fix, `isolated` times out at 5s; with the fix, both linkers surface `Integrity check failed` and exit non-zero. --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
…mory' (oven-sh#29663) `test/js/bun/net/socket.test.ts` "should not leak memory" has been hitting `TCPSocket`/`TLSSocket` count `4` on the Windows 2019 agents, one past the previous `isWindows ? 3` bound. Same flake shipped with merged oven-sh#29631 today and has been blocking oven-sh#29593 through ~15 retriggers. Non-Windows bound stays at `2` — that's where real retention regressions are caught. The Windows bound is already loosened for its shared prototype/structure retention; this just accommodates the one additional retained instance the Win2019 lane is observing. ## Observed - Build [#47600 Win2019 x64](https://buildkite.com/bun/bun/builds/47600#019dbc49-ff0c-49c2-afaa-134711815456) — `Received: 4, Expected: <= 3` - Build [#47616 Win2019 x64](https://buildkite.com/bun/bun/builds/47616#019dbc99-fc61-4111-8170-080d9729866a) — same - Multiple prior PRs: [oven-sh#29631](oven-sh#29631), [oven-sh#29651](oven-sh#29651), [oven-sh#29649](oven-sh#29649), [oven-sh#29650](oven-sh#29650), [oven-sh#29645](oven-sh#29645), [oven-sh#29639](oven-sh#29639) all hit the same pattern ## Scope 3-line diff + comment. If the Win2019 retention count IS a regression worth investigating, that's a separate issue; this change is strictly about not holding other PRs hostage to the shared flake. Co-authored-by: robobun <robobun@bun.sh>
Closes #29646.
Repro
Advertise one tarball's SHA-512 in the registry manifest while serving a different tarball's bytes. The isolated-linker install hangs indefinitely (no output past "Created store"); hoisted errors out.
Cause
Partially fixed by #29611 for HTTP 4xx/5xx responses. The extract-task failure branch in
src/install/PackageManager/runTasks.zigwas missed.When the streaming extractor sets
task.err = error.IntegrityCheckFailedandtask.status = .fail, the resolve-phaserunTasks(which passesonPackageDownloadError = {}) fell into the void-callbackelsethat only logged the error and continued — never removing the entry fromnetwork_dedupe_mapor draining itstask_queuelist.Install-phase
Store.Installerthen calledenqueuePackageForDownloadfor the failed package, hittask_queue.found_existing == trueatPackageManagerEnqueue.zig:258, returned early without scheduling a new network task, and the isolated installer waited forever onpendingTaskCount() == 0.Fix
Mirror the 4xx/5xx cleanup symmetrically for extract failures in
runTasks.zig:network_dedupe_mapentry up front, before the callback branch, so theStore.Installerpath (whichcontinues from the callback) is also covered.Store.Installer.onPackageDownloadErrordrainstask_queueitself but doesn't touch the dedupe map.task_queueentry, matching the existing HTTP-error branch.network_dedupe_map.removeis a no-op forlocal_tarballtasks (they never populate the map).Verification
Extended
test/cli/install/bun-install-tarball-integrity.test.tswith a newdescribe.each(["hoisted", "isolated"])block that builds a tarball whose advertised SHA-512 doesn't match the served bytes. Without the fix,isolatedtimes out at 5s; with the fix, both linkers surfaceIntegrity check failedand exit non-zero.