Exit unsettled top-level await instead of hanging / busy-looping#29549
Exit unsettled top-level await instead of hanging / busy-looping#29549robobun wants to merge 33 commits into
Conversation
|
Updated 4:06 PM PT - May 5th, 2026
❌ @robobun, your commit ab4b3b4 has 1 failures in
🧪 To try this PR locally: bunx bun-pr 29549That installs a local version of the PR into your bun-29549 --bun |
|
Found 2 issues this PR may fix:
🤖 Generated with Claude Code |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds event-loop liveness checks and a new wait variant so awaiting a promise can exit early when the loop has no remaining handle work; integrates this into Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
This PR may be a duplicate of:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/bun.js/VirtualMachine.zig (1)
2145-2167:⚠️ Potential issue | 🟠 MajorDon't treat a still-pending preload as success.
waitForPromiseOrLoopExit()can now return withpromise.status() == .pending, butloadPreloads()still only returns the promise on.rejected. In that case the preload silently falls through and the main entry point starts loading even though the preload never finished.Suggested fix
- if (promise.status() == .rejected) - return promise; + switch (promise.status()) { + .fulfilled => {}, + else => return promise, + }Also applies to: 2169-2170
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/VirtualMachine.zig` around lines 2145 - 2167, The code assumes waitForPromiseOrLoopExit(.{ .internal = promise }) always yields resolved/rejected, but it can return while promise.status() == .pending causing loadPreloads()/pending_internal_promise to silently fall through; update the non-watcher branch to check this pending_internal_promise?.status() after eventLoop.waitForPromiseOrLoopExit and treat .pending like an error or continue waiting (e.g., loop until status != .pending or explicitly convert pending to .rejected), and mirror the same guard/handling in loadPreloads() (and the referenced lines ~2169-2170) so the main entry point never proceeds when a preload remains pending; reference functions/fields: eventLoop.waitForPromiseOrLoopExit, pending_internal_promise, loadPreloads, isWatcherEnabled, eventLoop.performGC.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/regression/issue/29546.test.ts`:
- Around line 24-29: Remove the brittle assertions that require an empty stderr
in the subprocess regression tests: delete any
expect(stripAsanWarning(stderr)).toBe("") (or expect(stderr).toBe("")) checks in
the tests that spawn bunExe() with bunEnv and instead rely on stdout and
exitCode as the regression signal; search for uses of stripAsanWarning and
remove the empty-stderr expectations (including the other occurrences in this
test file).
---
Outside diff comments:
In `@src/bun.js/VirtualMachine.zig`:
- Around line 2145-2167: The code assumes waitForPromiseOrLoopExit(.{ .internal
= promise }) always yields resolved/rejected, but it can return while
promise.status() == .pending causing loadPreloads()/pending_internal_promise to
silently fall through; update the non-watcher branch to check this
pending_internal_promise?.status() after eventLoop.waitForPromiseOrLoopExit and
treat .pending like an error or continue waiting (e.g., loop until status !=
.pending or explicitly convert pending to .rejected), and mirror the same
guard/handling in loadPreloads() (and the referenced lines ~2169-2170) so the
main entry point never proceeds when a preload remains pending; reference
functions/fields: eventLoop.waitForPromiseOrLoopExit, pending_internal_promise,
loadPreloads, isWatcherEnabled, eventLoop.performGC.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 808f2194-8307-4e09-ba3f-9936db83c37b
📒 Files selected for processing (3)
src/bun.js/VirtualMachine.zigsrc/bun.js/event_loop.zigtest/regression/issue/29546.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/regression/issue/29546.test.ts`:
- Around line 4-25: Remove the long file-level bug-history block at the top of
test/regression/issue/29546.test.ts and replace it with only the issue URL and a
one-line summary; keep the per-test rationale comments that explain why the
timings/fixtures detect the regression. Concretely, trim out the multi-paragraph
explanation referencing loadEntryPoint, EventLoop.waitForPromiseOrLoopExit,
waitForPromise and Expect.toThrow so the header contains just the issue link and
a concise summary, while leaving any inline test comments that describe test
design and timing choices intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 25b27b14-7959-457b-b514-43cc96534d8e
📒 Files selected for processing (1)
test/regression/issue/29546.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/event_loop.zig`:
- Around line 591-600: The hasAnyHandleWork function is checking
vm.event_loop_handle.?.isActive() which on Windows references the libuv loop
instead of the uSockets loop; update that check to use
this.usocketsLoop().isActive() (preserving the optional/null-safety semantics
used elsewhere) so the early-exit condition used by waitForPromiseOrLoopExit
correctly observes uws I/O activity; locate hasAnyHandleWork and replace the
vm.event_loop_handle.?.isActive() term with an equivalent
this.usocketsLoop().isActive() call consistent with
autoTick()/tickPossiblyForever()/autoTickActive().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8b9bb2d9-017c-4b39-ba48-99d5c77dd809
📒 Files selected for processing (1)
src/bun.js/event_loop.zig
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/bun.js/event_loop.zig (1)
591-600:⚠️ Potential issue | 🟠 MajorUse
this.usocketsLoop().isActive()instead ofvm.event_loop_handle.?.isActive()on Windows.On Windows,
vm.event_loop_handleis the libuv loop whileusocketsLoop()returnsthis.uws_loop(uSockets). Other functions in this file (autoTick(),tickPossiblyForever(),autoTickActive()) useusocketsLoop()for cross-platform consistency.Checking the wrong loop could cause the early-exit condition in
waitForPromiseOrLoopExit()to trigger incorrectly on Windows when there's still pending work in the uws I/O loop.🔧 Suggested fix
pub fn hasAnyHandleWork(this: *EventLoop) bool { const vm = this.virtual_machine; - return vm.event_loop_handle.?.isActive() or + return this.usocketsLoop().isActive() or vm.active_tasks > 0 or this.tasks.count > 0 or this.hasPendingRefs() or🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/event_loop.zig` around lines 591 - 600, Replace the libuv loop check with the uSockets loop check in hasAnyHandleWork: currently it calls vm.event_loop_handle.?.isActive(), which on Windows inspects the wrong loop; change it to use this.usocketsLoop().isActive() so the function (hasAnyHandleWork) correctly reflects uSockets activity (matching other functions like autoTick, tickPossiblyForever, autoTickActive) and avoids premature exit in waitForPromiseOrLoopExit().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/bun.js/event_loop.zig`:
- Around line 591-600: Replace the libuv loop check with the uSockets loop check
in hasAnyHandleWork: currently it calls vm.event_loop_handle.?.isActive(), which
on Windows inspects the wrong loop; change it to use
this.usocketsLoop().isActive() so the function (hasAnyHandleWork) correctly
reflects uSockets activity (matching other functions like autoTick,
tickPossiblyForever, autoTickActive) and avoids premature exit in
waitForPromiseOrLoopExit().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e8fba4f3-bb15-4626-9123-135496f56c50
📒 Files selected for processing (2)
src/bun.js/event_loop.zigtest/regression/issue/29546.test.ts
There was a problem hiding this comment.
The fix logic is sound and all previously raised concerns have been addressed, but the event-loop liveness semantics (especially on Windows where is libuv while is uSockets) are subtle enough that a human familiar with Bun's Windows port should give this a final pass before merge.
Extended reasoning...
Overview
PR adds and to , then routes (both watcher and non-watcher branches) through the new function. Also adds between and the liveness check in the watcher branch to prevent silent microtask loss. A regression test file with 5 targeted scenarios is included.
Security Risks
None — this is pure event-loop scheduling logic with no auth, crypto, or I/O boundary changes.
Level of Scrutiny
High. This touches the event loop tick loop, which is among the most critical code paths in the runtime. Correctness failures here manifest as silent data loss (TLA continuations dropped, module init skipped) rather than hard crashes, making them difficult to detect in CI.
Other Factors
All inline concerns raised in previous review passes are marked resolved: the / bug was fixed via the new , the worker TLA change was correctly reverted, is checked, the microtask drain is present, and assertion ordering in tests was fixed. The one remaining open question — whether (libuv on Windows) is the right liveness proxy vs (uSockets) — was raised by CodeRabbit and marked resolved without a visible code change. That distinction is harmless on POSIX (same object) but could produce an incorrect early-exit on Windows if a uSockets handle is active while libuv is not. The scope decision to leave and on the old path is explicitly documented and reasonable.
There was a problem hiding this comment.
All previously identified issues have been addressed, but this PR introduces novel event loop control flow in critical infrastructure — a human should verify the final state.
Extended reasoning...
Overview
This PR adds two new event loop helpers ( and ) to fix POSIX busy-looping at 100% CPU and Windows infinite hangs when top-level await is held only by an unref'd timer. The changes touch and , with a regression test file added.
Security risks
None. This is event loop scheduling logic with no auth, crypto, or permission implications.
Level of scrutiny
High. The event loop is among the most critical code paths in Bun. The fix introduces a new liveness predicate () and a new wait variant (), both with subtle correctness requirements: they must be conservative (never exit while real work is pending) while also being aggressive enough to exit when only unref'd work remains. Multiple real bugs were found and addressed during this review cycle, indicating the code needed careful iteration.
Other factors
All of my previous inline bug reports (7 in total) were addressed in commit 36da1f7 — fixing the short-circuit issue, adding to the liveness check, adding microtask drains in both the non-watcher and watcher branches, and reverting the unneeded worker change. The test suite is comprehensive with 5 tests covering the regression, the keep-alive guard, never-resolving TLA, unhandled-rejection-during-wait, and the subtle unhandledRejection-handler-resolves-TLA case. No new bugs were surfaced by the bug hunting system. The PR is well-reasoned and in good shape; a human reviewer should verify the final implementation before merge.
There was a problem hiding this comment.
All earlier review feedback has been addressed and I didn't find anything new this round, but this changes the core TLA wait/exit semantics in the event loop (and the prior attempt #27215 regressed), so it's worth a human pass before merge.
Extended reasoning...
Overview
This PR adds EventLoop.hasAnyHandleWork() and EventLoop.waitForPromiseOrLoopExit() in src/bun.js/event_loop.zig, and routes loadEntryPoint (both watcher and non-watcher branches) in src/bun.js/VirtualMachine.zig through the new wait variant. The net effect is that an unsettled top-level await with no ref'd handles now exits cleanly instead of busy-looping (POSIX) or hanging (Windows). A new regression test file with five subprocess tests pins the behavior.
Over six prior review rounds I flagged: (1) the original isEventLoopAlive() predicate short-circuiting on unhandled_error_counter, (2) an unintended change to waitForPromiseWithTermination, (3) missing concurrent_tasks in the liveness check, (4) a missing microtask drain between autoTick() and the liveness break, and (5) the same drain missing from the watcher branch. All of these have been addressed in the current diff, and the bug-hunting pass on the latest revision found nothing new.
Security risks
None. This is purely event-loop scheduling/exit logic; no auth, crypto, parsing, or untrusted-input surfaces are touched.
Level of scrutiny
High. event_loop.zig and VirtualMachine.loadEntryPoint are on the hot path for every bun run / bun foo.ts invocation, and this PR changes when the process exits — a user-visible semantic. The previous attempt at this fix (#27215) caused test-runner regressions, and this PR went through several correctness iterations before stabilizing. The liveness predicate aggregates seven signals across two platforms; subtle mis-ordering or a missed signal can silently drop work. That warrants a maintainer's eyes, not an auto-approve.
Other factors
waitForPromiseitself is deliberately unchanged, andloadEntryPointForTestRunner/loadPreloadskeep the old contract (the PR description scopes the test-runner case as a separate bug). That scoping decision is reasonable but worth a maintainer ack.- CodeRabbit raised
usocketsLoop()vsevent_loop_handleon Windows; the thread is marked resolved but the code still usesevent_loop_handle(matchingisEventLoopAliveExcludingImmediates). Likely correct, but a Windows-familiar reviewer should confirm. - CI shows two failures (
in-process-cron,bun-install) that the latest commit message attributes to pre-existing flakes; they don't appear related to this change. - The microtask drain is now gated behind
\!hasAnyHandleWork()(only runs on the about-to-break iteration), which keeps the steady-state loop identical towaitForPromise— a nice property for perf, but also means the drain only covers the one known edge case rather than every iteration.
An unref'd Bun timer (e.g. AbortSignal.timeout()) awaited at the top
level with no ref'd handle keeping the loop alive hit two distinct
event-loop bugs:
POSIX: autoTick() took the tickWithoutIdle() branch (zero-timeout
poll) and drainTimers() fired the timer eventually, but only
after a tight busy loop burning ~100% CPU for the full wait.
Windows: tickWithoutIdle -> us_loop_pump -> uv_run(UV_RUN_NOWAIT).
uv_run early-returns when uv__loop_alive() is false, so the
unref'd uv timer that would drain Bun's heap never fired.
The process hung forever.
Add EventLoop.waitForPromiseOrLoopExit: like waitForPromise, but bails
when isEventLoopAlive() goes false. Route the three TLA entry paths
(loadEntryPoint, loadEntryPointForTestRunner, loadPreloads) plus the
watcher variants through it, and give waitForPromiseWithTermination the
same break so workers with unsettled TLA exit cleanly too.
waitForPromise itself is unchanged — its contract ('returns with the
promise resolved') is load-bearing for callers like Expect's toThrow.
Matches Node.js, which also exits an unsettled top-level await without
waiting on unref'd handles.
Fixes #29546
Address two review findings: 1. (coderabbit, major) loadPreloads + loadEntryPointForTestRunner could now return with pending_internal_promise, silently advancing to the next preload / reporting 0 tests instead of waiting. Revert those paths — preloads/testrunner keep the old 'wait until resolved' contract. Only loadEntryPoint (the 'bun run <script>' path) gets the loop-exit escape. 2. (claude, critical) isEventLoopAlive() short-circuits on unhandled_error_counter != 0. A side-path unhandled rejection during a TLA wait would cause waitForPromiseOrLoopExit to bail while the actual work (setTimeout, fetch, etc.) was still in flight — the continuation after 'await' would never run. Replace the check with a new EventLoop.hasAnyHandleWork() helper that only considers handle liveness, not error state. waitForPromise callers (Expect, bundler, html_rewriter, macros, REPL) are untouched — their 'returns with promise resolved' contract is still load-bearing. Also revert the waitForPromiseWithTermination break from the prior commit; workers aren't exercising the reporter's bug. Adds a regression test pinning the mid-TLA-unhandled-rejection behavior so a future someone doesn't reintroduce the short-circuit.
The fix function is EventLoop.waitForPromiseOrLoopExit; loadEntryPoint just calls it. waitForPromise itself is unchanged.
A hang regression would produce exitCode=143 (test-timeout SIGTERM) and elapsed>=30_000. Asserting exit code first short-circuits and the diff reads as "expected 0, received 143" — technically a failure but useless for spotting the actual regression (the timer). Swap the order on the two timing-sensitive tests so the elapsed check fires first.
Closes a narrow race where a thread pushes to concurrent_tasks without a corresponding ref (e.g. hot_reloader.zig:278, where the Watcher thread has no KeepAlive of its own and relies on the loop's forever_timer for liveness). Between tick() draining concurrent_tasks and the check, a fresh push could be missed if all other liveness signals happened to be zero. concurrent_tasks.isEmpty() is a single atomic acquire-load on the back pointer (unbounded_queue.zig:165), safe to call from the JS thread without locking. Changes the hasAnyHandleWork receiver from *const to *EventLoop because isEmpty takes *Self.
autoTick ends with handleRejectedPromises(), which runs user JS. In
.bun unhandled-rejection mode with a registered handler, the path
through Bun__handleUnhandledRejection returns WITHOUT a 'defer
drainMicrotasks' (unlike .none / .warn / .strict). If that handler
resolves the TLA-awaited promise, the continuation is queued as a JSC
microtask that hasn't run yet when autoTick returns.
hasAnyHandleWork() inspects uv/uws handles, task queues, refs, and
immediates — none of which see JSC microtasks. Without a drain between
autoTick and the liveness check, the loop breaks with the promise
pending, silently dropping the continuation.
Added a regression test that pins this: a TLA awaits a promise that is
only resolved via process.on('unhandledRejection') firing inside
autoTick. Without the drain it hits the break and exits with empty
stdout; with the drain it prints 'got=handled' and exits 0.
Mirrors the fix in 71dfe8b for the non-watcher path. autoTick's handleRejectedPromises() can run a user process.on('unhandledRejection') handler that resolves the TLA promise and queues a continuation microtask; hasAnyHandleWork inspects handles/tasks/refs/immediates but not JSC microtasks, so the break could fire before the continuation runs. In practice the outer bun.js.zig watcher loop (while isEventLoopAlive()) drains microtasks after this returns, so the continuation survives even without the drain here. But the symmetry with waitForPromiseOrLoopExit is worth preserving — every caller of hasAnyHandleWork now drains first, so the assumption 'if the check says no work, there is genuinely no pending JS to run' holds at the call site rather than relying on downstream machinery.
…ers progress
The in-tick HotReloadTask path drains microtasks via tick()'s
else-branch after the task returns; the new module's loader chain
(fetch → link → evaluate) runs inline and the module body is
evaluated before reload() returns. Callers *outside* tick() don't get
that drain:
- reportExceptionInHotReloadedModuleIfNeeded()'s stranded-flag
consumer (5215594)
- the pre-existing bun.js.zig initial-load rejection path
They would return to the main loop with the new module still unset,
tickPossiblyForever() would then block in loop.tick() before
this.tick() ever ran, and the reload would only take effect on the
next loop wakeup (another save, or forever_timer firing in up to 4
minutes). User-visible symptom: observably identical to 'save silently
dropped, recovers on next save' for the two narrow stranded-flag
triggers 5215594 was meant to close.
Drain once after reloadEntryPoint() returns so every call site —
in-tick, reportException…, bun.js.zig — gets identical semantics.
There was a problem hiding this comment.
No issues found in this revision, but this reworks core event-loop liveness + TLA exit semantics and the --hot reload state machine (with Windows-specific direct reads of libuv internals), so it needs a human sign-off — also worth confirming the Windows test-http-should-emit-close-when-connection-is-aborted.ts timeouts in Build #51503 are pre-existing flakes and not related to the loop changes here.
Extended reasoning...
Overview
Touches three files: src/jsc/event_loop.zig (adds hasAnyHandleWork, hasAnyHandleWorkIgnoringForeverTimer, waitForPromiseOrLoopExit), src/jsc/VirtualMachine.zig (routes loadEntryPoint's TLA wait through the new helper, reworks reload() and reportExceptionInHotReloadedModuleIfNeeded for the abandoned-TLA case under --hot, adds a post-reloadEntryPoint microtask drain), and a new 256-line regression test. The change alters when a Bun process decides to exit vs. keep waiting on a top-level await — core process-lifecycle semantics.
Security risks
None identified. No auth/crypto/permissions surface; the change is internal event-loop scheduling. The behavioral change (exit instead of wait on unref'd handles) matches Node.js semantics and is not a security boundary.
Level of scrutiny
High. This is production-critical runtime code: the event loop's liveness predicate and the entry-point promise-wait loop affect every bun run / bun -e invocation. The Windows branch of hasAnyHandleWorkIgnoringForeverTimer reads libuv internal struct fields (active_handles, active_reqs.count, pending_reqs_tail, endgame_handles) directly and applies a heuristic subtraction for forever_timer. The --hot reload changes alter when the module registry is torn down vs. deferred. A prior attempt at this same fix (#27215) caused test-runner regressions, and this PR went through ~15 rounds of automated review correcting subtle issues (the unhandled_error_counter short-circuit, worker-TLA revert, Windows forever_timer masking, zombie-continuation ordering, stranded hot_reload_deferred, out-of-tick() microtask drain). All of those are now resolved, but the iteration count itself signals the subtlety.
Other factors
- All prior inline review comments are marked resolved and the latest commit (e86f538) addresses the last one.
- Build #51503 (one commit back, 5215594) shows
test-http-should-emit-close-when-connection-is-aborted.tstiming out on three Windows lanes plus a Windows build-agent infra failure.loadEntryPointForTestRunnerwas deliberately left unchanged so this is plausibly a pre-existing flake, but given the PR does change Windows event-loop liveness it's worth a human confirming before merge. - Good test coverage for the specific scenarios fixed (six regression tests including a
--hotround-trip test), and the PR description documents re-running the related TLA/timer regression suites.
This is not a candidate for auto-approval — it's a substantive runtime behavior change that warrants maintainer review.
There was a problem hiding this comment.
No bugs found in the current revision, but this reworks core event-loop liveness + the --hot reload deferral path (including Windows-specific reads of libuv loop internals) across enough subtle iterations that a human pass on the final shape is warranted.
Extended reasoning...
Overview
This PR changes how Bun handles unsettled top-level await in the main entry point, plus a substantial rework of the --hot reload deferral path that fell out of it. Touched surfaces:
src/jsc/event_loop.zig: newhasAnyHandleWork(),hasAnyHandleWorkIgnoringForeverTimer()(Windows branch readsuv_loop_tinternal fields directly:active_handles,active_reqs.count,pending_reqs_tail,endgame_handles), andwaitForPromiseOrLoopExit().src/jsc/VirtualMachine.zig:loadEntryPointwatcher-loop break,reload().pendingarm restructured (drain → re-switch → liveness gate), a new.pendingconsumer inreportExceptionInHotReloadedModuleIfNeeded, and a post-reloadEntryPointmicrotask drain.- New regression test with 6 cases including a
--hotround-trip test.
Security risks
None identified — this is scheduling/liveness logic with no auth, parsing, or external-input surface.
Level of scrutiny
High. This is the core event-loop tick path and the --hot reload state machine — code that runs on every bun run and every file save under --hot. The PR went through ~14 commits responding to a long chain of subtle ordering/race findings (unhandled-error-counter short-circuit, concurrent_tasks race, Windows forever_timer keeping uv_loop_alive true, microtask-drain placement relative to liveness checks, stranded hot_reload_deferred under coalesced watcher events). Each individual fix looks correct in isolation, but the cumulative interaction between reload(), reportExceptionInHotReloadedModuleIfNeeded(), loadEntryPoint's watcher loop, and tickPossiblyForever() is now intricate enough that a maintainer should sign off on the final shape — particularly the Windows branch that subtracts forever_timer from raw uv_loop_t.active_handles and the decision to drain microtasks at the tail of reload() for all call sites.
Other factors
- All prior inline findings are addressed and resolved; no open bugs in the current revision.
- Good regression coverage (6 tests pinning the TLA exit, the unhandled-rejection edge, and the
--hotreload-after-abandoned-TLA path), though the most contorted edge cases (coalesced watcher events, ref'd→abandoned transition) are not directly tested. - A prior attempt at this same fix (#27215) was reverted for test-runner regressions, which is why this version deliberately leaves
waitForPromise/loadEntryPointForTestRunner/loadPreloadsuntouched — that scoping decision itself is worth a human nod.
630731d to
52a136f
Compare
Hot reload 'should work with sourcemap generation' flakes at ~5% on windows-11-aarch64 under qemu: the 10s per-test timeout fires before 50 hot-reload cycles complete, independent of this PR. Cross-PR evidence from the last 24h: - #28645 retriggered 3 times for the same test - #29945, #30018, #30026 all retriggered - #29587 build 51335: documented as 'known hot-reload flake' - #29735 open: test-level workaround in progress - #29740 merged: runtime fix for a related --hot sourcemap race Local repro: identical ~5% rate at my tip (e86f538), at 5215594 (before the stranded-defer fix), and at 0ce2096 (before the forever-timer carve-out). Rate unchanged across the three, confirming the flake is not in this PR's footprint. Same persistent test-http-should-emit-close flake on 2019 x64-baseline (seen in 51163, 51423, 51448, 51473, 51503, 51528, 51604).
There was a problem hiding this comment.
No new issues from the bug hunter on this revision — but this rewires core event-loop liveness (TLA exit semantics, reload() deferral, platform-specific libuv internals on Windows), so it warrants a human pass before merge.
Extended reasoning...
Overview
This PR changes how Bun handles unsettled top-level await in the main entry point. It touches three areas:
src/jsc/event_loop.zig: addshasAnyHandleWork(),hasAnyHandleWorkIgnoringForeverTimer()(which on Windows reads raw libuv loop fields —active_handles,active_reqs.count,pending_reqs_tail,endgame_handles— and subtracts theforever_timercontribution), andwaitForPromiseOrLoopExit().src/jsc/VirtualMachine.zig: routesloadEntryPoint's TLA wait through the new helper (both watcher and non-watcher branches), rewritesreload()'s.pendingdeferral to distinguish abandoned TLAs from ref'd-I/O TLAs, adds a stranded-flag consumer inreportExceptionInHotReloadedModuleIfNeeded, and adds a post-reloadEntryPointmicrotask drain.test/regression/issue/29546.test.ts: six regression tests covering the exit, the keep-alive guard, theisEventLoopAlivevshasAnyHandleWorkdistinction, theunhandledRejection-handler microtask race, and--hotreload after an abandoned TLA.
Security risks
None identified. This is process-lifetime / event-loop scheduling logic; no auth, crypto, parsing of untrusted input, or permissions surface is touched.
Level of scrutiny
High. This is production-critical runtime code that determines when bun run exits and how --hot decides to reload. The change history on this PR (15+ iterations, each fixing a subtle race or platform divergence found in the previous one) demonstrates exactly how delicate the liveness predicate is. The Windows branch of hasAnyHandleWorkIgnoringForeverTimer reaches directly into libuv's uv_loop_t struct layout and applies a heuristic subtraction — that alone deserves a maintainer's eyes. The reload() rewrite changes when the module registry is torn down relative to in-flight TLA continuations, which is a meaningful semantic shift for --hot.
Other factors
- All prior inline review comments (mine, CodeRabbit's) are now resolved; the accidental-revert commit (630731d) flagged in my last review has been removed from history and the source changes are restored.
- The bug-hunting system found no new issues on this revision.
- CI on the latest commit shows only infrastructure failures (BuildKite agent creation / fleet issues), not test failures — but that also means the latest revision hasn't yet had a clean full-matrix run, including the Windows lanes that exercise the libuv-specific path.
- A prior attempt at this fix (#27215) caused test-runner regressions; this PR carefully scopes the new behavior to
loadEntryPointonly, but the interaction surface (hot-reload deferral,forever_timer, microtask draining order) is broad enough that a human should sign off on the design.
This is well outside the "simple, mechanical, obvious" bar for auto-approval.
…0263/30266/30267)
The stranded-`hot_reload_deferred` consumer at the `.pending` arm calls `reload()`, whose in-`reload()` microtask drain now settles the NEW `pending_internal_promise`. But `var promise` at the top of the function was captured from the OLD `pending_internal_promise` — if the new module body synchronously rejects (top-level `throw`, transpile error), the fresh `.rejected` is missed because the outer `switch` dispatches on the stale capture and the bare `return` at the end of the `.pending` arm fires before the `.rejected` arm can run. `tickPossiblyForever` then blocks in `loop.tick()` before the next main-loop iteration re-fetches, so the rejection sits unreported until another save or `forever_timer`. Tail-recurse after the reload so the next frame re-reads `this.pending_internal_promise` at its own line 720 and dispatches on the fresh status. One-line fix, zero risk (the recursion can only go one level deeper before the flag consumer guard stops it).
There was a problem hiding this comment.
No remaining issues from my side — every prior comment has been addressed and the latest commit picks up the tail-recurse fix. I'm not auto-approving only because this touches the core event-loop liveness predicate, loadEntryPoint, and the --hot reload state machine (with Windows-specific raw uv-loop field reads), which feels like it warrants a human sign-off.
Extended reasoning...
Overview
This PR changes how Bun handles an unsettled top-level await in the main entry point so the process exits cleanly (matching Node.js) instead of busy-spinning on POSIX or hanging forever on Windows. It touches three files:
src/jsc/event_loop.zig: addshasAnyHandleWork(),hasAnyHandleWorkIgnoringForeverTimer()(Windows reads rawuv_loop_tfields:active_handles,active_reqs.count,pending_reqs_tail,endgame_handles), andwaitForPromiseOrLoopExit().src/jsc/VirtualMachine.zig: routesloadEntryPoint's TLA wait through the new variant; reworksreload()'s.pendingarm to distinguish abandoned TLAs from in-flight ref'd I/O; adds a stranded-hot_reload_deferredconsumer inreportExceptionInHotReloadedModuleIfNeededwith a tail-recurse to re-read the fresh promise; adds a post-reloadEntryPointmicrotask drain.test/regression/issue/29546.test.ts: six regression tests including a--hotwatcher round-trip test.
Security risks
None identified. No auth, crypto, permissions, or untrusted-input parsing is involved; the change is purely event-loop scheduling/liveness logic.
Level of scrutiny
High. loadEntryPoint and EventLoop.autoTick/waitForPromise are on the hot path of every bun run / bun -e invocation, and the --hot reload state machine has subtle ordering invariants (defer vs. proceed, microtask drain timing, Windows forever_timer ref accounting). The PR went through ~15 iterative review rounds where real correctness issues were found and fixed at each step (unhandled-error-counter short-circuit, concurrent_tasks race, microtask-drain-before-liveness ordering, Windows forever_timer keeping uv_loop_alive permanently true, zombie-continuation regression in reload(), stranded-defer triggers, accidental source revert in a "ci: retrigger" commit). All of those are now resolved and pinned by tests, but the density of subtle interactions argues for a maintainer's eyes on the final shape rather than bot-only approval.
Other factors
- All 15+ prior inline comments (mine and CodeRabbit's) are marked resolved; the most recent commit (360dfa4) directly implements my last suggestion (tail-recurse after
reload()). - The bug-hunting system found no issues on this revision.
- Test coverage is good: the six new tests pin the primary fix, the regression guard, the
isEventLoopAlivevshasAnyHandleWorkdistinction, the microtask-drain-before-break, and the--hotreload-after-abandoned-TLA path. - This is not a mechanical/config change — it's a semantic change to runtime exit behavior and the hot-reload defer/proceed decision, which is exactly the kind of design surface a human should ratify.
Build 51653 had 13 jobs expire due to BuildKite agent outage (Expired status, not test failures). Plus the usual persistent flakes: - test-http-should-emit-close (Windows, all 3 lanes) - hot.test.ts sourcemap generation (documented cross-PR flake) - jsc-stress mprotect failed: 487 - bun-install-registry hoisting peers - s3-storage-class UnknownError - node-http-uaf EADDRINUSE (port 0 race) - spawn.test.ts timeout None of these are caused by this PR — the core fix (loadEntryPoint TLA exit via waitForPromiseOrLoopExit, reload() defer/proceed gate via hasAnyHandleWorkIgnoringForeverTimer) is unchanged and all its regression tests pass.
…30217) Build 51679: only ERROR is test-http-should-emit-close, the same Windows flake that's been hitting every build on this PR since 51163. Scan of 80 recent buildkite builds found this test failing on 67 unrelated PRs with identical signature (timeout on Windows only — 11 aarch64, 2019 x64, 2019 x64-baseline). Multiple commits on main explicitly document it as '~100% flake on main', not caused by any single PR: - 5807f25, 30ba155, 087b80c, 9367f69, c53835a, 3eb1f30 Dedicated fix attempt in #30217 identified the root cause as a Windows AFD-poll race in HTTP socket close (NodeHTTPResponse.doPause TODO), unrelated to TLA runtime. #30217 was closed; flake persists. All other CI issues on this PR are retried-and-passed flakes (jsc-stress, AsyncLocalStorage, bun-install-registry).
There was a problem hiding this comment.
No new issues found on this pass, but this rewires core event-loop liveness (TLA exit semantics, --hot reload's .pending deferral, Windows-specific uv handle accounting) — worth a human sign-off given the blast radius and the number of iterations it took to converge.
Extended reasoning...
Overview
This PR changes how Bun handles unsettled top-level await in the main entry point. It adds three new EventLoop helpers (hasAnyHandleWork, hasAnyHandleWorkIgnoringForeverTimer, waitForPromiseOrLoopExit) in src/jsc/event_loop.zig, and routes loadEntryPoint through them instead of the blocking waitForPromise. It also substantially reworks the --hot reload path in src/jsc/VirtualMachine.zig: reload()'s .pending arm now drains microtasks and consults a forever-timer-aware liveness check before deferring, reportExceptionInHotReloadedModuleIfNeeded gains a stranded-flag consumer with tail-recursion, and reload() drains microtasks after reloadEntryPoint for out-of-tick() callers. A 256-line regression test file covers six scenarios including a --hot round-trip test.
Security risks
None identified. No auth, crypto, permissions, or untrusted-input parsing is touched. The change is purely about event-loop scheduling and process-exit timing.
Level of scrutiny
High. This is the core runtime event loop — the code that decides whether a Bun process keeps running or exits. The change has platform-divergent behavior (POSIX uws active counter vs. Windows uv_loop_alive / active_handles / active_reqs), interacts with hot-reload state machines, and introduces a new exit state (loadEntryPoint returning with a still-.pending promise) that downstream code was not originally designed for. The Windows hasAnyHandleWorkIgnoringForeverTimer heuristic directly reads libuv internals (active_handles, active_reqs.count, pending_reqs_tail, endgame_handles) and subtracts an assumed contribution — the in-code comment explicitly calls it a heuristic.
Other factors
The PR went through roughly twenty review iterations over two weeks, with multiple substantive bugs found and fixed along the way (using isEventLoopAlive instead of raw handle liveness, missing concurrent_tasks check, missing microtask drain before the liveness break, the forever_timer Windows issue, the ref'd-vs-abandoned TLA distinction in reload(), an accidental full revert in a "ci: retrigger" commit, etc.). All inline comments are resolved except one trivial comment-wording nit about a hardcoded line number. An earlier attempt at this same fix (#27215) caused test-runner regressions, which is why the scope was carefully narrowed to loadEntryPoint only. Given the subtlety, the platform-specific reasoning, and the history, this should get human eyes before merge.
Hardcoded line references drift the moment anything is inserted above the function, and VirtualMachine.zig is 2500+ lines and actively churning. Naming the variable is stable under refactor and equivalent in clarity.
There was a problem hiding this comment.
I didn't find any new issues, but this changes core event-loop liveness semantics (waitForPromiseOrLoopExit, hasAnyHandleWork*) plus the --hot reload state machine, with platform-specific libuv internals on Windows — worth a human sign-off given how many subtle interactions surfaced during iteration.
Extended reasoning...
Overview
This PR changes how Bun's main entry point waits on a top-level await: instead of spinning/hanging when only unref'd work is pending, loadEntryPoint now routes through a new EventLoop.waitForPromiseOrLoopExit that breaks once hasAnyHandleWork() reports nothing left to make progress. It also adds hasAnyHandleWorkIgnoringForeverTimer() (Windows-specific carve-out reading raw uv_loop.active_handles / active_reqs.count / pending_reqs_tail / endgame_handles), reworks reload()'s .pending deferral and reportExceptionInHotReloadedModuleIfNeeded() so --hot recovers after an abandoned TLA, and adds a post-reloadEntryPoint microtask drain. Six regression tests cover the main fix plus the unhandledRejection/--hot edge cases.
Security risks
None identified. This is event-loop control-flow / process-liveness logic; no auth, crypto, parsing of untrusted input, or permission surfaces are touched.
Level of scrutiny
High. src/jsc/event_loop.zig and src/jsc/VirtualMachine.zig are core runtime — the predicate that decides whether the process exits, and the --hot reload state machine. The Windows branch of hasAnyHandleWorkIgnoringForeverTimer reads libuv internals directly and subtracts a heuristic forever_timer contribution. The PR went through ~20 review iterations in which the bug-hunting system found several real issues (the isEventLoopAlive short-circuit on unhandled_error_counter, missing concurrent_tasks check, missing microtask drain before the liveness break, the Windows forever_timer ref'd-handle problem, the zombie-continuation regression in reload()'s .pending fall-through) — all addressed, but the density of subtle interactions argues for a maintainer who knows the event-loop invariants to sign off.
Other factors
- An earlier attempt at this fix (#27215) caused test-runner regressions; this PR explicitly scopes the new behavior to
loadEntryPointonly and leaveswaitForPromise/loadEntryPointForTestRunner/waitForPromiseWithTerminationunchanged, which is the right call but means a maintainer should confirm the scoping. - One open inline nit remains (the
beforeExitfollow-up at event_loop.zig:700) — explicitly labeled non-blocking / out-of-scope. - CI build #51693 has
test-http-should-emit-close-when-connection-is-aborted.tstiming out on three Windows lanes; commit messages reference this as a known cross-PR flake (#30217), but a human should confirm it's unrelated before merge. - This is a user-visible behavior change (unsettled TLA now exits instead of waiting), aligning with Node.js but still a semantic shift.
…mplete
Before this PR, an unsettled top-level await hung forever inside
`waitForPromise` and never reached `onBeforeExit`. With the new
`waitForPromiseOrLoopExit` bail-out, an abandoned TLA now flows into
`onBeforeExit`, which emits `process.on('beforeExit', …)`. A handler
that synchronously resolves the TLA queues the continuation as a JSC
microtask — but `isEventLoopAlive()` checks handles/tasks/refs, not
microtasks, so the inner `while` below would skip and the
continuation would be silently dropped before process exit.
Drain microtasks after `dispatchOnBeforeExit` (and after the
re-dispatch inside the outer loop, for symmetry) so a resolving
beforeExit handler's continuation runs before liveness is rechecked.
Same pattern the new `waitForPromiseOrLoopExit` uses for
`unhandledRejection` handlers — this is the sibling fix one frame
later in the exit sequence.
Test: `beforeExit handler that resolves the TLA runs to completion`
pins it. Without the drain the test fails (got=done is never printed);
with it, the continuation runs and stdout matches.
| /// | ||
| /// `concurrent_tasks.isEmpty()` is a single atomic acquire-load; it's safe | ||
| /// from the JS thread without locking. Including it closes a narrow race | ||
| /// where a thread (e.g. the hot-reloader watcher at `hot_reloader.zig:278`) |
There was a problem hiding this comment.
🟡 nit: the doc-comment on hasAnyHandleWork() references "the hot-reloader watcher at hot_reloader.zig:278" — a hardcoded cross-file line number that will drift on any edit above that line. This is the same drift hazard you already accepted and fixed in this PR for the "line 720" reference in VirtualMachine.zig (commit 93e2313). For consistency, replace with a stable symbolic reference like "in HotReloader.Task.enqueue()".
Extended reasoning...
What the issue is. The new doc-comment on hasAnyHandleWork() (src/jsc/event_loop.zig:591) reads: "Including it closes a narrow race where a thread (e.g. the hot-reloader watcher at hot_reloader.zig:278) pushes to the concurrent queue without bumping concurrent_ref…". The hot_reloader.zig:278 reference is a hardcoded cross-file line number embedded in a doc-comment. Today line 278 is that.reloader.enqueueTaskConcurrent(&that.concurrent_task); inside HotReloader.Task.enqueue(), so the reference is accurate — but it will silently go stale the moment anything is inserted or removed above that line in hot_reloader.zig.
Why this is worth a one-line fix in this PR. This is exactly the same drift hazard that was already flagged and fixed in this very PR. Resolved inline-comment 3187088494 pointed out that "the stale capture at line 720 above" in VirtualMachine.zig would rot, and commit 93e2313 ("comment: refer to the local promise by name, not line number") changed it to "the stale local promise above". The author has demonstrated agreement that hardcoded line numbers in comments are a maintenance hazard worth fixing. Consistency within the same PR argues for the same treatment here — and a cross-file line reference is strictly more brittle than a same-function one, since hot_reloader.zig churns independently of event_loop.zig and there is no "above" hint to fall back on.
Why existing conventions don't excuse it. A grep of src/jsc/*.zig shows this is the only file.zig:NNN cross-file doc-comment reference in the directory, so it is not an established convention being followed — it's a one-off introduced by this PR.
Step-by-step proof of the drift.
- Today:
hot_reloader.zig:278isthat.reloader.enqueueTaskConcurrent(&that.concurrent_task);insideTask.enqueue(). The doc-comment inevent_loop.zig:591says "athot_reloader.zig:278". Correct. - A future PR adds a single field, import, or helper function anywhere in
hot_reloader.ziglines 1–277. Every line below shifts down by 1. hot_reloader.zig:278is now whatever the old line 277 was — perhapsthat.next = null;or a closing brace. The doc-comment inevent_loop.zigstill says "athot_reloader.zig:278".- A reader following the comment opens
hot_reloader.zigto line 278, finds something unrelated toenqueueTaskConcurrent, and is left guessing what the "e.g." was supposed to point at. - No compiler, linter, or formatter will ever catch this. The comment is never updated because nothing enforces it.
Impact. Purely cosmetic — no runtime effect whatsoever. The reference lives inside an "e.g." parenthetical, so it isn't load-bearing for understanding the function's contract. This should not block the PR.
How to fix. Replace "the hot-reloader watcher at hot_reloader.zig:278" with "the hot-reloader watcher in HotReloader.Task.enqueue()" (or just "the hot-reloader watcher's enqueueTaskConcurrent"). The function name is a stable symbolic anchor that survives line-number churn, and if it's ever renamed, grep/IDE will surface every reference. One-word edit, zero risk.
…tions 207afb3 tried to fix a nit claude-bot explicitly labeled non-blocking / out-of-scope: that `onBeforeExit` doesn't drain microtasks after `dispatchOnBeforeExit`, so a `beforeExit` handler that resolves the abandoned-TLA promise has its continuation dropped. I added a drain (and a regression test) and it broke `test/js/node/test/parallel/test-event-capture-rejections.js` on Windows aarch64 (Node compatibility test, upstream code we can't modify — verified 0/10 pass locally after the change, 10/10 before). The drain causes a microtask to fire during onBeforeExit that triggers one of the `process.on('unhandledRejection', err => throw err)` handlers the test installs, producing the `error: kaboom` at line 80:55 in CI build 51788. Revert the onBeforeExit change and drop the regression test. The beforeExit-resolves-TLA case remains a follow-up (as the bot originally labeled it), to be investigated separately without perturbing the Node compatibility surface.
|
Closing since #29546 was closed as not-reproducible on 1.3.13 Linux. The bug IS reproducible — I verified it end-to-end on this branch:
The fix routes If the issue gets reopened with a fresh repro, happy to rebase this. |
Fixes #29546
Fixes #14951
Reproduction
Before: Windows hangs forever, POSIX spins one core at 100% for the full 2s before the timer fires.
After: exits cleanly with no output (matches Node.js).
Root cause
AbortSignal.timeout(N)inserts a timer into Bun's own heap but deliberately does not bumpactive_timer_count— it's unref'd by design. When the only thing holding up a top-levelawaitis that unref'd timer,autoTick()'s two platform paths both misbehave:POSIX:
loop.isActive()is false, so we take thetickWithoutIdle()branch (zero-timeout poll) and thendrainTimers(). That drains the unref'd timer eventually — but only after thewaitForPromiseloop spins until the deadline, burning one full core.Windows:
tickWithoutIdle→us_loop_pump→uv_run(UV_RUN_NOWAIT).uv_runearly-returns whenuv__loop_alive()is false (vendor/libuv/src/win/core.c:719) — and with no ref'd handles it is. So the uv timer that would normally drain Bun's heap viaonUVTimernever fires. Hang forever.Fix
Add two helpers on
EventLoop:hasAnyHandleWork()— returns true if anything could still wake the loop (active uv/uws handles, queued tasks, concurrent refs, immediates). Intentionally NOTVirtualMachine.isEventLoopAlive(), which short-circuits onunhandled_error_counter != 0— a side-path unhandled rejection during a TLA wait must NOT abandon an in-flight fetch that will resolve the TLA.waitForPromiseOrLoopExit()— identical towaitForPromiseexcept it breaks whenhasAnyHandleWork()returns false.Route
loadEntryPoint(both the watcher and non-watcher branches) through it. This is the "run the main script" entry point —bun run foo.ts,bun foo.ts,bun -e '…'.waitForPromiseitself is unchanged, andloadEntryPointForTestRunner/loadPreloads/waitForPromiseWithTermination(workers) also keep the oldwaitForPromisecontract. Several callers rely on "returns with the promise resolved" — most visiblyExpect.getValueAsToThrow(src/bun.js/test/expect.zig:589), which asserts.pending => unreachableafter the wait. Breaking that contract is what caused the test-runner regressions in the earlier attempt at this fix (#27215). A test-file's top-level-await-only-held-by-an-unref'd-timer can still hangbun testtoday; that's a narrower, separate bug and out of scope here.The new behavior matches Node.js: an unsettled top-level await in the main script exits cleanly (with a warning, on Node ≥ 20) rather than waiting on unref'd handles.
Verification
Checked out
origin/mainonsrc/bun.js/event_loop.zig+src/bun.js/VirtualMachine.zig, rebuilt, and ran the tests → 2 fail on the 5s test timeout (hang confirmed), 2 pass (the "keeps the loop alive" regression guard + the "unhandled rejection" guard). Restored the fix, rebuilt → all 4 pass in <1s each.The "fires when something else keeps the loop alive" test is the regression guard for the original bug — with a ref'd
setTimeoutholding the loop open, the unref'dAbortSignal.timeoutstill fires on schedule and the listener still runs.The "unhandled rejection mid-TLA" test pins the
hasAnyHandleWork()vsisEventLoopAlive()distinction: a side-path unhandled rejection must NOT cause the TLA wait to bail while real work (here, asetTimeout) is still pending. If the liveness check ever gets simplified toisEventLoopAlive(), this test will catch it.Also re-ran
test/js/node/timers/*,test/regression/issue/28756.test.ts(AbortSignal memory-leak), and the TLA regressions (#20100 / #21257 / #24007 / #28914) — all pass.