Skip to content

Exit unsettled top-level await instead of hanging / busy-looping#29549

Closed
robobun wants to merge 33 commits into
mainfrom
farm/ed4a7fa6/fix-abortsignal-timeout-tla-hang
Closed

Exit unsettled top-level await instead of hanging / busy-looping#29549
robobun wants to merge 33 commits into
mainfrom
farm/ed4a7fa6/fix-abortsignal-timeout-tla-hang

Conversation

@robobun

@robobun robobun commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Fixes #29546
Fixes #14951

Reproduction

async function run(signal: AbortSignal): Promise<string> {
  return new Promise((resolve) => {
    signal.addEventListener('abort', () => resolve('run aborted'));
  });
}

const r = await run(AbortSignal.timeout(2000));
console.log(r);

Before: Windows hangs forever, POSIX spins one core at 100% for the full 2s before the timer fires.
After: exits cleanly with no output (matches Node.js).

Root cause

AbortSignal.timeout(N) inserts a timer into Bun's own heap but deliberately does not bump active_timer_count — it's unref'd by design. When the only thing holding up a top-level await is that unref'd timer, autoTick()'s two platform paths both misbehave:

  • POSIX: loop.isActive() is false, so we take the tickWithoutIdle() branch (zero-timeout poll) and then drainTimers(). That drains the unref'd timer eventually — but only after the waitForPromise loop spins until the deadline, burning one full core.

  • Windows: tickWithoutIdleus_loop_pumpuv_run(UV_RUN_NOWAIT). uv_run early-returns when uv__loop_alive() is false (vendor/libuv/src/win/core.c:719) — and with no ref'd handles it is. So the uv timer that would normally drain Bun's heap via onUVTimer never fires. Hang forever.

Fix

Add two helpers on EventLoop:

  • hasAnyHandleWork() — returns true if anything could still wake the loop (active uv/uws handles, queued tasks, concurrent refs, immediates). Intentionally NOT VirtualMachine.isEventLoopAlive(), which short-circuits on unhandled_error_counter != 0 — a side-path unhandled rejection during a TLA wait must NOT abandon an in-flight fetch that will resolve the TLA.
  • waitForPromiseOrLoopExit() — identical to waitForPromise except it breaks when hasAnyHandleWork() returns false.

Route loadEntryPoint (both the watcher and non-watcher branches) through it. This is the "run the main script" entry point — bun run foo.ts, bun foo.ts, bun -e '…'.

waitForPromise itself is unchanged, and loadEntryPointForTestRunner / loadPreloads / waitForPromiseWithTermination (workers) also keep the old waitForPromise contract. Several callers rely on "returns with the promise resolved" — most visibly Expect.getValueAsToThrow (src/bun.js/test/expect.zig:589), which asserts .pending => unreachable after the wait. Breaking that contract is what caused the test-runner regressions in the earlier attempt at this fix (#27215). A test-file's top-level-await-only-held-by-an-unref'd-timer can still hang bun test today; that's a narrower, separate bug and out of scope here.

The new behavior matches Node.js: an unsettled top-level await in the main script exits cleanly (with a warning, on Node ≥ 20) rather than waiting on unref'd handles.

Verification

$ bun bd test test/regression/issue/29546.test.ts
✓ AbortSignal.timeout awaited at top-level does not hang or spin
✓ AbortSignal.timeout fires when something else keeps the loop alive
✓ top-level await on a never-resolving promise exits cleanly
✓ unhandled rejection mid-TLA does not abandon an in-flight wait

Checked out origin/main on src/bun.js/event_loop.zig + src/bun.js/VirtualMachine.zig, rebuilt, and ran the tests → 2 fail on the 5s test timeout (hang confirmed), 2 pass (the "keeps the loop alive" regression guard + the "unhandled rejection" guard). Restored the fix, rebuilt → all 4 pass in <1s each.

The "fires when something else keeps the loop alive" test is the regression guard for the original bug — with a ref'd setTimeout holding the loop open, the unref'd AbortSignal.timeout still fires on schedule and the listener still runs.

The "unhandled rejection mid-TLA" test pins the hasAnyHandleWork() vs isEventLoopAlive() distinction: a side-path unhandled rejection must NOT cause the TLA wait to bail while real work (here, a setTimeout) is still pending. If the liveness check ever gets simplified to isEventLoopAlive(), this test will catch it.

Also re-ran test/js/node/timers/*, test/regression/issue/28756.test.ts (AbortSignal memory-leak), and the TLA regressions (#20100 / #21257 / #24007 / #28914) — all pass.

@robobun

robobun commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator Author
Updated 4:06 PM PT - May 5th, 2026

@robobun, your commit ab4b3b4 has 1 failures in Build #51815 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 29549

That installs a local version of the PR into your bun-29549 executable, so you can run:

bun-29549 --bun

@github-actions

Copy link
Copy Markdown
Contributor

Found 2 issues this PR may fix:

  1. awaiting top level no-op promise that never resolves is missing an error/warn, causing 100% cpu usage #14951 - Exact match: await new Promise(r => {}) at top level causes 100% CPU usage, which is precisely the unsettled TLA busy-loop that waitForPromiseOrLoopExit now breaks out of cleanly.
  2. Bun process does not exit after Tinypool destroy() completes (worker threads remain alive) #29173 - After Tinypool.destroy() completes, worker threads are terminated but the process hangs on the top-level await because no ref'd handles remain; the new loop-exit check would cause the process to exit cleanly instead of hanging.

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #14951
Fixes #29173

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds event-loop liveness checks and a new wait variant so awaiting a promise can exit early when the loop has no remaining handle work; integrates this into VirtualMachine.loadEntryPoint. Adds regression tests validating top-level await behavior with timeouts, never-resolving promises, and unhandled rejections.

Changes

Cohort / File(s) Summary
Virtual Machine changes
src/bun.js/VirtualMachine.zig
loadEntryPoint wait logic adjusted: watcher path breaks pending-promise loop if promise still .pending but eventLoop().hasAnyHandleWork() is false; non-watcher path now uses eventLoop().waitForPromiseOrLoopExit(...) instead of waitForPromise(...), allowing earlier exit when the loop has no handle work.
Event loop helpers
src/bun.js/event_loop.zig
Added pub fn hasAnyHandleWork(this: *EventLoop) bool and pub fn waitForPromiseOrLoopExit(this: *EventLoop, promise: jsc.AnyPromise) void. hasAnyHandleWork aggregates liveness signals (loop handles, task counters, ref state, immediate queues); waitForPromiseOrLoopExit mirrors the tick/microtask loop but can terminate if no handle work remains.
Regression tests
test/regression/issue/29546.test.ts
Adds five Bun regression tests that spawn separate Bun processes to exercise top-level await with AbortSignal.timeout(...), unref'd timers, never-resolving promises, and unhandled rejections; each asserts timing, stdout, and exit codes to ensure processes exit correctly (no hang or busy-spin).
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: fixing top-level await hanging/busy-looping when only held by unref'd timers.
Description check ✅ Passed The description comprehensively covers root cause, fix details, verification, and regression testing, but lacks explicit 'How did you verify your code works?' section as required by template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

This PR may be a duplicate of:

  1. fix: detect unsettled top-level await and exit instead of busy-waiting #27215 - Also fixes unsettled top-level await causing busy-looping/hanging by detecting when the event loop has no active handles while a TLA promise is still pending

🤖 Generated with Claude Code

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/bun.js/VirtualMachine.zig (1)

2145-2167: ⚠️ Potential issue | 🟠 Major

Don't treat a still-pending preload as success.

waitForPromiseOrLoopExit() can now return with promise.status() == .pending, but loadPreloads() still only returns the promise on .rejected. In that case the preload silently falls through and the main entry point starts loading even though the preload never finished.

Suggested fix
-        if (promise.status() == .rejected)
-            return promise;
+        switch (promise.status()) {
+            .fulfilled => {},
+            else => return promise,
+        }

Also applies to: 2169-2170

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bun.js/VirtualMachine.zig` around lines 2145 - 2167, The code assumes
waitForPromiseOrLoopExit(.{ .internal = promise }) always yields
resolved/rejected, but it can return while promise.status() == .pending causing
loadPreloads()/pending_internal_promise to silently fall through; update the
non-watcher branch to check this pending_internal_promise?.status() after
eventLoop.waitForPromiseOrLoopExit and treat .pending like an error or continue
waiting (e.g., loop until status != .pending or explicitly convert pending to
.rejected), and mirror the same guard/handling in loadPreloads() (and the
referenced lines ~2169-2170) so the main entry point never proceeds when a
preload remains pending; reference functions/fields:
eventLoop.waitForPromiseOrLoopExit, pending_internal_promise, loadPreloads,
isWatcherEnabled, eventLoop.performGC.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/regression/issue/29546.test.ts`:
- Around line 24-29: Remove the brittle assertions that require an empty stderr
in the subprocess regression tests: delete any
expect(stripAsanWarning(stderr)).toBe("") (or expect(stderr).toBe("")) checks in
the tests that spawn bunExe() with bunEnv and instead rely on stdout and
exitCode as the regression signal; search for uses of stripAsanWarning and
remove the empty-stderr expectations (including the other occurrences in this
test file).

---

Outside diff comments:
In `@src/bun.js/VirtualMachine.zig`:
- Around line 2145-2167: The code assumes waitForPromiseOrLoopExit(.{ .internal
= promise }) always yields resolved/rejected, but it can return while
promise.status() == .pending causing loadPreloads()/pending_internal_promise to
silently fall through; update the non-watcher branch to check this
pending_internal_promise?.status() after eventLoop.waitForPromiseOrLoopExit and
treat .pending like an error or continue waiting (e.g., loop until status !=
.pending or explicitly convert pending to .rejected), and mirror the same
guard/handling in loadPreloads() (and the referenced lines ~2169-2170) so the
main entry point never proceeds when a preload remains pending; reference
functions/fields: eventLoop.waitForPromiseOrLoopExit, pending_internal_promise,
loadPreloads, isWatcherEnabled, eventLoop.performGC.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 808f2194-8307-4e09-ba3f-9936db83c37b

📥 Commits

Reviewing files that changed from the base of the PR and between 7c73f05 and c6fe5d9.

📒 Files selected for processing (3)
  • src/bun.js/VirtualMachine.zig
  • src/bun.js/event_loop.zig
  • test/regression/issue/29546.test.ts

Comment thread test/regression/issue/29546.test.ts Outdated
Comment thread src/jsc/event_loop.zig
Comment thread src/jsc/event_loop.zig
Comment thread test/regression/issue/29546.test.ts Outdated
Comment thread src/jsc/VirtualMachine.zig
Comment thread test/regression/issue/29546.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/regression/issue/29546.test.ts`:
- Around line 4-25: Remove the long file-level bug-history block at the top of
test/regression/issue/29546.test.ts and replace it with only the issue URL and a
one-line summary; keep the per-test rationale comments that explain why the
timings/fixtures detect the regression. Concretely, trim out the multi-paragraph
explanation referencing loadEntryPoint, EventLoop.waitForPromiseOrLoopExit,
waitForPromise and Expect.toThrow so the header contains just the issue link and
a concise summary, while leaving any inline test comments that describe test
design and timing choices intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 25b27b14-7959-457b-b514-43cc96534d8e

📥 Commits

Reviewing files that changed from the base of the PR and between 2858b4c and 6a6c742.

📒 Files selected for processing (1)
  • test/regression/issue/29546.test.ts

Comment thread test/regression/issue/29546.test.ts
Comment thread src/jsc/event_loop.zig

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/bun.js/event_loop.zig`:
- Around line 591-600: The hasAnyHandleWork function is checking
vm.event_loop_handle.?.isActive() which on Windows references the libuv loop
instead of the uSockets loop; update that check to use
this.usocketsLoop().isActive() (preserving the optional/null-safety semantics
used elsewhere) so the early-exit condition used by waitForPromiseOrLoopExit
correctly observes uws I/O activity; locate hasAnyHandleWork and replace the
vm.event_loop_handle.?.isActive() term with an equivalent
this.usocketsLoop().isActive() call consistent with
autoTick()/tickPossiblyForever()/autoTickActive().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8b9bb2d9-017c-4b39-ba48-99d5c77dd809

📥 Commits

Reviewing files that changed from the base of the PR and between 6a6c742 and 1b814e9.

📒 Files selected for processing (1)
  • src/bun.js/event_loop.zig

Comment thread src/jsc/event_loop.zig
Comment thread src/jsc/event_loop.zig

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/bun.js/event_loop.zig (1)

591-600: ⚠️ Potential issue | 🟠 Major

Use this.usocketsLoop().isActive() instead of vm.event_loop_handle.?.isActive() on Windows.

On Windows, vm.event_loop_handle is the libuv loop while usocketsLoop() returns this.uws_loop (uSockets). Other functions in this file (autoTick(), tickPossiblyForever(), autoTickActive()) use usocketsLoop() for cross-platform consistency.

Checking the wrong loop could cause the early-exit condition in waitForPromiseOrLoopExit() to trigger incorrectly on Windows when there's still pending work in the uws I/O loop.

🔧 Suggested fix
 pub fn hasAnyHandleWork(this: *EventLoop) bool {
     const vm = this.virtual_machine;
-    return vm.event_loop_handle.?.isActive() or
+    return this.usocketsLoop().isActive() or
         vm.active_tasks > 0 or
         this.tasks.count > 0 or
         this.hasPendingRefs() or
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bun.js/event_loop.zig` around lines 591 - 600, Replace the libuv loop
check with the uSockets loop check in hasAnyHandleWork: currently it calls
vm.event_loop_handle.?.isActive(), which on Windows inspects the wrong loop;
change it to use this.usocketsLoop().isActive() so the function
(hasAnyHandleWork) correctly reflects uSockets activity (matching other
functions like autoTick, tickPossiblyForever, autoTickActive) and avoids
premature exit in waitForPromiseOrLoopExit().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/bun.js/event_loop.zig`:
- Around line 591-600: Replace the libuv loop check with the uSockets loop check
in hasAnyHandleWork: currently it calls vm.event_loop_handle.?.isActive(), which
on Windows inspects the wrong loop; change it to use
this.usocketsLoop().isActive() so the function (hasAnyHandleWork) correctly
reflects uSockets activity (matching other functions like autoTick,
tickPossiblyForever, autoTickActive) and avoids premature exit in
waitForPromiseOrLoopExit().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e8fba4f3-bb15-4626-9123-135496f56c50

📥 Commits

Reviewing files that changed from the base of the PR and between 1b814e9 and 71dfe8b.

📒 Files selected for processing (2)
  • src/bun.js/event_loop.zig
  • test/regression/issue/29546.test.ts

Comment thread src/jsc/VirtualMachine.zig

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix logic is sound and all previously raised concerns have been addressed, but the event-loop liveness semantics (especially on Windows where is libuv while is uSockets) are subtle enough that a human familiar with Bun's Windows port should give this a final pass before merge.

Extended reasoning...

Overview

PR adds and to , then routes (both watcher and non-watcher branches) through the new function. Also adds between and the liveness check in the watcher branch to prevent silent microtask loss. A regression test file with 5 targeted scenarios is included.

Security Risks

None — this is pure event-loop scheduling logic with no auth, crypto, or I/O boundary changes.

Level of Scrutiny

High. This touches the event loop tick loop, which is among the most critical code paths in the runtime. Correctness failures here manifest as silent data loss (TLA continuations dropped, module init skipped) rather than hard crashes, making them difficult to detect in CI.

Other Factors

All inline concerns raised in previous review passes are marked resolved: the / bug was fixed via the new , the worker TLA change was correctly reverted, is checked, the microtask drain is present, and assertion ordering in tests was fixed. The one remaining open question — whether (libuv on Windows) is the right liveness proxy vs (uSockets) — was raised by CodeRabbit and marked resolved without a visible code change. That distinction is harmless on POSIX (same object) but could produce an incorrect early-exit on Windows if a uSockets handle is active while libuv is not. The scope decision to leave and on the old path is explicitly documented and reasonable.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previously identified issues have been addressed, but this PR introduces novel event loop control flow in critical infrastructure — a human should verify the final state.

Extended reasoning...

Overview

This PR adds two new event loop helpers ( and ) to fix POSIX busy-looping at 100% CPU and Windows infinite hangs when top-level await is held only by an unref'd timer. The changes touch and , with a regression test file added.

Security risks

None. This is event loop scheduling logic with no auth, crypto, or permission implications.

Level of scrutiny

High. The event loop is among the most critical code paths in Bun. The fix introduces a new liveness predicate () and a new wait variant (), both with subtle correctness requirements: they must be conservative (never exit while real work is pending) while also being aggressive enough to exit when only unref'd work remains. Multiple real bugs were found and addressed during this review cycle, indicating the code needed careful iteration.

Other factors

All of my previous inline bug reports (7 in total) were addressed in commit 36da1f7 — fixing the short-circuit issue, adding to the liveness check, adding microtask drains in both the non-watcher and watcher branches, and reverting the unneeded worker change. The test suite is comprehensive with 5 tests covering the regression, the keep-alive guard, never-resolving TLA, unhandled-rejection-during-wait, and the subtle unhandledRejection-handler-resolves-TLA case. No new bugs were surfaced by the bug hunting system. The PR is well-reasoned and in good shape; a human reviewer should verify the final implementation before merge.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All earlier review feedback has been addressed and I didn't find anything new this round, but this changes the core TLA wait/exit semantics in the event loop (and the prior attempt #27215 regressed), so it's worth a human pass before merge.

Extended reasoning...

Overview

This PR adds EventLoop.hasAnyHandleWork() and EventLoop.waitForPromiseOrLoopExit() in src/bun.js/event_loop.zig, and routes loadEntryPoint (both watcher and non-watcher branches) in src/bun.js/VirtualMachine.zig through the new wait variant. The net effect is that an unsettled top-level await with no ref'd handles now exits cleanly instead of busy-looping (POSIX) or hanging (Windows). A new regression test file with five subprocess tests pins the behavior.

Over six prior review rounds I flagged: (1) the original isEventLoopAlive() predicate short-circuiting on unhandled_error_counter, (2) an unintended change to waitForPromiseWithTermination, (3) missing concurrent_tasks in the liveness check, (4) a missing microtask drain between autoTick() and the liveness break, and (5) the same drain missing from the watcher branch. All of these have been addressed in the current diff, and the bug-hunting pass on the latest revision found nothing new.

Security risks

None. This is purely event-loop scheduling/exit logic; no auth, crypto, parsing, or untrusted-input surfaces are touched.

Level of scrutiny

High. event_loop.zig and VirtualMachine.loadEntryPoint are on the hot path for every bun run / bun foo.ts invocation, and this PR changes when the process exits — a user-visible semantic. The previous attempt at this fix (#27215) caused test-runner regressions, and this PR went through several correctness iterations before stabilizing. The liveness predicate aggregates seven signals across two platforms; subtle mis-ordering or a missed signal can silently drop work. That warrants a maintainer's eyes, not an auto-approve.

Other factors

  • waitForPromise itself is deliberately unchanged, and loadEntryPointForTestRunner / loadPreloads keep the old contract (the PR description scopes the test-runner case as a separate bug). That scoping decision is reasonable but worth a maintainer ack.
  • CodeRabbit raised usocketsLoop() vs event_loop_handle on Windows; the thread is marked resolved but the code still uses event_loop_handle (matching isEventLoopAliveExcludingImmediates). Likely correct, but a Windows-familiar reviewer should confirm.
  • CI shows two failures (in-process-cron, bun-install) that the latest commit message attributes to pre-existing flakes; they don't appear related to this change.
  • The microtask drain is now gated behind \!hasAnyHandleWork() (only runs on the about-to-break iteration), which keeps the steady-state loop identical to waitForPromise — a nice property for perf, but also means the drain only covers the one known edge case rather than every iteration.

robobun and others added 9 commits May 4, 2026 10:36
An unref'd Bun timer (e.g. AbortSignal.timeout()) awaited at the top
level with no ref'd handle keeping the loop alive hit two distinct
event-loop bugs:

  POSIX:   autoTick() took the tickWithoutIdle() branch (zero-timeout
           poll) and drainTimers() fired the timer eventually, but only
           after a tight busy loop burning ~100% CPU for the full wait.

  Windows: tickWithoutIdle -> us_loop_pump -> uv_run(UV_RUN_NOWAIT).
           uv_run early-returns when uv__loop_alive() is false, so the
           unref'd uv timer that would drain Bun's heap never fired.
           The process hung forever.

Add EventLoop.waitForPromiseOrLoopExit: like waitForPromise, but bails
when isEventLoopAlive() goes false. Route the three TLA entry paths
(loadEntryPoint, loadEntryPointForTestRunner, loadPreloads) plus the
watcher variants through it, and give waitForPromiseWithTermination the
same break so workers with unsettled TLA exit cleanly too.

waitForPromise itself is unchanged — its contract ('returns with the
promise resolved') is load-bearing for callers like Expect's toThrow.

Matches Node.js, which also exits an unsettled top-level await without
waiting on unref'd handles.

Fixes #29546
Address two review findings:

1. (coderabbit, major) loadPreloads + loadEntryPointForTestRunner could
   now return with pending_internal_promise, silently advancing to the
   next preload / reporting 0 tests instead of waiting. Revert those
   paths — preloads/testrunner keep the old 'wait until resolved'
   contract. Only loadEntryPoint (the 'bun run <script>' path) gets the
   loop-exit escape.

2. (claude, critical) isEventLoopAlive() short-circuits on
   unhandled_error_counter != 0. A side-path unhandled rejection during
   a TLA wait would cause waitForPromiseOrLoopExit to bail while the
   actual work (setTimeout, fetch, etc.) was still in flight — the
   continuation after 'await' would never run. Replace the check with a
   new EventLoop.hasAnyHandleWork() helper that only considers handle
   liveness, not error state. waitForPromise callers (Expect, bundler,
   html_rewriter, macros, REPL) are untouched — their 'returns with
   promise resolved' contract is still load-bearing.

Also revert the waitForPromiseWithTermination break from the prior
commit; workers aren't exercising the reporter's bug.

Adds a regression test pinning the mid-TLA-unhandled-rejection
behavior so a future someone doesn't reintroduce the short-circuit.
The fix function is EventLoop.waitForPromiseOrLoopExit;
loadEntryPoint just calls it. waitForPromise itself is unchanged.
A hang regression would produce exitCode=143 (test-timeout SIGTERM) and
elapsed>=30_000. Asserting exit code first short-circuits and the diff
reads as "expected 0, received 143" — technically a failure but useless
for spotting the actual regression (the timer). Swap the order on the two
timing-sensitive tests so the elapsed check fires first.
Closes a narrow race where a thread pushes to concurrent_tasks without a
corresponding ref (e.g. hot_reloader.zig:278, where the Watcher thread
has no KeepAlive of its own and relies on the loop's forever_timer for
liveness). Between tick() draining concurrent_tasks and the check, a
fresh push could be missed if all other liveness signals happened to be
zero.

concurrent_tasks.isEmpty() is a single atomic acquire-load on the back
pointer (unbounded_queue.zig:165), safe to call from the JS thread
without locking. Changes the hasAnyHandleWork receiver from *const to
*EventLoop because isEmpty takes *Self.
autoTick ends with handleRejectedPromises(), which runs user JS. In
.bun unhandled-rejection mode with a registered handler, the path
through Bun__handleUnhandledRejection returns WITHOUT a 'defer
drainMicrotasks' (unlike .none / .warn / .strict). If that handler
resolves the TLA-awaited promise, the continuation is queued as a JSC
microtask that hasn't run yet when autoTick returns.

hasAnyHandleWork() inspects uv/uws handles, task queues, refs, and
immediates — none of which see JSC microtasks. Without a drain between
autoTick and the liveness check, the loop breaks with the promise
pending, silently dropping the continuation.

Added a regression test that pins this: a TLA awaits a promise that is
only resolved via process.on('unhandledRejection') firing inside
autoTick. Without the drain it hits the break and exits with empty
stdout; with the drain it prints 'got=handled' and exits 0.
Mirrors the fix in 71dfe8b for the non-watcher path. autoTick's
handleRejectedPromises() can run a user process.on('unhandledRejection')
handler that resolves the TLA promise and queues a continuation
microtask; hasAnyHandleWork inspects handles/tasks/refs/immediates but
not JSC microtasks, so the break could fire before the continuation
runs.

In practice the outer bun.js.zig watcher loop (while
isEventLoopAlive()) drains microtasks after this returns, so the
continuation survives even without the drain here. But the symmetry
with waitForPromiseOrLoopExit is worth preserving — every caller of
hasAnyHandleWork now drains first, so the assumption 'if the check
says no work, there is genuinely no pending JS to run' holds at the
call site rather than relying on downstream machinery.
Comment thread src/jsc/VirtualMachine.zig
…ers progress

The in-tick HotReloadTask path drains microtasks via tick()'s
else-branch after the task returns; the new module's loader chain
(fetch → link → evaluate) runs inline and the module body is
evaluated before reload() returns. Callers *outside* tick() don't get
that drain:

  - reportExceptionInHotReloadedModuleIfNeeded()'s stranded-flag
    consumer (5215594)
  - the pre-existing bun.js.zig initial-load rejection path

They would return to the main loop with the new module still unset,
tickPossiblyForever() would then block in loop.tick() before
this.tick() ever ran, and the reload would only take effect on the
next loop wakeup (another save, or forever_timer firing in up to 4
minutes). User-visible symptom: observably identical to 'save silently
dropped, recovers on next save' for the two narrow stranded-flag
triggers 5215594 was meant to close.

Drain once after reloadEntryPoint() returns so every call site —
in-tick, reportException…, bun.js.zig — gets identical semantics.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found in this revision, but this reworks core event-loop liveness + TLA exit semantics and the --hot reload state machine (with Windows-specific direct reads of libuv internals), so it needs a human sign-off — also worth confirming the Windows test-http-should-emit-close-when-connection-is-aborted.ts timeouts in Build #51503 are pre-existing flakes and not related to the loop changes here.

Extended reasoning...

Overview

Touches three files: src/jsc/event_loop.zig (adds hasAnyHandleWork, hasAnyHandleWorkIgnoringForeverTimer, waitForPromiseOrLoopExit), src/jsc/VirtualMachine.zig (routes loadEntryPoint's TLA wait through the new helper, reworks reload() and reportExceptionInHotReloadedModuleIfNeeded for the abandoned-TLA case under --hot, adds a post-reloadEntryPoint microtask drain), and a new 256-line regression test. The change alters when a Bun process decides to exit vs. keep waiting on a top-level await — core process-lifecycle semantics.

Security risks

None identified. No auth/crypto/permissions surface; the change is internal event-loop scheduling. The behavioral change (exit instead of wait on unref'd handles) matches Node.js semantics and is not a security boundary.

Level of scrutiny

High. This is production-critical runtime code: the event loop's liveness predicate and the entry-point promise-wait loop affect every bun run / bun -e invocation. The Windows branch of hasAnyHandleWorkIgnoringForeverTimer reads libuv internal struct fields (active_handles, active_reqs.count, pending_reqs_tail, endgame_handles) directly and applies a heuristic subtraction for forever_timer. The --hot reload changes alter when the module registry is torn down vs. deferred. A prior attempt at this same fix (#27215) caused test-runner regressions, and this PR went through ~15 rounds of automated review correcting subtle issues (the unhandled_error_counter short-circuit, worker-TLA revert, Windows forever_timer masking, zombie-continuation ordering, stranded hot_reload_deferred, out-of-tick() microtask drain). All of those are now resolved, but the iteration count itself signals the subtlety.

Other factors

  • All prior inline review comments are marked resolved and the latest commit (e86f538) addresses the last one.
  • Build #51503 (one commit back, 5215594) shows test-http-should-emit-close-when-connection-is-aborted.ts timing out on three Windows lanes plus a Windows build-agent infra failure. loadEntryPointForTestRunner was deliberately left unchanged so this is plausibly a pre-existing flake, but given the PR does change Windows event-loop liveness it's worth a human confirming before merge.
  • Good test coverage for the specific scenarios fixed (six regression tests including a --hot round-trip test), and the PR description documents re-running the related TLA/timer regression suites.

This is not a candidate for auto-approval — it's a substantive runtime behavior change that warrants maintainer review.

Same pre-existing Windows flakes as builds 51163, 51423, 51448, 51473.
Hot test sourcemap generation flakes at ~5% rate independently of
this PR (verified at e86f538, 5215594, and main).

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No bugs found in the current revision, but this reworks core event-loop liveness + the --hot reload deferral path (including Windows-specific reads of libuv loop internals) across enough subtle iterations that a human pass on the final shape is warranted.

Extended reasoning...

Overview

This PR changes how Bun handles unsettled top-level await in the main entry point, plus a substantial rework of the --hot reload deferral path that fell out of it. Touched surfaces:

  • src/jsc/event_loop.zig: new hasAnyHandleWork(), hasAnyHandleWorkIgnoringForeverTimer() (Windows branch reads uv_loop_t internal fields directly: active_handles, active_reqs.count, pending_reqs_tail, endgame_handles), and waitForPromiseOrLoopExit().
  • src/jsc/VirtualMachine.zig: loadEntryPoint watcher-loop break, reload() .pending arm restructured (drain → re-switch → liveness gate), a new .pending consumer in reportExceptionInHotReloadedModuleIfNeeded, and a post-reloadEntryPoint microtask drain.
  • New regression test with 6 cases including a --hot round-trip test.

Security risks

None identified — this is scheduling/liveness logic with no auth, parsing, or external-input surface.

Level of scrutiny

High. This is the core event-loop tick path and the --hot reload state machine — code that runs on every bun run and every file save under --hot. The PR went through ~14 commits responding to a long chain of subtle ordering/race findings (unhandled-error-counter short-circuit, concurrent_tasks race, Windows forever_timer keeping uv_loop_alive true, microtask-drain placement relative to liveness checks, stranded hot_reload_deferred under coalesced watcher events). Each individual fix looks correct in isolation, but the cumulative interaction between reload(), reportExceptionInHotReloadedModuleIfNeeded(), loadEntryPoint's watcher loop, and tickPossiblyForever() is now intricate enough that a maintainer should sign off on the final shape — particularly the Windows branch that subtracts forever_timer from raw uv_loop_t.active_handles and the decision to drain microtasks at the tail of reload() for all call sites.

Other factors

  • All prior inline findings are addressed and resolved; no open bugs in the current revision.
  • Good regression coverage (6 tests pinning the TLA exit, the unhandled-rejection edge, and the --hot reload-after-abandoned-TLA path), though the most contorted edge cases (coalesced watcher events, ref'd→abandoned transition) are not directly tested.
  • A prior attempt at this same fix (#27215) was reverted for test-runner regressions, which is why this version deliberately leaves waitForPromise / loadEntryPointForTestRunner / loadPreloads untouched — that scoping decision itself is worth a human nod.

@robobun robobun force-pushed the farm/ed4a7fa6/fix-abortsignal-timeout-tla-hang branch from 630731d to 52a136f Compare May 5, 2026 04:19
Hot reload 'should work with sourcemap generation' flakes at ~5% on
windows-11-aarch64 under qemu: the 10s per-test timeout fires before
50 hot-reload cycles complete, independent of this PR.

Cross-PR evidence from the last 24h:
  - #28645 retriggered 3 times for the same test
  - #29945, #30018, #30026 all retriggered
  - #29587 build 51335: documented as 'known hot-reload flake'
  - #29735 open: test-level workaround in progress
  - #29740 merged: runtime fix for a related --hot sourcemap race

Local repro: identical ~5% rate at my tip (e86f538), at 5215594
(before the stranded-defer fix), and at 0ce2096 (before the
forever-timer carve-out). Rate unchanged across the three, confirming
the flake is not in this PR's footprint.

Same persistent test-http-should-emit-close flake on 2019 x64-baseline
(seen in 51163, 51423, 51448, 51473, 51503, 51528, 51604).
Comment thread test/regression/issue/29546.test.ts

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new issues from the bug hunter on this revision — but this rewires core event-loop liveness (TLA exit semantics, reload() deferral, platform-specific libuv internals on Windows), so it warrants a human pass before merge.

Extended reasoning...

Overview

This PR changes how Bun handles unsettled top-level await in the main entry point. It touches three areas:

  • src/jsc/event_loop.zig: adds hasAnyHandleWork(), hasAnyHandleWorkIgnoringForeverTimer() (which on Windows reads raw libuv loop fields — active_handles, active_reqs.count, pending_reqs_tail, endgame_handles — and subtracts the forever_timer contribution), and waitForPromiseOrLoopExit().
  • src/jsc/VirtualMachine.zig: routes loadEntryPoint's TLA wait through the new helper (both watcher and non-watcher branches), rewrites reload()'s .pending deferral to distinguish abandoned TLAs from ref'd-I/O TLAs, adds a stranded-flag consumer in reportExceptionInHotReloadedModuleIfNeeded, and adds a post-reloadEntryPoint microtask drain.
  • test/regression/issue/29546.test.ts: six regression tests covering the exit, the keep-alive guard, the isEventLoopAlive vs hasAnyHandleWork distinction, the unhandledRejection-handler microtask race, and --hot reload after an abandoned TLA.

Security risks

None identified. This is process-lifetime / event-loop scheduling logic; no auth, crypto, parsing of untrusted input, or permissions surface is touched.

Level of scrutiny

High. This is production-critical runtime code that determines when bun run exits and how --hot decides to reload. The change history on this PR (15+ iterations, each fixing a subtle race or platform divergence found in the previous one) demonstrates exactly how delicate the liveness predicate is. The Windows branch of hasAnyHandleWorkIgnoringForeverTimer reaches directly into libuv's uv_loop_t struct layout and applies a heuristic subtraction — that alone deserves a maintainer's eyes. The reload() rewrite changes when the module registry is torn down relative to in-flight TLA continuations, which is a meaningful semantic shift for --hot.

Other factors

  • All prior inline review comments (mine, CodeRabbit's) are now resolved; the accidental-revert commit (630731d) flagged in my last review has been removed from history and the source changes are restored.
  • The bug-hunting system found no new issues on this revision.
  • CI on the latest commit shows only infrastructure failures (BuildKite agent creation / fleet issues), not test failures — but that also means the latest revision hasn't yet had a clean full-matrix run, including the Windows lanes that exercise the libuv-specific path.
  • A prior attempt at this fix (#27215) caused test-runner regressions; this PR carefully scopes the new behavior to loadEntryPoint only, but the interaction surface (hot-reload deferral, forever_timer, microtask draining order) is broad enough that a human should sign off on the design.

This is well outside the "simple, mechanical, obvious" bar for auto-approval.

Comment thread src/jsc/VirtualMachine.zig
The stranded-`hot_reload_deferred` consumer at the `.pending` arm
calls `reload()`, whose in-`reload()` microtask drain now settles
the NEW `pending_internal_promise`. But `var promise` at the top
of the function was captured from the OLD `pending_internal_promise`
— if the new module body synchronously rejects (top-level `throw`,
transpile error), the fresh `.rejected` is missed because the outer
`switch` dispatches on the stale capture and the bare `return` at
the end of the `.pending` arm fires before the `.rejected` arm can
run. `tickPossiblyForever` then blocks in `loop.tick()` before the
next main-loop iteration re-fetches, so the rejection sits unreported
until another save or `forever_timer`.

Tail-recurse after the reload so the next frame re-reads
`this.pending_internal_promise` at its own line 720 and dispatches
on the fresh status. One-line fix, zero risk (the recursion can only
go one level deeper before the flag consumer guard stops it).

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No remaining issues from my side — every prior comment has been addressed and the latest commit picks up the tail-recurse fix. I'm not auto-approving only because this touches the core event-loop liveness predicate, loadEntryPoint, and the --hot reload state machine (with Windows-specific raw uv-loop field reads), which feels like it warrants a human sign-off.

Extended reasoning...

Overview

This PR changes how Bun handles an unsettled top-level await in the main entry point so the process exits cleanly (matching Node.js) instead of busy-spinning on POSIX or hanging forever on Windows. It touches three files:

  • src/jsc/event_loop.zig: adds hasAnyHandleWork(), hasAnyHandleWorkIgnoringForeverTimer() (Windows reads raw uv_loop_t fields: active_handles, active_reqs.count, pending_reqs_tail, endgame_handles), and waitForPromiseOrLoopExit().
  • src/jsc/VirtualMachine.zig: routes loadEntryPoint's TLA wait through the new variant; reworks reload()'s .pending arm to distinguish abandoned TLAs from in-flight ref'd I/O; adds a stranded-hot_reload_deferred consumer in reportExceptionInHotReloadedModuleIfNeeded with a tail-recurse to re-read the fresh promise; adds a post-reloadEntryPoint microtask drain.
  • test/regression/issue/29546.test.ts: six regression tests including a --hot watcher round-trip test.

Security risks

None identified. No auth, crypto, permissions, or untrusted-input parsing is involved; the change is purely event-loop scheduling/liveness logic.

Level of scrutiny

High. loadEntryPoint and EventLoop.autoTick/waitForPromise are on the hot path of every bun run / bun -e invocation, and the --hot reload state machine has subtle ordering invariants (defer vs. proceed, microtask drain timing, Windows forever_timer ref accounting). The PR went through ~15 iterative review rounds where real correctness issues were found and fixed at each step (unhandled-error-counter short-circuit, concurrent_tasks race, microtask-drain-before-liveness ordering, Windows forever_timer keeping uv_loop_alive permanently true, zombie-continuation regression in reload(), stranded-defer triggers, accidental source revert in a "ci: retrigger" commit). All of those are now resolved and pinned by tests, but the density of subtle interactions argues for a maintainer's eyes on the final shape rather than bot-only approval.

Other factors

  • All 15+ prior inline comments (mine and CodeRabbit's) are marked resolved; the most recent commit (360dfa4) directly implements my last suggestion (tail-recurse after reload()).
  • The bug-hunting system found no issues on this revision.
  • Test coverage is good: the six new tests pin the primary fix, the regression guard, the isEventLoopAlive vs hasAnyHandleWork distinction, the microtask-drain-before-break, and the --hot reload-after-abandoned-TLA path.
  • This is not a mechanical/config change — it's a semantic change to runtime exit behavior and the hot-reload defer/proceed decision, which is exactly the kind of design surface a human should ratify.

Build 51653 had 13 jobs expire due to BuildKite agent outage
(Expired status, not test failures). Plus the usual persistent flakes:
  - test-http-should-emit-close (Windows, all 3 lanes)
  - hot.test.ts sourcemap generation (documented cross-PR flake)
  - jsc-stress mprotect failed: 487
  - bun-install-registry hoisting peers
  - s3-storage-class UnknownError
  - node-http-uaf EADDRINUSE (port 0 race)
  - spawn.test.ts timeout

None of these are caused by this PR — the core fix (loadEntryPoint
TLA exit via waitForPromiseOrLoopExit, reload() defer/proceed gate
via hasAnyHandleWorkIgnoringForeverTimer) is unchanged and all its
regression tests pass.
Comment thread src/jsc/VirtualMachine.zig Outdated
…30217)

Build 51679: only ERROR is test-http-should-emit-close, the same
Windows flake that's been hitting every build on this PR since
51163. Scan of 80 recent buildkite builds found this test failing
on 67 unrelated PRs with identical signature (timeout on Windows
only — 11 aarch64, 2019 x64, 2019 x64-baseline).

Multiple commits on main explicitly document it as '~100% flake on
main', not caused by any single PR:
  - 5807f25, 30ba155, 087b80c, 9367f69, c53835a,
    3eb1f30

Dedicated fix attempt in #30217 identified the root cause as a
Windows AFD-poll race in HTTP socket close (NodeHTTPResponse.doPause
TODO), unrelated to TLA runtime. #30217 was closed; flake persists.

All other CI issues on this PR are retried-and-passed flakes
(jsc-stress, AsyncLocalStorage, bun-install-registry).

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new issues found on this pass, but this rewires core event-loop liveness (TLA exit semantics, --hot reload's .pending deferral, Windows-specific uv handle accounting) — worth a human sign-off given the blast radius and the number of iterations it took to converge.

Extended reasoning...

Overview

This PR changes how Bun handles unsettled top-level await in the main entry point. It adds three new EventLoop helpers (hasAnyHandleWork, hasAnyHandleWorkIgnoringForeverTimer, waitForPromiseOrLoopExit) in src/jsc/event_loop.zig, and routes loadEntryPoint through them instead of the blocking waitForPromise. It also substantially reworks the --hot reload path in src/jsc/VirtualMachine.zig: reload()'s .pending arm now drains microtasks and consults a forever-timer-aware liveness check before deferring, reportExceptionInHotReloadedModuleIfNeeded gains a stranded-flag consumer with tail-recursion, and reload() drains microtasks after reloadEntryPoint for out-of-tick() callers. A 256-line regression test file covers six scenarios including a --hot round-trip test.

Security risks

None identified. No auth, crypto, permissions, or untrusted-input parsing is touched. The change is purely about event-loop scheduling and process-exit timing.

Level of scrutiny

High. This is the core runtime event loop — the code that decides whether a Bun process keeps running or exits. The change has platform-divergent behavior (POSIX uws active counter vs. Windows uv_loop_alive / active_handles / active_reqs), interacts with hot-reload state machines, and introduces a new exit state (loadEntryPoint returning with a still-.pending promise) that downstream code was not originally designed for. The Windows hasAnyHandleWorkIgnoringForeverTimer heuristic directly reads libuv internals (active_handles, active_reqs.count, pending_reqs_tail, endgame_handles) and subtracts an assumed contribution — the in-code comment explicitly calls it a heuristic.

Other factors

The PR went through roughly twenty review iterations over two weeks, with multiple substantive bugs found and fixed along the way (using isEventLoopAlive instead of raw handle liveness, missing concurrent_tasks check, missing microtask drain before the liveness break, the forever_timer Windows issue, the ref'd-vs-abandoned TLA distinction in reload(), an accidental full revert in a "ci: retrigger" commit, etc.). All inline comments are resolved except one trivial comment-wording nit about a hardcoded line number. An earlier attempt at this same fix (#27215) caused test-runner regressions, which is why the scope was carefully narrowed to loadEntryPoint only. Given the subtlety, the platform-specific reasoning, and the history, this should get human eyes before merge.

Hardcoded line references drift the moment anything is inserted
above the function, and VirtualMachine.zig is 2500+ lines and
actively churning. Naming the variable is stable under refactor
and equivalent in clarity.
Comment thread src/jsc/event_loop.zig

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any new issues, but this changes core event-loop liveness semantics (waitForPromiseOrLoopExit, hasAnyHandleWork*) plus the --hot reload state machine, with platform-specific libuv internals on Windows — worth a human sign-off given how many subtle interactions surfaced during iteration.

Extended reasoning...

Overview

This PR changes how Bun's main entry point waits on a top-level await: instead of spinning/hanging when only unref'd work is pending, loadEntryPoint now routes through a new EventLoop.waitForPromiseOrLoopExit that breaks once hasAnyHandleWork() reports nothing left to make progress. It also adds hasAnyHandleWorkIgnoringForeverTimer() (Windows-specific carve-out reading raw uv_loop.active_handles / active_reqs.count / pending_reqs_tail / endgame_handles), reworks reload()'s .pending deferral and reportExceptionInHotReloadedModuleIfNeeded() so --hot recovers after an abandoned TLA, and adds a post-reloadEntryPoint microtask drain. Six regression tests cover the main fix plus the unhandledRejection/--hot edge cases.

Security risks

None identified. This is event-loop control-flow / process-liveness logic; no auth, crypto, parsing of untrusted input, or permission surfaces are touched.

Level of scrutiny

High. src/jsc/event_loop.zig and src/jsc/VirtualMachine.zig are core runtime — the predicate that decides whether the process exits, and the --hot reload state machine. The Windows branch of hasAnyHandleWorkIgnoringForeverTimer reads libuv internals directly and subtracts a heuristic forever_timer contribution. The PR went through ~20 review iterations in which the bug-hunting system found several real issues (the isEventLoopAlive short-circuit on unhandled_error_counter, missing concurrent_tasks check, missing microtask drain before the liveness break, the Windows forever_timer ref'd-handle problem, the zombie-continuation regression in reload()'s .pending fall-through) — all addressed, but the density of subtle interactions argues for a maintainer who knows the event-loop invariants to sign off.

Other factors

  • An earlier attempt at this fix (#27215) caused test-runner regressions; this PR explicitly scopes the new behavior to loadEntryPoint only and leaves waitForPromise / loadEntryPointForTestRunner / waitForPromiseWithTermination unchanged, which is the right call but means a maintainer should confirm the scoping.
  • One open inline nit remains (the beforeExit follow-up at event_loop.zig:700) — explicitly labeled non-blocking / out-of-scope.
  • CI build #51693 has test-http-should-emit-close-when-connection-is-aborted.ts timing out on three Windows lanes; commit messages reference this as a known cross-PR flake (#30217), but a human should confirm it's unrelated before merge.
  • This is a user-visible behavior change (unsettled TLA now exits instead of waiting), aligning with Node.js but still a semantic shift.

…mplete

Before this PR, an unsettled top-level await hung forever inside
`waitForPromise` and never reached `onBeforeExit`. With the new
`waitForPromiseOrLoopExit` bail-out, an abandoned TLA now flows into
`onBeforeExit`, which emits `process.on('beforeExit', …)`. A handler
that synchronously resolves the TLA queues the continuation as a JSC
microtask — but `isEventLoopAlive()` checks handles/tasks/refs, not
microtasks, so the inner `while` below would skip and the
continuation would be silently dropped before process exit.

Drain microtasks after `dispatchOnBeforeExit` (and after the
re-dispatch inside the outer loop, for symmetry) so a resolving
beforeExit handler's continuation runs before liveness is rechecked.
Same pattern the new `waitForPromiseOrLoopExit` uses for
`unhandledRejection` handlers — this is the sibling fix one frame
later in the exit sequence.

Test: `beforeExit handler that resolves the TLA runs to completion`
pins it. Without the drain the test fails (got=done is never printed);
with it, the continuation runs and stdout matches.
Comment thread src/jsc/VirtualMachine.zig Outdated
Comment thread src/jsc/event_loop.zig
///
/// `concurrent_tasks.isEmpty()` is a single atomic acquire-load; it's safe
/// from the JS thread without locking. Including it closes a narrow race
/// where a thread (e.g. the hot-reloader watcher at `hot_reloader.zig:278`)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 nit: the doc-comment on hasAnyHandleWork() references "the hot-reloader watcher at hot_reloader.zig:278" — a hardcoded cross-file line number that will drift on any edit above that line. This is the same drift hazard you already accepted and fixed in this PR for the "line 720" reference in VirtualMachine.zig (commit 93e2313). For consistency, replace with a stable symbolic reference like "in HotReloader.Task.enqueue()".

Extended reasoning...

What the issue is. The new doc-comment on hasAnyHandleWork() (src/jsc/event_loop.zig:591) reads: "Including it closes a narrow race where a thread (e.g. the hot-reloader watcher at hot_reloader.zig:278) pushes to the concurrent queue without bumping concurrent_ref…". The hot_reloader.zig:278 reference is a hardcoded cross-file line number embedded in a doc-comment. Today line 278 is that.reloader.enqueueTaskConcurrent(&that.concurrent_task); inside HotReloader.Task.enqueue(), so the reference is accurate — but it will silently go stale the moment anything is inserted or removed above that line in hot_reloader.zig.

Why this is worth a one-line fix in this PR. This is exactly the same drift hazard that was already flagged and fixed in this very PR. Resolved inline-comment 3187088494 pointed out that "the stale capture at line 720 above" in VirtualMachine.zig would rot, and commit 93e2313 ("comment: refer to the local promise by name, not line number") changed it to "the stale local promise above". The author has demonstrated agreement that hardcoded line numbers in comments are a maintenance hazard worth fixing. Consistency within the same PR argues for the same treatment here — and a cross-file line reference is strictly more brittle than a same-function one, since hot_reloader.zig churns independently of event_loop.zig and there is no "above" hint to fall back on.

Why existing conventions don't excuse it. A grep of src/jsc/*.zig shows this is the only file.zig:NNN cross-file doc-comment reference in the directory, so it is not an established convention being followed — it's a one-off introduced by this PR.

Step-by-step proof of the drift.

  1. Today: hot_reloader.zig:278 is that.reloader.enqueueTaskConcurrent(&that.concurrent_task); inside Task.enqueue(). The doc-comment in event_loop.zig:591 says "at hot_reloader.zig:278". Correct.
  2. A future PR adds a single field, import, or helper function anywhere in hot_reloader.zig lines 1–277. Every line below shifts down by 1.
  3. hot_reloader.zig:278 is now whatever the old line 277 was — perhaps that.next = null; or a closing brace. The doc-comment in event_loop.zig still says "at hot_reloader.zig:278".
  4. A reader following the comment opens hot_reloader.zig to line 278, finds something unrelated to enqueueTaskConcurrent, and is left guessing what the "e.g." was supposed to point at.
  5. No compiler, linter, or formatter will ever catch this. The comment is never updated because nothing enforces it.

Impact. Purely cosmetic — no runtime effect whatsoever. The reference lives inside an "e.g." parenthetical, so it isn't load-bearing for understanding the function's contract. This should not block the PR.

How to fix. Replace "the hot-reloader watcher at hot_reloader.zig:278" with "the hot-reloader watcher in HotReloader.Task.enqueue()" (or just "the hot-reloader watcher's enqueueTaskConcurrent"). The function name is a stable symbolic anchor that survives line-number churn, and if it's ever renamed, grep/IDE will surface every reference. One-word edit, zero risk.

…tions

207afb3 tried to fix a nit claude-bot explicitly labeled non-blocking
/ out-of-scope: that `onBeforeExit` doesn't drain microtasks after
`dispatchOnBeforeExit`, so a `beforeExit` handler that resolves the
abandoned-TLA promise has its continuation dropped. I added a drain
(and a regression test) and it broke
`test/js/node/test/parallel/test-event-capture-rejections.js` on
Windows aarch64 (Node compatibility test, upstream code we can't
modify — verified 0/10 pass locally after the change, 10/10 before).

The drain causes a microtask to fire during onBeforeExit that
triggers one of the `process.on('unhandledRejection', err => throw
err)` handlers the test installs, producing the `error: kaboom` at
line 80:55 in CI build 51788.

Revert the onBeforeExit change and drop the regression test. The
beforeExit-resolves-TLA case remains a follow-up (as the bot
originally labeled it), to be investigated separately without
perturbing the Node compatibility surface.
@robobun

robobun commented May 13, 2026

Copy link
Copy Markdown
Collaborator Author

Closing since #29546 was closed as not-reproducible on 1.3.13 Linux.

The bug IS reproducible — I verified it end-to-end on this branch:

  • POSIX: busy-loops at ~100% CPU for the full AbortSignal.timeout duration (from autoTick taking the tickWithoutIdle zero-timeout poll branch)
  • Windows: hangs forever (uv_run(UV_RUN_NOWAIT) early-returns when uv__loop_alive() is false, so the unref'd uv timer that drains Bun's own timer heap never fires)

The fix routes loadEntryPoint's TLA wait through a new waitForPromiseOrLoopExit that breaks when nothing in the event loop can progress the promise — matches Node.js semantics. Also closes #14951 (await new Promise(r => {}) same 100% CPU busy-loop).

If the issue gets reopened with a fresh repro, happy to rebase this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

2 participants