Fix worker.unref() causing pending messages to be lost#28645
Conversation
|
Updated 1:05 PM PT - May 19th, 2026
❌ Your commit 🧪 To try this PR locally: bunx bun-pr 28645That installs a local version of the PR into your bun-28645 --bun |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughParent event-loop reference accounting added for worker message delivery: scheduling now refs/unrefs the parent loop per message, pending-parent refs are tracked and unwound on failure or worker exit, and regression tests covering unref semantics were added. Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/bindings/webcore/Worker.cpp`:
- Around line 251-269: The parent event-loop ref acquired via
parentContext->refEventLoop() can leak if postTaskToWorkerGlobalScope(...) never
runs; wrap that ref in an RAII guard object (e.g., ParentEventLoopRefGuard) or
similar token captured by the worker task so destruction releases the ref, and
ensure the guard's destructor posts the unref back to the parent via
ScriptExecutionContext::postTaskTo(parentContextId, ...) calling
parentCtx.unrefEventLoop(); alternatively, check postTaskToWorkerGlobalScope()
return/failure and on failure immediately postTaskTo(parentContextId, ...) to
run unrefEventLoop(); modify the lambda capture to take ownership of the
guard/token so dropped work automatically calls unrefEventLoop() and remove the
existing lone direct unref only from successful dispatch paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b9b7866a-e91e-48e8-9689-17d34fbb72d6
📒 Files selected for processing (2)
src/bun.js/bindings/webcore/Worker.cpptest/regression/issue/28643.test.ts
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/bindings/webcore/Worker.cpp`:
- Around line 462-474: dispatchExit() can call unrefEventLoop() for refs already
transferred by fireEarlyMessages() or drainEvents() because
m_pendingTasksParentLoopRefs may be stale; instead compute leakedRefs from the
actual pending tasks at exit (or ensure the counter is decremented when tasks
are moved). Update the block that captures leakedRefs to inspect m_pendingTasks
under m_pendingTasksMutex (or update the code paths in
fireEarlyMessages()/drainEvents() to decrement m_pendingTasksParentLoopRefs when
moving work) so that the value passed into ScriptExecutionContext::postTaskTo
reflects only refs still held by m_pendingTasks; keep references to
m_pendingTasksParentLoopRefs, m_pendingTasks, dispatchExit(),
fireEarlyMessages(), drainEvents(), and the posted lambda calling
context.unrefEventLoop() in your changes.
- Around line 271-280: The code decides whether to queue (append to
m_pendingTasks and increment m_pendingTasksParentLoopRefs) or to post
(ScriptExecutionContext::postTaskTo) before taking m_pendingTasksMutex, which
races with dispatchOnline()/fireEarlyMessages() that flip OnlineFlag and drain
the queue; fix by grabbing m_pendingTasksMutex around the check/update of
m_onlineClosingFlags/OnlineFlag and the queue append/post decision so the
transition and enqueue are serialized: i.e., inside the Locker for
m_pendingTasksMutex check (m_onlineClosingFlags & OnlineFlag), if still offline
do the m_pendingTasksParentLoopRefs++ and append, otherwise call
ScriptExecutionContext::postTaskTo and, on failure, call
parentContext->unrefEventLoop(); ensure no ordering of operations related to
m_pendingTasks or OnlineFlag happens outside that lock.
In `@test/regression/issue/28643.test.ts`:
- Around line 79-84: The test attaches the 'error' handler too late and can miss
the worker bootstrap failure; update the sequence around the Worker instance
(variable w) so that w.on('error', ...) is registered immediately after creating
the Worker (and before calling w.postMessage('go')), leaving w.unref() placement
flexible, to ensure the missing-script failure is caught by the handler rather
than becoming an unhandled worker error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9e493ff7-ec52-4c59-a7fa-75b66a462d76
📥 Commits
Reviewing files that changed from the base of the PR and between 3d80b42 and 42279d562c5a7e350909d2b631ff4a790c8ad76d.
📒 Files selected for processing (3)
src/bun.js/bindings/webcore/Worker.cppsrc/bun.js/bindings/webcore/Worker.htest/regression/issue/28643.test.ts
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/bindings/webcore/Worker.cpp`:
- Around line 709-719: The lambda passed to ScriptExecutionContext::postTaskTo
captures an unused/shadowed variable "ports"
(messageWithMessagePorts.transferredPorts was already moved and the lambda
declares a new local auto ports via MessagePort::entanglePorts), so remove
"ports" from the lambda capture list (leave message = messageWithMessagePorts
and protectedThis = Ref{*worker} intact) to avoid capturing an empty/shadowed
variable and keep the local auto ports declaration inside the lambda as-is.
In `@test/regression/issue/28643.test.ts`:
- Line 4: Add an explicit timeout option to the test call that spawns worker
processes so IPC delays under ASAN/CI don't cause flaky failures: update the
test("worker.unref() does not prevent receiving pending messages", async () => {
... }, ...) invocation to pass a third argument like { timeout: 15000 } to the
test() call so the test runner uses a 15s timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b1b934e4-1a3f-40ac-b45c-320d314c4b72
📥 Commits
Reviewing files that changed from the base of the PR and between 42279d562c5a7e350909d2b631ff4a790c8ad76d and f912b8fe602c6fa9c2d3714a17d99320b7944bea.
📒 Files selected for processing (2)
src/bun.js/bindings/webcore/Worker.cpptest/regression/issue/28643.test.ts
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/bun.js/bindings/webcore/Worker.cpp (1)
376-382:⚠️ Potential issue | 🟡 MinorRemove
drainEvents()- it is unused dead code.
drainEvents()is declared as public but never called anywhere in the codebase. Since it isn't used, the theoretical ref-counting concern is moot. The method should be removed entirely.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/bun.js/bindings/webcore/Worker.cpp` around lines 376 - 382, Remove the unused public method Worker::drainEvents and its implementation: delete the method declaration and definition (which references m_pendingTasks, m_pendingTasksMutex and calls postTaskToWorkerGlobalScope(WTF::move(task))); ensure no remaining references to drainEvents remain and that m_pendingTasks/m_pendingTasksMutex usage is unchanged elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/bun.js/bindings/webcore/Worker.cpp`:
- Around line 376-382: Remove the unused public method Worker::drainEvents and
its implementation: delete the method declaration and definition (which
references m_pendingTasks, m_pendingTasksMutex and calls
postTaskToWorkerGlobalScope(WTF::move(task))); ensure no remaining references to
drainEvents remain and that m_pendingTasks/m_pendingTasksMutex usage is
unchanged elsewhere.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4b453d22-e7a1-4fd4-8115-2744b0c78380
📥 Commits
Reviewing files that changed from the base of the PR and between f912b8fe602c6fa9c2d3714a17d99320b7944bea and 90696ca1fe2102bf2d62758792f04eef9de78c3d.
📒 Files selected for processing (2)
src/bun.js/bindings/webcore/Worker.cpptest/regression/issue/28643.test.ts
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🟣
src/bun.js/bindings/webcore/Worker.cpp:717-738— dispatchErrorWithMessage() (line 444) and dispatchErrorWithValue() (line 460) both call ScriptExecutionContext::postTaskTo() to post error events to the parent WITHOUT any refEventLoop()/unrefEventLoop() wrapping, leaving the same race the PR set out to fix: if a worker is unref()'d and no postMessage calls have been made, the parent event loop can exit before the error task runs and silently drop the 'error' event. This is a pre-existing issue — the PR did not modify these functions — but the PR's fix is incomplete in that jsFunctionPostMessage and Worker::postMessage were correctly updated while the sibling error-dispatch paths were left unprotected; the existing test 3 avoids exposing this because it calls w.postMessage('go') which acquires a ref through the newly-fixed messaging path.Extended reasoning...
The PR adds ref/unref accounting around two cross-thread message delivery paths to fix the race where an unref'd parent event loop could exit before processing a worker response. The fix is correct for Worker::postMessage (parent→worker) and jsFunctionPostMessage (worker→parent), but it does not extend to the error-dispatch methods.
Worker::dispatchErrorWithMessage (line 444) and Worker::dispatchErrorWithValue (line 460) both call ScriptExecutionContext::postTaskTo(ctx->identifier(), lambda) directly, without any refEventLoop() call before posting, and the lambda does not call unrefEventLoop() when it completes. Compare with the corrected jsFunctionPostMessage path, which calls context->refEventLoop() before posting and has the task call context.unrefEventLoop() on completion, with an explicit unref in the failure path too.
The specific code path that triggers the bug: a worker is created to load a nonexistent or malformed file, the parent calls w.unref(), and no postMessage calls are made (so m_pendingTasksParentLoopRefs remains 0 throughout). When the worker fails to initialize, WebWorker__dispatchError → dispatchErrorWithMessage calls postTaskTo without holding any ref on the parent. dispatchExit runs with leakedRefs=0 (no postMessage refs). If the parent event loop has no other active handles and has been unref'd, it may exit before the error task is scheduled or processed. The 'error' handler never fires.
The existing test 3 in 28643.test.ts avoids triggering this because the worker script is a nonexistent file but the test also calls w.postMessage('go') before unref-ing. That postMessage call goes through the newly-fixed Worker::postMessage path, which increments m_pendingTasksParentLoopRefs and calls refEventLoop(), keeping the parent alive long enough for the error event to fire. A user who does NOT call postMessage — which is a perfectly reasonable pattern for a simple fire-and-forget worker — will get silently dropped error events.
To address the refutation that argues this is pre-existing and not actionable: the bug predates this PR and the PR did not introduce it. However, the PR analyzed this exact class of bug (cross-thread liveness accounting) and produced a partial fix. Flagging the gap is appropriate because the PR's stated goal — preventing message/event loss on unref'd workers — is not fully accomplished. The fix for dispatchErrorWithMessage and dispatchErrorWithValue follows the same pattern already established in this PR: call parentContext->refEventLoop() before postTaskTo and call context.unrefEventLoop() inside the task (and unref immediately if postTaskTo returns false).
Step-by-step proof: (1) JS: const w = new Worker('./bad.js'); w.on('error', handler); w.unref(); (no postMessage). (2) m_pendingTasksParentLoopRefs = 0, concurrent_ref decremented to 0 by unref. (3) Worker fails: WebWorker__dispatchError → dispatchErrorWithMessage → postTaskTo(...) [no refEventLoop called]. (4) dispatchExit runs with leakedRefs=0, posts exit task. (5) Parent event loop sees hasPendingRefs()=false and no other handles — exits. (6) Error task and exit task are never processed. The 'error' handler never fires.
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🔴
src/bun.js/bindings/webcore/Worker.cpp:448-469— In dispatchErrorWithValue, the PR calls ctx->refEventLoop() before postTaskTo, but if deserialization sets a JSC exception (e.g. OOM or heap termination), RETURN_IF_EXCEPTION(scope, ) causes an early void return from the lambda, skipping context.unrefEventLoop(). This leaves an unbalanced parent event loop ref, causing the parent process to hang indefinitely.Extended reasoning...
What the bug is: In dispatchErrorWithValue (Worker.cpp lines 448-469), the PR introduced ctx->refEventLoop() before the postTaskTo call and placed context.unrefEventLoop() at the very end of the lambda body. The pre-existing RETURN_IF_EXCEPTION(scope, ) macro at line 458 can cause an early void return from the lambda, bypassing context.unrefEventLoop().
The specific code path that triggers it: serialized->deserialize() is called with SerializationErrorMode::NonThrowing, but this mode does not guarantee no exception is set on the JSC VM. When CloneDeserializer::deserialize encounters an internal exception (such as OOM during property access or allocation), it returns SerializationReturnCode::ExistingExceptionError. maybeThrowExceptionIfSerializationFailed does NOT clear the exception for ExistingExceptionError; the scope exception register stays set. Then RETURN_IF_EXCEPTION(scope, ) fires, returning early from the lambda as a void return, skipping context.unrefEventLoop() at the bottom.
Why existing code does not prevent it: The if (!postTaskTo) failure branch only fires when postTaskTo itself returns false, meaning the parent ScriptExecutionContext was not found. It does not fire when the task was successfully posted but the lambda exits early via RETURN_IF_EXCEPTION. Since the task was delivered successfully, there is no other recovery path for the leaked ref.
What the impact would be: ctx->refEventLoop() was already called at line 451, incrementing the parent event loop's concurrent_ref counter. With the lambda returning early, concurrent_ref is never decremented. hasPendingRefs() returns true permanently, the parent event loop never exits, and the parent process hangs indefinitely. This is triggered by OOM, heap termination, or any JSC exception that propagates through the deserialize call.
How to fix it: Call context.unrefEventLoop() before RETURN_IF_EXCEPTION(scope, ) so it executes on all exit paths. Alternatively, use RAII (a scope-exit guard) to ensure unrefEventLoop() is called unconditionally regardless of how the lambda exits. The analogous dispatchErrorWithMessage function does not share this issue because its lambda contains no RETURN_IF_EXCEPTION.
Step-by-step proof:
- Worker thread calls dispatchErrorWithValue. ctx->refEventLoop() increments concurrent_ref to 1.
- postTaskTo succeeds (parent context found); the if (!postTaskTo) failure branch is NOT taken.
- Parent thread executes the lambda. serialized->deserialize() encounters OOM; the VM scope exception is set via ExistingExceptionError path.
- RETURN_IF_EXCEPTION(scope, ) fires at line 458 - early void return from the lambda.
- context.unrefEventLoop() at the end of the lambda is never reached.
- concurrent_ref stays at 1. hasPendingRefs() returns true. Parent event loop never exits. Process hangs indefinitely.
5b2249f to
51dc90b
Compare
b8f256a to
6981db9
Compare
There was a problem hiding this comment.
The atomic m_parentLoopRefs redesign in 8f17fa0 looks like it closes the ref-leak windows I flagged earlier (counter is incremented under the mutex alongside the ClosingFlag check, and dispatchExit atomically claims whatever's outstanding), but given how many iterations this concurrency logic has been through I'd like a human to sign off before it lands.
Extended reasoning...
Overview
This PR fixes #28643 by ref'ing the parent event loop while worker messages are in-flight so an unref()'d worker doesn't cause the parent to exit before the round-trip completes. It touches Worker.cpp/Worker.h (cross-thread ref accounting in postMessage, jsFunctionPostMessage, dispatchExit), adds a regression test, and allowlists one new aarch64 LSE outline-atomic helper introduced by the new std::atomic<uint32_t>.
What changed since my last review
My comment on 2026-04-22 flagged that the simplified design at dcd03ec/e24a506 had reverted earlier guards and was timing out test-worker-message-port-transfer-terminate.js / test-worker-safe-getters.js in CI. Commit 8f17fa0 replaces that with a third design: an atomic m_parentLoopRefs counter incremented under m_pendingTasksMutex (after a ClosingFlag gate) and drained by dispatchExit via exchange(0) under the same mutex. The task lambda does fetch_sub(1) and only posts unrefEventLoop if the prior value was > 0, so a ref is released exactly once by either the task or the exit-compensation loop. Because the lambda runs on the worker thread and dispatchExit runs on the same thread, the fetch_sub/exchange pair can't interleave in a way that double-unrefs; the post-exchange underflow is benign since ClosingFlag blocks further increments. This appears to address the post-online drop and pre-online race I raised.
Security risks
None — this is event-loop liveness bookkeeping, no auth/crypto/input-handling surface.
Level of scrutiny
High. This is hand-rolled cross-thread ref counting on a hot path (postMessage) where an off-by-one means either a process hang or premature exit. The PR has cycled through ~10 fix commits in response to race reports, including two acknowledged-but-deferred edge cases (post-online postTaskTo-succeeds-then-dropped, and the fireEarlyMessages async-batch crash window). The current design is the cleanest yet, but the iteration count alone is a strong signal that a maintainer familiar with the worker lifecycle should read it end-to-end.
Other factors
Build #47077 on 6026b68 was still in progress at the time the timeline was captured; CI results on the new design should be confirmed green (specifically the two node-compat tests that timed out on dcd03ec). The jsFunctionPostMessage change also includes an unrelated drive-by fix for a pre-existing releaseException() copy-paste bug.
There was a problem hiding this comment.
The atomic m_parentLoopRefs redesign in 8f17fa0 addresses the leak paths I flagged last round and the bug hunter found nothing new, but given the number of iterations on this ref-counting logic and that CI hasn't yet gone green on the new design, I'd like a human to sign off on the final approach before merge.
Extended reasoning...
Overview
This PR fixes #28643 (worker.unref() drops pending messages) by ref'ing the parent event loop around cross-thread message delivery in Worker::postMessage and jsFunctionPostMessage. The implementation has gone through ~10 design iterations in response to review feedback about ref-leak races. The latest commit (8f17fa0) introduces a new std::atomic<uint32_t> m_parentLoopRefs that tracks all outstanding parent-loop refs; postMessage increments it under m_pendingTasksMutex (after checking ClosingFlag), the delivery lambda fetch_subs and only unrefs if the prior value was >0, and dispatchExit sets ClosingFlag and exchange(0)s the counter under the same mutex to sweep up any in-flight refs whose tasks will never run. Also touched: a one-line allowlist addition for __aarch64_swp4_relax, a new root-pch.h (appears to be from a main merge, unrelated to the fix), and a regression test.
Security risks
None. This is internal event-loop liveness bookkeeping; no auth, crypto, untrusted input parsing, or permission surfaces are involved.
Level of scrutiny
High. This is concurrency-sensitive code on a hot path (worker_threads postMessage) where an unbalanced ref/unref hangs the parent process. The PR history shows the design has been reworked repeatedly to close TOCTOU windows between postMessage, fireEarlyMessages, and dispatchExit. The new atomic-counter approach is cleaner than the prior mutex-guarded integer, and it does close the specific paths I raised last round (ClosingFlag now set under the mutex in dispatchExit; both pre-online and post-online refs are tracked). I did not find a concrete leak in the new design — the one theoretical concern is fetch_sub underflow if more than one task executes after dispatchExit's exchange(0), but since both run serially on the worker thread that shouldn't occur in practice.
Other factors
- CI on 6026b68 (build #47077) shows widespread build-zig/build-cpp failures; the follow-up commit message ("Retrigger gate (libuv TLS fetch infra issue)") suggests infra, but there's no confirmed green run on the 8f17fa0 design yet.
- The diff includes an unrelated new file
src/bun.js/bindings/root-pch.h— likely a merge artifact, but worth confirming it's intentional. - Given how many subtle races were found and fixed iteratively here, a maintainer familiar with the worker lifecycle (terminate/exitAndDeinit ordering, whether tasks can ever run post-
dispatchExit) should review the final invariant rather than rely solely on automated review.
5320b30 to
851deb8
Compare
There was a problem hiding this comment.
The latest m_parentLoopRefs atomic design is cleaner than the prior iterations and I don't see an obvious leak in the common paths, but given how many subtle ref-counting races this PR has cycled through (and that it changes core worker lifecycle / event-loop liveness for every postMessage), a human should sign off — in particular it's worth double-checking the fetch_sub underflow behavior if more than one already-posted task runs after dispatchExit's exchange(0), and confirming CI #47086 is green.
Extended reasoning...
Overview
This PR fixes #28643 by ref-counting the parent event loop around cross-thread postMessage delivery so an unref()'d worker doesn't let the parent exit before the round-trip completes. It touches Worker::postMessage, jsFunctionPostMessage, dispatchExit, adds std::atomic<uint32_t> m_parentLoopRefs to Worker.h, adds a regression test, and adds one LSE outline-atomic to the aarch64 baseline allowlist (a side effect of the new exchange/fetch_* calls).
The current revision (851deb8) is a new design relative to what I last commented on: instead of counting via m_pendingTasks.size(), every postMessage increments m_parentLoopRefs and calls refEventLoop() under m_pendingTasksMutex (after a ClosingFlag check under the same lock); the delivery lambda does fetch_sub(1) and only posts unrefEventLoop if the prior value was > 0; dispatchExit sets ClosingFlag and does exchange(0) under the same lock to sweep any in-flight refs. The online postTaskTo is also issued while holding the lock so it can't interleave with dispatchExit's snapshot.
Security risks
None apparent — this is event-loop liveness bookkeeping, not auth/crypto/input handling. The drive-by fix to disentangledPorts.releaseException() removes a pre-existing UB path.
Level of scrutiny
High. This is hot-path, cross-thread ref-counting that determines whether the parent process exits or hangs. Over the life of this PR I (and CodeRabbit) flagged at least 8 distinct races/leaks across successive revisions, several of which were confirmed by CI hangs. The current design looks like the most robust so far (the ClosingFlag-under-mutex + atomic-counter sweep closes the pre-online race and the post-online "task dropped during teardown" leak), but it's a fresh rewrite that hasn't yet had a green CI run posted to the thread (#47086 still building per the status comment).
Other factors
- One residual edge case worth a human eye: if
dispatchExit'sexchange(0)runs and then two or more already-queued worker-side tasks still execute afterwards, the first seesprev == 0and skips, but the second seesprev == UINT32_MAX(underflowed) which is> 0and would post an extraunrefEventLoop. Whether worker-thread sequencing actually allows multiple tasks to run afterdispatchExitreturns is the question — robobun's earlier note that "terminate() drains pending tasks before VM teardown" suggests it might. fireEarlyMessagesno longer needs to touch the counter (the lambdas decrement it themselves), which fixes the earlier async-batch-dropped concern.- The author has explicitly acknowledged a remaining narrow window in earlier comments; that plus the iteration count makes this a poor fit for bot-only approval.
0177d3a to
ed8dfe7
Compare
The hot.test.ts 'should work with sourcemap generation' flake hit ERROR bucket on build 51604 instead of WARNING bucket — but this is a well-documented global flake, not PR-specific: - #28645 retriggered 3x ('seen 5+ times, unrelated to PR') - #29945, #30018, #30026 all retriggered for the same test - #29587 build 51335: 'windows-11-aarch64 shard 6/8: hot.test.ts — known hot-reload flake' - #29735 open: 'test(hot): use atomic rename for 2MB rewrites in sourcemap-generation test' — test-level workaround - #29740 merged: 'hot: defer reload while a rejected module is unreported' — prior runtime fix attempt Local reproduction: ~5% flake rate at my branch tip (e86f538), at 5215594 (before drain-in-reload), and at 0ce2096 (before the forever-timer carve-out). Rate is identical across all three, which means the flake is inherent to the test (qemu-aarch64 agent slowness hitting the 10s timeout before 50 hot-reload cycles complete) and not caused by any change on this PR. Also same plain-old test-http-should-emit-close flake on 2019 x64 baseline lanes (the one I've been retrying since build 51163).
Hot reload 'should work with sourcemap generation' flakes at ~5% on windows-11-aarch64 under qemu: the 10s per-test timeout fires before 50 hot-reload cycles complete, independent of this PR. Cross-PR evidence from the last 24h: - #28645 retriggered 3 times for the same test - #29945, #30018, #30026 all retriggered - #29587 build 51335: documented as 'known hot-reload flake' - #29735 open: test-level workaround in progress - #29740 merged: runtime fix for a related --hot sourcemap race Local repro: identical ~5% rate at my tip (e86f538), at 5215594 (before the stranded-defer fix), and at 0ce2096 (before the forever-timer carve-out). Rate unchanged across the three, confirming the flake is not in this PR's footprint. Same persistent test-http-should-emit-close flake on 2019 x64-baseline (seen in 51163, 51423, 51448, 51473, 51503, 51528, 51604).
There was a problem hiding this comment.
No new issues found and all prior review threads are resolved, but this is cross-thread event-loop ref-counting in the Worker core with two acknowledged residual gaps (the nested-worker m_parentBunVM lifetime and the narrow post-online drop window), plus an unrelated test rewrite bundled in — worth a human pass before merge.
Extended reasoning...
Overview
This PR adds per-message parent-event-loop ref/unref accounting across Worker::enqueueToWorker, enqueueToParent, drainInbox, and dispatchExit in src/jsc/bindings/webcore/Worker.cpp, moves the State::Closing transition into dispatchExit under m_toWorker.lock, caches m_parentBunVM in Worker.h, updates the lifecycle state-diagram docs, adds a 4-case regression test, and — unrelated to the headline fix — rewrites test-http-should-emit-close-when-connection-is-aborted.ts to use a raw net.Socket instead of fetch()+AbortController to deflake on Windows. It also adds one LSE outline-atomic to the aarch64 allowlist.
Security risks
None identified. The change is internal event-loop liveness bookkeeping; no auth, crypto, parsing, or untrusted-input handling is touched. The worst failure modes are correctness (process hang or premature exit), not security.
Level of scrutiny
High. This is cross-thread C++ touching the Worker lifecycle state machine and the parent VM's concurrent_ref counter, where an off-by-one in either direction means a hung process or a premature exit. The PR went through ~20 automated review rounds that surfaced and fixed real races (TOCTOU on ClosingFlag, double-unref on the post-online path, stale-counter underflow, append-after-snapshot in dispatchExit, worker-thread deref of the parent context). The final design is coherent, but two limitations are explicitly left open and documented in-code: (a) m_parentBunVM is a raw cached pointer that dangles in the nested-worker case once the middle worker's arena is freed — pre-existing gap, but this PR adds new dereference sites; (b) a narrow post-online window where a successfully-posted task can be dropped during VM teardown. Both are acknowledged in the thread as accepted tradeoffs, which is a judgment call a human should ratify.
Other factors
All inline review threads are resolved and the latest bug-hunt pass found nothing new. CI failures on the head commit are the chronic fetch-http2-client WebKit flake and musl build agents per the retrigger commit messages, not this change. The bundled test-http-should-emit-close-when-connection-is-aborted.ts rewrite is reasonable on its own but is scope creep relative to the PR title and would ideally be confirmed as intentional. Given the subtlety of the invariant (every refEventLoop balanced exactly once across enqueue/drain/dispatchExit) and the documented residual gaps, this is not a change I'm comfortable shadow-approving without a human sign-off.
There was a problem hiding this comment.
No new issues found, but this is cross-thread event-loop ref-counting in Worker.cpp with several acknowledged residual edge cases (nested-worker VM lifetime, narrow post-online race) — worth a human pass before merge.
Extended reasoning...
Overview
This PR changes how worker.unref() interacts with in-flight messages by adding per-message parent event-loop ref/unref accounting in src/jsc/bindings/webcore/Worker.cpp / Worker.h. It introduces a cached m_parentBunVM pointer, calls Bun__eventLoop__incrementRefConcurrently from the worker thread, moves the State::Closing transition from the parent-thread close lambda into dispatchExit() on the worker thread under m_toWorker.lock, and adds queue-drain compensation for leaked refs. It also adds a 4-case regression test, rewrites an unrelated flaky http-close test to use net.Socket instead of fetch(), and adds __aarch64_swp4_relax to the LSE allowlist.
Security risks
None identified. No auth/crypto/permissions surface; the change is internal event-loop liveness bookkeeping. The cached raw m_parentBunVM pointer is a documented UAF surface in the nested-worker case, but that's a pre-existing lifetime gap the PR explicitly notes rather than a new exposure of untrusted data.
Level of scrutiny
High. This is production-critical cross-thread concurrency in the worker_threads message path, where an off-by-one ref either hangs the parent process or causes premature exit. The PR went through ~20 review rounds fixing TOCTOU races, double-unrefs, ref leaks on failure paths, and a GC-reachability regression — and the author has explicitly chosen to document (rather than fix) two residual edge cases: the nested-worker parent-VM-teardown UAF and a narrow post-online drop window. Those tradeoffs, plus the relocation of the Closing state transition across threads (which affects hasPendingActivity() / GC reachability timing), deserve a human sign-off.
Other factors
All prior inline comments are resolved and the current bug-hunting pass found nothing new. CI on the latest commit shows musl build failures and a known-flaky fetch-http2-client assert that the branch's own retrigger commits attribute to infra, not this change. The PR also bundles an unrelated test rewrite (test-http-should-emit-close-when-connection-is-aborted.ts) and an allowlist addition, which a reviewer may want to confirm are intentional riders.
b82dd49 to
38faeaf
Compare
There was a problem hiding this comment.
No new issues found in this revision, but this is intricate cross-thread event-loop ref-counting in Worker.cpp that went through ~20 review iterations with a couple of acknowledged residual edge cases (post-online drop race, nested-worker VM lifetime), and the diff also bundles an unrelated rewrite of test-http-should-emit-close-when-connection-is-aborted.ts plus musl build failures in CI — worth a human pass before merge.
Extended reasoning...
Overview
The substantive change is in src/jsc/bindings/webcore/Worker.cpp / Worker.h: per-message refEventLoop/unrefEventLoop accounting across enqueueToWorker, enqueueToParent, drainInbox, and dispatchExit, plus moving the State::Closing transition under m_toWorker.lock on the worker thread and caching m_parentBunVM for cross-thread atomic ref/unref. A new regression test (test/regression/issue/28643.test.ts) covers four unref scenarios. The remaining .rs hunks are pure rustfmt whitespace from autofix.ci. There is also an unrelated, undocumented rewrite of test/js/bun/test/parallel/test-http-should-emit-close-when-connection-is-aborted.ts (switching from fetch()+AbortController to a raw net.Socket).
Security risks
None identified. No auth, crypto, permissions, or untrusted-input parsing is touched. The closest risk surface is the documented nested-worker UAF on m_parentBunVM (a pre-existing lifetime gap this PR widens slightly), which is a robustness concern rather than a security boundary.
Level of scrutiny
High. This is lock-protected cross-thread state with at least three interacting actors (parent thread, worker thread, VM teardown), and getting the ref balance wrong in either direction hangs or prematurely exits user processes. The PR history shows multiple genuine races found and fixed across iterations, and the author has explicitly acknowledged two remaining narrow windows as out of scope. That judgment call — whether the residual races are acceptable — should be made by a maintainer, not a bot.
Other factors
- CI build #54980 reports build failures on x64-musl / aarch64-musl / x64-musl-baseline at the current head.
- The HTTP test rewrite is not mentioned in the PR description or any review comment; it may be an intentional flaky-test fix bundled in during a rebase, but a human should confirm it belongs here.
- All prior inline review threads (mine and CodeRabbit's) are marked resolved, and the most recent fix (e4d0b17) restored the #53748 guard that a rebase had dropped.
There was a problem hiding this comment.
All my inline comments are now resolved and the bug hunter found nothing new on the latest revision, but the cross-thread event-loop ref-counting in Worker.cpp/Worker.h (with its acknowledged nested-worker edge case) and the unrelated ride-along changes (test-http-should-emit-close-when-connection-is-aborted.ts, bake-codegen.ts) are substantial enough that a human should sign off.
Extended reasoning...
Overview
The core change is per-message parent event-loop ref/unref accounting in src/jsc/bindings/webcore/Worker.cpp and Worker.h: enqueueToWorker/enqueueToParent now ref the parent VM's event loop per message, drainInbox unrefs per dispatch, and dispatchExit snapshots and unwinds any leftover refs under m_toWorker.lock while transitioning to State::Closing. A new m_parentBunVM is cached at construction so worker-thread call sites can ref/unref without dereferencing the parent's ScriptExecutionContext. The State::Closing transition moved from the parent-thread close lambda to the worker thread under the inbox lock, and the header's state-diagram comments were updated to match. A new regression test (test/regression/issue/28643.test.ts) covers four worker.unref() scenarios.
The diff also carries several changes unrelated to the PR title: a rewrite of test-http-should-emit-close-when-connection-is-aborted.ts from fetch()+AbortController to raw net.Socket, a JSON.stringify wrapper in bake-codegen.ts, and pure rustfmt reflows of multi-line #[cfg(...)] attributes across ~10 .rs files.
Security risks
None identified. The change does not touch auth, crypto, permissions, or untrusted input parsing. The new Bun__eventLoop__incrementRefConcurrently extern is an existing Zig symbol; the cached m_parentBunVM raw pointer is documented as relying on the parent VM outliving the worker (true for main-thread parents; the nested-worker case is explicitly noted as a pre-existing known gap).
Level of scrutiny
High. This is hand-rolled cross-thread lifecycle/ref-count logic in a runtime-critical path (every Worker.postMessage), and the PR went through ~20 review rounds over six weeks fixing successive TOCTOU/double-unref/leak races — several of which were only caught by careful interleaving analysis, not by tests. Two edge cases remain explicitly acknowledged-but-deferred in the code/timeline: (a) the nested-worker m_parentBunVM lifetime gap (documented at the extern declaration and dispatchExit's posting-failure branch), and (b) the narrow post-online postTaskTo-succeeds-then-task-dropped window. A human reviewer with context on Bun's worker/VM lifecycle should confirm those tradeoffs are acceptable.
Other factors
- CI build #54997 on the head commit
5e12a257reports failures, includingbuild failedon 🐧 x64-musl / x64-musl-baseline. Most of the listed items are pre-existing lint/deprecation warnings in files this PR doesn't touch, but the musl build failures should be confirmed as unrelated before merge. - The unrelated ride-alongs (HTTP test rewrite,
bake-codegen.ts, rustfmt reflows) appear in separate commits with their own messages, so they're intentional, but they're not described in the PR body and a reviewer should be aware they're in scope. - All 20+ of my prior inline comments are marked resolved, including today's catch of a rebase regression that reverted the #53748 null-guard in
createNodeWorkerThreadsBinding(restored in e4d0b17).
When worker.unref() is called before postMessage(), the parent event loop exits before receiving the worker's response. Node.js keeps the parent alive while a message is in flight; Bun did not. Cause: unref() drops the worker's polling ref on the parent event loop. A message posted after that drop had nothing keeping the loop alive for the round-trip — the process exited silently with the message queued but never dispatched. Fix: take a per-message refEventLoop when a message is queued in enqueueToWorker / enqueueToParent, balance it with an unref in drainInbox after each dispatch, and clean up any stragglers in dispatchExit. The ref targets the parent VM directly via cached m_parentBunVM + Bun__eventLoop__incrementRefConcurrently, so worker- thread code never dereferences the parent ScriptExecutionContext (which could be freed concurrently). Also moves the State::Closing transition into dispatchExit() on the worker thread under m_toWorker.lock, so enqueueToWorker on the parent observes the flag and silently drops new messages without taking a ref it would never release. Regression test in test/regression/issue/28643.test.ts covers four cases via test.concurrent.each: SharedArrayBuffer round-trip, string round-trip, worker fails to start, and no-messages exit. Also bundled: deflake test/js/bun/test/parallel/test-http-should-emit-close-when-connection-is-aborted.ts by switching from fetch()+AbortController to net.Socket. fetch() abort hops to the HTTP client thread before the socket is closed, and on Windows that cross-thread close races the server socket's AFD poll re-submission — the disconnect is never observed and the test times out. A net.Socket lives on the same event loop and refs it while connected, so destroy() is observed deterministically. Fixes #28643.
rebase onto origin/main (4d443e5) accidentally reverted the `createNodeWorkerThreadsBinding` hunk that #30412 (Rewrite Bun in Rust) added in 23427db: the `if (serialized)` null-guard around `serialized->deserialize`, the `dynamicDowncast<JSArray>` + `ASSERT_NOT_REACHED` else branch, and the #53748 comment. Those guards exist because `options.workerDataAndEnvironmentData` is moved-from on the first call — a lazy-init re-entry finds `serialized` null, `->deserialize` dereferences null (UB → garbage), and the `uncheckedDowncast` below SIGTRAPs on darwin. Unrelated to the `worker.unref()` fix; restored verbatim from main.
…BADF), all retried, 280/282 jobs green
…bootstrap bun The pre-Rust-rewrite bootstrap bun (before #30679 fixed the JSON lexer to tokenize `?`/`*`/`(`/`)`) aborts `Lexer::init` on CSS operators, before the `parse_env_json` auto-quote fallback has a chance to run. Explicitly JSON.stringify the CSS result so the value is already valid JSON by the time `define` sees it.
c7d2994 to
0825090
Compare
There was a problem hiding this comment.
No further issues found from automated review. This still warrants a human pass: it reworks cross-thread event-loop ref-counting in Worker.cpp (with a couple of acknowledged-but-deferred edge cases — the nested-worker m_parentBunVM lifetime and the narrow post-online drop window), and the branch has picked up two unrelated changes (src/codegen/bake-codegen.ts and the test-http-should-emit-close-when-connection-is-aborted.ts rewrite) that aren't covered by the PR description.
Extended reasoning...
Overview
This PR fixes #28643 by adding per-message parent-event-loop ref/unref accounting to Worker::enqueueToWorker, enqueueToParent, drainInbox, and dispatchExit in src/jsc/bindings/webcore/Worker.cpp, plus a cached m_parentBunVM pointer in Worker.h, updated state-machine docs, and a four-case regression test. It also bundles two unrelated changes: a JSON.stringify wrapper around OVERLAY_CSS in src/codegen/bake-codegen.ts, and a rewrite of test-http-should-emit-close-when-connection-is-aborted.ts from fetch()+AbortController to a raw net.Socket.
Security risks
None identified. The change is internal event-loop liveness bookkeeping; no new attack surface, auth, crypto, or untrusted-input parsing is touched.
Level of scrutiny
High. This is core runtime concurrency code — cross-thread atomics, lock ordering, and ref-count balancing across worker startup/teardown/terminate paths. The PR went through ~20 automated review rounds, each surfacing real ref-leak / double-unref / TOCTOU races that were subsequently fixed; that history alone indicates the invariant is subtle enough to merit a human reading of the final shape rather than the incremental diffs.
Other factors
- Two edge cases were explicitly acknowledged and deferred rather than fixed: (a) the nested-worker case where
m_parentBunVMcan dangle after the middle worker's arena is freed (now documented in comments as a known gap), and (b) a narrow post-onlinepostTaskTo-succeeds-then-task-dropped window. A human should confirm those tradeoffs are acceptable. - The
State::Closingtransition moved from the parent-thread close lambda to the worker thread underm_toWorker.lock, which changes whenhasPendingActivity()/wasTerminated()flip — worth a sanity check against GC-reachability expectations. - The
bake-codegen.tsand HTTP-abort-test changes are out of scope for the PR title/description and should either be split out or noted in the description. - All previously-raised inline threads (including the accidental revert of the #53748 null-guard during rebase) are resolved in the current diff, and the bug-hunting system found nothing new on this revision.
Moving State::Closing into worker-thread dispatchExit() (under m_toWorker.lock) widened the wasTerminated() window: a final `parentPort.postMessage(result)` followed by worker exit now produces this parent-thread interleaving — T1 (drainToParent: message event) T2 (close task) — with State::Closing set between them. Inside the user's 'message' handler for T1, the worker is still logically live (the parent is actively dispatching its messages), but wasTerminated() read true and reported threadId === -1, inspector terminated === true, and silently no-op'd worker.ref()/unref(). Pre-PR, Closing was set inside T2 on the parent thread (FIFO-after T1) so message handlers always saw Running. Gate wasTerminated() and setKeepAlive() on `== Closed` instead. The enqueueToWorker inbox-close check keeps `>= Closing` — its intent is 'worker thread has stopped draining', which Closing correctly signals.
d861fe9 changed wasTerminated() to `== Closed` to keep the final 'message' handler from observing a terminated worker. That fixed the trailing-message window but broke the reverse case: test-worker- safe-getters.js expects `w.threadId === -1` inside the 'exit' handler — at which point State is still Closing (Closed is stored AFTER the close event dispatches). Restore the pre-PR State::Closing semantics (transition stays parent-thread-only, inside the close-task lambda, FIFO-after any trailing drainToParent tasks) and add a separate `MessageInbox::closed` flag for the enqueueToWorker gate. dispatchExit sets `m_toWorker.closed` under m_toWorker.lock on the worker thread — same synchronization as before, but orthogonal to the user-visible State. This is the cleanest design: State::Closing means 'close event is dispatching on the parent' and is parent-thread-observable; the inbox flag means 'worker thread has stopped draining' and is worker-thread-settable. Neither contract leaks into the other.
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🔴
src/jsc/bindings/webcore/Worker.h:123-128— ChangingwasTerminated()to== State::Closedovershoots: the close lambda dispatches the native'close'event whilem_state == Closingand only storesClosedafterdispatchEventreturns, so inside the user's'exit'handlerwasTerminated()is now false andworker.threadIdreturns the real id instead of-1. This failstest/js/node/test/parallel/test-worker-safe-getters.js:27(assert.strictEqual(w.threadId, -1)) and contradicts the still-present design comments at Worker.h:85-88 / Worker.cpp:617-620. The earlier review explicitly flagged this trade-off for option (b); option (a) — a separate inbox-closed flag for theenqueueToWorkergate, leavingm_stateparent-thread-owned — avoids it.Extended reasoning...
What the bug is. Commit d861fe9 changed
wasTerminated()fromm_state.load() >= State::Closingtom_state.load() == State::Closedto fix the trailing-'message'-handler case raised in the resolved review thread (#3265015298). That fix overshoots in the other direction: it breaks Node.js semantics inside the'exit'handler, whereworker.threadIdis required to read-1.The specific code path.
dispatchExit()(Worker.cpp:605) setsm_state = State::Closingon the worker thread underm_toWorker.lock, then posts the close lambda to the parent.- The close lambda (Worker.cpp:614–634) dispatches the native
'close'event at lines 623–626 whilem_state == Closing;State::Closedis only stored at line 628, afterdispatchEventreturns. worker_threads.ts:259listens for the native'close'event and synchronously re-emits'exit'(#onClose → this.emit("exit", e.code), worker_threads.ts:352–354).- Inside the user's
'exit'handler,worker.threadId(worker_threads.ts:275–276) →JSWorker.cpp:387(if (worker.wasTerminated()) return jsNumber(-1)) →wasTerminated()→Closing == Closed→ false → returns the real thread id, not-1.
Why existing code does not prevent it. The new comment on
wasTerminated()says it's gated onClosedso the trailing'message'handler still sees a live worker. That's correct for the'message'case but ignores that the'close'/'exit'event is also dispatched while the state is stillClosing. Two stale design comments confirm the regression was unintentional: Worker.h:85–88 ("Closing exists so that inside the 'close'/'exit' handler threadId reads -1 and isOnline() is false") and Worker.cpp:617–620 ("The split lets 'close'/'exit' handlers observe threadId == -1 and isOnline() == false") — both still describe the now-broken behavior. The same applies tosetKeepAlive, which was also changed to gate on== Closed.Step-by-step proof (test-worker-safe-getters.js).
const w = new Worker(...); w.on('exit', () => { assert.strictEqual(w.threadId, -1); });— the worker exits viaprocess.exit(0)(noterminate()), som_terminateRequestedis false andwasTerminated()depends solely onm_state.- Worker thread:
dispatchExit()→m_state = Closing→ posts close lambda. - Parent thread: close lambda runs →
dispatchEvent(closeEvent)whilem_state == Closing. worker_threads.ts#onClosefires synchronously inside that dispatch →this.emit('exit', e.code).- User's
'exit'handler:w.threadId→ native getter →worker.wasTerminated()→Closing == Closed→ false → returnsclientIdentifier()-1(the real id). assert.strictEqual(w.threadId, -1)fails.- Pre-d861fe97,
wasTerminated()checked>= Closing, so step 5 returnedtrueand the assertion passed. (robobun's earlier local verification of this test was on e24a506, before d861fe9 landed.)
Impact.
test/js/node/test/parallel/test-worker-safe-getters.js:27regresses (it is not in any skip list). More generally, any code that relies on Node's documentedthreadId === -1/ inspector-terminated === trueinside an'exit'handler now observes the worker as still alive. This is exactly the trade-off the resolved review thread warned about for option (b): "slightly changes threadId semantics inside the 'close' handler itself (it would read the real id instead of -1), which may or may not be acceptable" — the Node parallel test shows it is not.How to fix. Two options that preserve both the trailing-
'message'fix and the'exit'-handler semantics:- Option (a) from the original review: add a separate inbox-closed flag (e.g.
bool MessageInbox::closed) set underm_toWorker.lockindispatchExitand checked under the same lock inenqueueToWorkerfor the ref-leak serialization, and restorem_state.store(State::Closing)to the parent-thread close lambda sowasTerminated()can keep>= Closing. This gives both'message'-handler-sees-Running and'exit'-handler-sees-terminated. - Alternatively, store
State::ClosedbeforedispatchEvent(closeEvent)(and adjust thedispatchEvent()Closed-suppression gate to allow the close event through), but that conflates Closing/Closed and would need the Worker.h state-diagram comments updated.
Either way, the comments at Worker.h:85–88 and Worker.cpp:617–620 should be brought back into agreement with whatever
wasTerminated()ends up doing.
|
(re: claude[bot] review 4317723031) — yes, as the review notes, the PR was updated during analysis. 6fa3f35 implements your recommended Option A: restored |
Stray 'Both are' left over from an edit collision — the sentence had
two finite verbs ('are' / 'would let') with no conjunction. Rewrite
to say what it meant: storing Closing on the worker thread between
(1) drainToParent and (2) the close task would let (1)'s message
event observe `wasTerminated()==true`, so keep the store inside
(2)'s lambda on the parent thread.
…30679 bootstrap bun" This reverts 0825090. Main's JSON lexer has handled CSS-in-define via the auto-quote fallback since #30679; the workaround is no longer needed and the existing in-tree references (bun-build-api.test.ts, json_lexer.rs) cite bake-codegen's unquoted OVERLAY_CSS as the motivating example for that path — keeping the wrapper made those citations stale. Unrelated to the worker.unref() fix.
|
|
||
| // Use a raw net.Socket for the client rather than fetch() + AbortController. | ||
| // fetch() abort hops to the HTTP thread before the socket is closed, and on | ||
| // Windows that cross-thread close can race the server socket's AFD poll | ||
| // re-submission, so the disconnect is never observed and the test times out. | ||
| // A net.Socket lives on the same event loop as the server and refs it while | ||
| // connected, so destroy() is observed deterministically. The server-side | ||
| // behavior under test (IncomingMessage "close" + aborted=true on client | ||
| // disconnect) is independent of which client tore down the connection. | ||
| const socket = net.connect({ port: server.address().port, host: "127.0.0.1" }); | ||
| await once(socket, "connect"); | ||
| socket.write("GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"); | ||
|
|
There was a problem hiding this comment.
🟡 This rewrite (fetch+AbortController → raw net.Socket, dropping server.unref()) is a Windows AFD-poll flake fix per its own inline comment and has no connection to the worker.unref()/pending-message change — it touches no worker code, no event-loop ref accounting, and isn't mentioned in the PR description. There's direct precedent in this PR for splitting out-of-scope hunks (the bake-codegen.ts change was reverted in 3dbff26 with "Out of scope for this PR anyway"); consider doing the same here, or at minimum mention it in the PR description. Not blocking — test-only flake fix.
Extended reasoning...
What the issue is. This hunk rewrites test-http-should-emit-close-when-connection-is-aborted.ts from fetch()+AbortController to a raw net.Socket, and removes the server.unref() call. The inline comment explicitly attributes the rewrite to a Windows-specific race: "fetch() abort hops to the HTTP thread before the socket is closed, and on Windows that cross-thread close can race the server socket's AFD poll re-submission, so the disconnect is never observed and the test times out." That is an http.Server/fetch() cross-thread-abort flake fix — it has nothing to do with the worker.unref()/pending-message bug (#28643) this PR is about.
Why it's verifiably out of scope. (1) The file references no Worker/worker_threads code at all. (2) The PR's runtime changes are confined to Worker.cpp/Worker.h — enqueueToWorker/enqueueToParent/drainInbox/dispatchExit ref-accounting — which share no code path with http.Server.unref() (a uv_unref on a TCP handle), fetch(), or net.Socket. (3) The PR title, description, and the entire 25+-round review timeline never mention this file, and CodeRabbit's auto-generated changed-files summary and "Out of Scope Changes check" both ran before this hunk was added. (4) git log shows the file was last touched by an unrelated Bun.serve/http3 PR before being modified here.
Precedent within this same PR. Inline comment 3265977283 flagged the bake-codegen.ts hunk as unrelated to the worker.unref() fix, and the author reverted it in 3dbff26 with the message "Out of scope for this PR anyway." This hunk is the same class of issue: a tangential change bundled into a long-running PR (likely to keep CI green during the ~7-week review cycle), with no mention in the PR description.
Step-by-step.
- PR title/description: "Fix worker.unref() causing pending messages to be lost" —
Worker.cppevent-loop ref accounting for cross-threadpostMessage. - This hunk's stated rationale (lines 9–17 of the new file): Windows AFD poll re-submission race with
fetch()'s HTTP-thread abort hop. No worker, nounref(), no event-loop ref accounting. - Code paths touched by the PR's runtime changes:
Worker::enqueueToWorker,Worker::enqueueToParent,drainInbox,dispatchExit,MessageInbox. None of these are reachable fromhttp.createServer,net.connect, orfetch(). - Therefore the test rewrite cannot be exercising or validating anything this PR changed.
Impact. None on correctness — it's a test-only flake fix with a thorough inline rationale. The concern is purely scope/hygiene: it makes the PR harder to review and bisect, and CodeRabbit's pre-merge "Out of Scope Changes" check passed only because it ran before this file was added. It's reasonable to bundle a CI-flake fix encountered during a long-running PR's many CI rounds, but the bar set within this same PR (3dbff26) was to split such hunks out.
How to fix. Either revert this hunk and land it as its own PR (matching the bake-codegen.ts treatment), or — if keeping it bundled to avoid another CI round — add a sentence to the PR description noting the unrelated Windows http test flake fix so future bisects/blame don't have to puzzle over why a worker.unref() PR rewrote an http abort test.
Problem
Calling
worker.unref()beforeworker.postMessage()causes the parent event loop to exit before the worker can process the message and send a response back. This meansparentPort.postMessage()responses are silently dropped.Closes #28643
Reproduction
Root Cause
When
worker.unref()is called, the parent event loop'sactivecounter is decremented to zero. The event loop then checksisEventLoopAlive()before processing concurrent tasks from worker threads, finds nothing keeping it alive, and exits immediately — even if the worker has messages in-flight or hasn't processed the parent's message yet.The concurrent task queue used for cross-thread message delivery does not contribute to the event loop's liveness check unless
concurrent_refis explicitly incremented.Fix
Two ref/unref pairs in
Worker.cpp:Worker::postMessage(parent→worker): Ref the parent event loop before posting a message to the worker. After the worker finishes processing the message (synchronous handler completes), post a task back to the parent to unref. This keeps the parent alive while the worker handles the message.jsFunctionPostMessage(worker→parent): Ref the parent event loop before posting the response message. Unref after the message is dispatched on the parent thread. This ensures the parent stays alive to receive the worker's reply.Verification
USE_SYSTEM_BUN=1 bun test test/regression/issue/28643.test.ts→ 2 FAIL, 1 passbun bd test test/regression/issue/28643.test.ts→ 3 passAlso bundled
test/js/bun/test/parallel/test-http-should-emit-close-when-connection-is-aborted.ts— Windows flake fix unrelated to worker.unref(). The test was timing out becausefetch()'s abort hops to the HTTP client thread, and on Windows that cross-thread close races the server socket's AFD poll re-submission. Switched to a rawnet.Socketclient which lives on the same event loop as the server, sodestroy()is observed deterministically. Kept bundled because reverting it would re-break CI on this branch; noted here per review for future bisect/blame hygiene.