hot: defer reload while a rejected module is unreported#29740
Conversation
`vm.source_mappings` is a path-hash → blob table overwritten in place on every transpile. The hot-reload Task loop drains microtasks between tasks, so a watcher event that arrives after a module's eval rejects but before `reportExceptionInHotReloadedModuleIfNeeded()` prints it can run another `reload()` — which re-reads the file (possibly mid-rewrite) and overwrites the table entry. The still-unreported error is then either remapped against the wrong map (transpiled coords leak through, e.g. `:1:12`) or dropped entirely when the new pending promise replaces the old one. Two changes: - `VirtualMachine.reload()` now also defers when `pending_internal_promise` is `.rejected` and `pending_internal_promise_reported_at != hot_reload_counter`, not just when `.pending`. The deferred reload runs from `reportExceptionInHotReloadedModuleIfNeeded()` after the error is printed (and remapped against its own sourcemap). - `SavedSourceMap.putMappings()` keeps the existing entry when the incoming blob has zero mappings. A truncate-then-write rewrite can be observed as a comment-only prefix that transpiles to nothing; a 0-mapping map can never answer a lookup, so dropping it is never worse than installing it. The new test makes the window deterministic by having the hot file truncate itself to a comment-only stub immediately before throwing, guaranteeing a fresh watcher event lands between reject and report.
|
Updated 1:58 AM PT - Apr 26th, 2026
✅ @dylan-conway, your commit 52d8fbba8f7476ffad5e61dbeac24c96b6fea678 passed in 🧪 To try this PR locally: bunx bun-pr 29740That installs a local version of the PR into your bun-29740 --bun |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughImproves hot reload stability by preventing source map overwrites during transient transpile states and deferring reloads when promise rejections haven't been reported yet, with a regression test validating the fix. Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Found 2 issues this PR may fix:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
I didn't find any bugs, but this is a race-condition fix touching the hot-reload state machine in VirtualMachine.zig and sourcemap storage locking, so it's worth a human look — particularly the interaction between the new .rejected defer path and reportExceptionInHotReloadedModuleIfNeeded, and the noted rare hang on debug builds.
Extended reasoning...
Overview
This PR fixes a --hot race where a watcher event landing between a module's eval rejection and its error report would re-transpile and overwrite source_mappings[path], causing the unreported error to remap against the wrong sourcemap (or be dropped). It touches src/bun.js/VirtualMachine.zig (extends the existing hot_reload_deferred mechanism to cover .rejected-but-unreported promises), src/bun.js/SavedSourceMap.zig (skips overwriting an existing entry with a 0-mapping InternalSourceMap), and adds a regression test in test/cli/hot/hot.test.ts.
Security risks
None identified. The changes are confined to dev-time hot-reload error reporting and sourcemap bookkeeping; no auth, crypto, network, or untrusted-input parsing paths are affected.
Level of scrutiny
Moderate-to-high. While the diff is small and follows established patterns (the .rejected branch mirrors the existing pending_internal_promise_reported_at != hot_reload_counter check in reportExceptionInHotReloadedModuleIfNeeded), this is concurrency-sensitive logic in the core runtime: event-loop microtask ordering, promise state transitions, and a mutex-guarded hashtable. The putMappings change introduces a new lock/unlock cycle whose defer scoping is correct but subtle (scoped to the inner if block, so it releases before falling through to putValue). Race-condition fixes in this area benefit from review by someone familiar with the hot-reload state machine.
Other factors
The author explicitly flags that the new self-rewriting test "occasionally exposes a separate, pre-existing edge case" causing rare hangs on debug builds under load — a reviewer should weigh whether that's acceptable to land as-is or needs a skip/follow-up first. The bug-hunting system found no issues, the PR has a targeted regression test that passes on the new build and fails on system bun, and zig:check-all is clean. No CODEOWNERS cover these files.
There was a problem hiding this comment.
I didn't find any bugs, but this touches core hot-reload event-loop ordering and sourcemap lifetime in VirtualMachine.zig, and the author flagged that the new self-rewriting test can occasionally hang under load — worth a human sanity check on the deferral invariants before merging.
Extended reasoning...
Overview
This PR modifies src/bun.js/VirtualMachine.zig (the reload() deferral logic), src/bun.js/SavedSourceMap.zig (putMappings() early-return when the incoming map has zero mappings), and adds a regression test in test/cli/hot/hot.test.ts. The runtime change extends an existing deferral mechanism so that a watcher-triggered reload waits not just for a .pending promise but also for a .rejected-but-not-yet-reported one, ensuring the rejection is remapped against its own sourcemap before reloadEntryPoint() overwrites source_mappings[path].
Security risks
None identified. The change is confined to the --hot dev-mode reload path and the in-process sourcemap table; there is no new I/O surface, no auth/crypto/permissions code, and no user-controlled input beyond what --hot already reads from disk.
Level of scrutiny
This deserves a human look. The logic itself is small and traces correctly against reportExceptionInHotReloadedModuleIfNeeded() (line 718): after the deferred reload is picked up, reported_at == counter so the second reload() call proceeds rather than re-deferring, and hot_reload_counter is incremented afterward — so no defer loop. The putMappings guard reads mappingCount() only after checking len >= header_size (32 bytes) and dupes nothing on early return, so no leak. But this is core runtime event-loop ordering with subtle invariants (generation counters vs. promise pointers, microtask draining between watcher tasks), and regressions here would manifest as dropped errors or stuck reloads in dev mode rather than test failures.
Other factors
- The PR description explicitly notes the new test "occasionally exposes a separate, pre-existing edge case" that can hang on debug builds under parallel load. Even if independent of this fix, that's a CI-flakiness risk a maintainer should consciously accept.
- The robobun CI comment shows
bun-main-entry-point.test.tsfailing across all platforms on the pre-merge commit; main has since landed a fix for that test (ad5d333) and this branch was merged with main, but it's worth confirming the latest build is green. - The
putMappingszero-mapping guard is a defensive belt-and-suspenders change in addition to the primary deferral fix; the "never worse than installing it" reasoning is sound but it does change behavior for any caller that legitimately transpiles an empty file over a previously non-empty one (the old map lingers). That seems acceptable for the stated use case but is the kind of edge a maintainer should sign off on.
My previous attempt guarded the defer on `hasAnyHandleWork()`, but that
was a no-op on Windows in the --hot path: once `loadEntryPoint` breaks
out, the outer main loop enters `tickPossiblyForever` which installs a
4-minute `forever_timer` that's a uv_timer_t held on the same loop —
which makes `isActive()` report alive forever. The reload's watcher
task therefore always observed a 'live' loop and kept deferring, and
the reload never fired on Windows (Linux was fine because the timing
worked out differently). CI caught this on all three Windows lanes.
Switch to the actual question the defer is asking: 'is the C++ loader
chain still in flight on the microtask queue?' Drain microtasks and
re-check status:
- If the chain was mid-flight, the drain runs it to completion and
the promise transitions to `.fulfilled` or `.rejected` — same
outcome as the previous defer-then-report cycle would have
produced, one tick earlier.
- If it's still `.pending` after the drain, nothing else on the
microtask queue can settle it. Whatever's keeping it pending is
external (an unsettled TLA, a pending timer, etc.) and the
deferral would be permanent because
`reportExceptionInHotReloadedModuleIfNeeded` early-returns on
`.pending` before consuming `hot_reload_deferred`. Proceed.
The `.rejected` arm (sourcemap race from #29740) is preserved. The
hot tests — including `should not remap against a stale sourcemap
after a partial-file reload` — still pass.
The hot.test.ts 'should work with sourcemap generation' flake hit ERROR bucket on build 51604 instead of WARNING bucket — but this is a well-documented global flake, not PR-specific: - #28645 retriggered 3x ('seen 5+ times, unrelated to PR') - #29945, #30018, #30026 all retriggered for the same test - #29587 build 51335: 'windows-11-aarch64 shard 6/8: hot.test.ts — known hot-reload flake' - #29735 open: 'test(hot): use atomic rename for 2MB rewrites in sourcemap-generation test' — test-level workaround - #29740 merged: 'hot: defer reload while a rejected module is unreported' — prior runtime fix attempt Local reproduction: ~5% flake rate at my branch tip (e86f538), at 5215594 (before drain-in-reload), and at 0ce2096 (before the forever-timer carve-out). Rate is identical across all three, which means the flake is inherent to the test (qemu-aarch64 agent slowness hitting the 10s timeout before 50 hot-reload cycles complete) and not caused by any change on this PR. Also same plain-old test-http-should-emit-close flake on 2019 x64 baseline lanes (the one I've been retrying since build 51163).
Hot reload 'should work with sourcemap generation' flakes at ~5% on windows-11-aarch64 under qemu: the 10s per-test timeout fires before 50 hot-reload cycles complete, independent of this PR. Cross-PR evidence from the last 24h: - #28645 retriggered 3 times for the same test - #29945, #30018, #30026 all retriggered - #29587 build 51335: documented as 'known hot-reload flake' - #29735 open: test-level workaround in progress - #29740 merged: runtime fix for a related --hot sourcemap race Local repro: identical ~5% rate at my tip (e86f538), at 5215594 (before the stranded-defer fix), and at 0ce2096 (before the forever-timer carve-out). Rate unchanged across the three, confirming the flake is not in this PR's footprint. Same persistent test-http-should-emit-close flake on 2019 x64-baseline (seen in 51163, 51423, 51448, 51473, 51503, 51528, 51604).
The 'should not remap against a stale sourcemap after a partial-file reload' test exercises a deterministic reproduction of the race #29740 fixed, but the author of that PR explicitly noted the self-rewriting fixture occasionally exposes a separate pre-existing race ('reproduces on released bun too') that causes the test to time out on contended CI runners. This flake was blocking gate checks on PRs that otherwise didn't touch the sourcemap path (~20% flake rate on release builds under load). Add retry: 3 until the underlying race in #29740 is addressed. The retried test still asserts all 20 error-reload cycles complete.
With a non-zero coalesce interval the drain loop can race the self-rewriting hot file in a way that loses the pending rejection (the pre-existing edge case noted in #29740), which on release builds manifests as a 30 s timeout. A zero interval makes the drain loop non-blocking — it polls once and processes whatever is already buffered — which is the closest analogue to the pre-loop behaviour this test was tuned against, and keeps the self-write event landing in the reject→report window on all build profiles.
## What Runtime fix for the `--hot` sourcemap race that oven-sh#29735 works around at the test level. Two changes: - **`VirtualMachine.reload()`** now also defers when `pending_internal_promise` is `.rejected` but its error hasn't been printed yet (`pending_internal_promise_reported_at != hot_reload_counter`), not just when `.pending`. The deferred reload runs from `reportExceptionInHotReloadedModuleIfNeeded()` *after* the error is remapped and printed against its own sourcemap. - **`SavedSourceMap.putMappings()`** keeps the existing table entry when the incoming `InternalSourceMap` has zero mappings. A 0-mapping map can never answer a lookup, so dropping it is never worse than installing it; this defends against any other path that re-transpiles a comment-only partial read. ## Why `vm.source_mappings` is a path-hash → blob table overwritten in place on every transpile. The event-loop tick drains microtasks between tasks, so a watcher event that arrives after a module's eval rejects but before `reportExceptionInHotReloadedModuleIfNeeded()` prints it can run another `reload()` — which re-reads the file (possibly mid-rewrite, since a 2MB `writeFileSync` is truncate + several `write()`s) and overwrites the table entry. The still-unreported error is then either remapped against the wrong map (transpiled coords leak through, e.g. `:1:12` when the source is line 1003) or dropped entirely when the new `pending_internal_promise` replaces the old one. This was flaking on aarch64 in `should work with sourcemap generation` (see oven-sh#29735) and can also affect users whose editors save non-atomically. ## Test The new test in `test/cli/hot/hot.test.ts` makes the window deterministic: the hot file truncates itself to a comment-only stub immediately before throwing, guaranteeing a fresh watcher event lands between reject and report. - [x] `bun bd test test/cli/hot/hot.test.ts -t "should not remap against a stale sourcemap"` — 1 pass, 41 expect() calls (20 iterations) - [x] `USE_SYSTEM_BUN=1 bun test test/cli/hot/hot.test.ts -t "should not remap against a stale sourcemap"` — fails (the error is dropped and the test times out waiting for it) - [x] `bun bd test test/cli/hot/hot.test.ts -t sourcemap` — all 4 sourcemap tests pass - [x] `bun run zig:check-all` — clean **Note:** the self-rewriting hot file in the new test occasionally exposes a separate, pre-existing edge case where an entry that rewrites itself mid-eval and then throws can lose the error (reproduces on released bun too). I observed this as a rare hang on debug builds when running under heavy parallel load; it did not reproduce in serial `bun bd test` runs. If it surfaces in CI it's worth a follow-up — the root cause is independent of this change.
With a non-zero coalesce interval the drain loop can race the self-rewriting hot file in a way that loses the pending rejection (the pre-existing edge case noted in #29740), which on release builds manifests as a 30 s timeout. A zero interval makes the drain loop non-blocking — it polls once and processes whatever is already buffered — which is the closest analogue to the pre-loop behaviour this test was tuned against, and keeps the self-write event landing in the reject→report window on all build profiles.
With a non-zero coalesce interval the drain loop can race the self-rewriting hot file in a way that loses the pending rejection (the pre-existing edge case noted in #29740), which on release builds manifests as a 30 s timeout. A zero interval makes the drain loop non-blocking — it polls once and processes whatever is already buffered — which is the closest analogue to the pre-loop behaviour this test was tuned against, and keeps the self-write event landing in the reject→report window on all build profiles.
With a non-zero coalesce interval the drain loop can race the self-rewriting hot file in a way that loses the pending rejection (the pre-existing edge case noted in #29740), which on release builds manifests as a 30 s timeout. A zero interval makes the drain loop non-blocking — it polls once and processes whatever is already buffered — which is the closest analogue to the pre-loop behaviour this test was tuned against, and keeps the self-write event landing in the reject→report window on all build profiles.
With a non-zero coalesce interval the drain loop can race the self-rewriting hot file in a way that loses the pending rejection (the pre-existing edge case noted in #29740), which on release builds manifests as a 30 s timeout. A zero interval makes the drain loop non-blocking — it polls once and processes whatever is already buffered — which is the closest analogue to the pre-loop behaviour this test was tuned against, and keeps the self-write event landing in the reject→report window on all build profiles.
With a non-zero coalesce interval the drain loop can race the self-rewriting hot file in a way that loses the pending rejection (the pre-existing edge case noted in #29740), which on release builds manifests as a 30 s timeout. A zero interval makes the drain loop non-blocking — it polls once and processes whatever is already buffered — which is the closest analogue to the pre-loop behaviour this test was tuned against, and keeps the self-write event landing in the reject→report window on all build profiles.
What
Runtime fix for the
--hotsourcemap race that #29735 works around at the test level. Two changes:VirtualMachine.reload()now also defers whenpending_internal_promiseis.rejectedbut its error hasn't been printed yet (pending_internal_promise_reported_at != hot_reload_counter), not just when.pending. The deferred reload runs fromreportExceptionInHotReloadedModuleIfNeeded()after the error is remapped and printed against its own sourcemap.SavedSourceMap.putMappings()keeps the existing table entry when the incomingInternalSourceMaphas zero mappings. A 0-mapping map can never answer a lookup, so dropping it is never worse than installing it; this defends against any other path that re-transpiles a comment-only partial read.Why
vm.source_mappingsis a path-hash → blob table overwritten in place on every transpile. The event-loop tick drains microtasks between tasks, so a watcher event that arrives after a module's eval rejects but beforereportExceptionInHotReloadedModuleIfNeeded()prints it can run anotherreload()— which re-reads the file (possibly mid-rewrite, since a 2MBwriteFileSyncis truncate + severalwrite()s) and overwrites the table entry. The still-unreported error is then either remapped against the wrong map (transpiled coords leak through, e.g.:1:12when the source is line 1003) or dropped entirely when the newpending_internal_promisereplaces the old one.This was flaking on aarch64 in
should work with sourcemap generation(see #29735) and can also affect users whose editors save non-atomically.Test
The new test in
test/cli/hot/hot.test.tsmakes the window deterministic: the hot file truncates itself to a comment-only stub immediately before throwing, guaranteeing a fresh watcher event lands between reject and report.bun bd test test/cli/hot/hot.test.ts -t "should not remap against a stale sourcemap"— 1 pass, 41 expect() calls (20 iterations)USE_SYSTEM_BUN=1 bun test test/cli/hot/hot.test.ts -t "should not remap against a stale sourcemap"— fails (the error is dropped and the test times out waiting for it)bun bd test test/cli/hot/hot.test.ts -t sourcemap— all 4 sourcemap tests passbun run zig:check-all— cleanNote: the self-rewriting hot file in the new test occasionally exposes a separate, pre-existing edge case where an entry that rewrites itself mid-eval and then throws can lose the error (reproduces on released bun too). I observed this as a rare hang on debug builds when running under heavy parallel load; it did not reproduce in serial
bun bd testruns. If it surfaces in CI it's worth a follow-up — the root cause is independent of this change.