Skip to content

DEBUG: instrument http2 pipe timeout on darwin (do not merge)#32002

Closed
cirospaciari wants to merge 62 commits into
mainfrom
claude/h2-pipe-debug
Closed

DEBUG: instrument http2 pipe timeout on darwin (do not merge)#32002
cirospaciari wants to merge 62 commits into
mainfrom
claude/h2-pipe-debug

Conversation

@cirospaciari

Copy link
Copy Markdown
Member

Throwaway diagnostic branch for the darwin-aarch64 timeout of test-http2-compat-serverrequest-pipe.js seen on #31991. Runs a single instrumented test on a darwin-only pipeline. Will be closed once the root cause is identified.

Jarred-Sumner and others added 30 commits June 4, 2026 20:17
Bump the compat target from Node 24.3.0 to 26.3.0:
- nodejs-headers 26.3.0, NODE_MODULE_VERSION 147, bootstrap/flake pins
- process.versions.v8 = 14.6.202.34-node.20, modules = 147
- V8 shim updated for 14.6: Isolate roots layout, flattened
  FunctionCallbackInfo exit frame, String Write/WriteOneByte/WriteUtf8 V2 +
  Utf8LengthV2 (legacy exports kept), External::New pointer-tag overload,
  Number::NewFromInt32/NewFromUint32, HandleScope::Extend/DeleteExtensions
- napi_get_value_string_* no longer forms a slice from a null buffer
  pointer when only querying the encoded length
- v8/napi test fixtures migrated to the V2 string APIs
Per https://streams.spec.whatwg.org/#readable-stream-cancel step 5, a BYOB
reader's pending read requests must be closed with {value: undefined,
done: true} when the stream is cancelled. readableStreamClose only settles
default-reader read requests, so a pending byob read() hung forever after
reader.cancel().
Port the observable behavior changes between Node v24.3.0 and v26.3.0:

streams:
- read() with no size returns one buffered chunk at a time in paused mode
- pause()/resume() are no-ops on destroyed streams; compose() moved to the
  prototype and returns the composed Duplex directly
- Duplex.from(async generator) destroy aborts with the error and unblocks a
  parked iterator; duplexPair destroy propagates to the other side
- eos()/finished(): conditional AsyncLocalStorage binding, immediate-result
  precompute for already-settled streams, real errors override AbortError in
  pipeline()
- webstreams adapters: Writable.toWeb no longer hangs on synchronous drain,
  fromWeb writev failures error the stream instead of an unhandled rejection,
  Readable.toWeb supports type 'bytes' (BYOB), Duplex.toWeb readableType
  (DEP0201 for the old alias), ArrayBuffer/SharedArrayBuffer chunk handling,
  brotli decoder errors surface as TypeError with the original code
- write(string, 'buffer') throws ERR_UNKNOWN_ENCODING

http:
- ServerResponse.writeHeader removed (DEP0063 end-of-life)
- upgrade requests with no 'upgrade' listener no longer vanish
- getHeader('set-cookie') returns undefined when absent, and
  setHeader('set-cookie', []) round-trips as a present-but-empty array
- remove dead half-ported proxy symbols in Agent, https.Agent no longer
  shadows createConnection, drain is only emitted when needed
- http2: respond() rejects raw-header arrays instead of sending garbage
  frames, request() on closed sessions uses the right error codes, option
  range errors use the options. prefix, initialWindowSize is capped

Vendored the matching upstream tests and updated the v24-era ones whose
expectations changed.
… images]

The previous commit's message described the full compat bump but the tracked
file changes were lost to a stray hard reset before committing — only the new
header made it in. This restores them:

- nodejs-headers 26.3.0 / NODE_MODULE_VERSION 147, bootstrap + flake pins,
  process.versions.v8 = 14.6.202.34-node.20
- V8 shim for 14.6: Isolate roots layout, flattened FunctionCallbackInfo exit
  frame, String Write/WriteOneByte/WriteUtf8 V2 + Utf8LengthV2 (legacy exports
  kept), External::New pointer-tag overload, Number::NewFromInt32/NewFromUint32,
  HandleScope::Extend/DeleteExtensions; export lists updated for all platforms
- napi_get_value_string_* no longer forms a slice from a null buffer pointer
  when only querying the encoded length
- v8/napi test fixtures migrated to the V2 string APIs

On top of the restoration, review fixes to the shim:

- DeleteExtensions now actually reclaims the slots Extend granted inside a
  closing scope (per-iteration v8::HandleScopes in a long native call no
  longer accumulate handles until the callback returns), with a debug assert
  against out-of-LIFO scope destruction
- WriteUtf8V2 encodes UTF-16 in chunks so the 32-bit counts from the encoder
  can't wrap for >4GB outputs, and uses view() instead of flattening ropes
- Utf8LengthV2 skips the scalar surrogate scan for valid UTF-16 (SIMD
  validate first); WriteV2/WriteOneByteV2 use vectorized copies
- FunctionCallbackInfo exit frame stays on the stack for up to 16 arguments
- version strings come from a single source of truth: REPORTED_NODEJS_V8_VERSION
  is defined from scripts/build/deps/nodejs-headers.ts and used by both
  process-report objects, alongside the existing REPORTED_NODEJS_ABI_VERSION
- _http_outgoing: drop the lazy outputData prototype accessor — reading it
  with `this` set to the prototype (spread, Object.assign, inspection)
  defineProperty'd a single shared array onto the prototype, reinstating the
  cross-instance leak it was meant to prevent, and reads on frozen instances
  threw. Like Node, the prototype now has no outputData property; methods
  lazily init for subclasses that don't chain the constructor.
- _http_outgoing: getRawHeaderNames() now reports a present-but-empty
  set-cookie under its original casing (kEmptySetCookie stores the name);
  setHeader skips the per-call toLowerCase for names that can't be set-cookie.
- ERR_HTTP2_GOAWAY_SESSION moved into the ErrorCode table instead of a
  hand-rolled Error factory. Appended at the end of the list: the table's
  discriminants are index-aligned with the checked-in Rust mirror
  (src/jsc/ErrorCode.rs). Also declared ErrorCode.ts as an input of the
  bundle-modules codegen step — error indices are baked into the bundled JS
  by replacements.ts, and an ErrorCode.ts edit previously left stale numbers
  in the bundles (errors thrown from builtins came out as the wrong code).
- CompressionStream/DecompressionStream share one buffer-source validator via
  newBufferSourceTransformPairFromDuplex instead of two verbatim copies.
- primordials: SafePromiseAllReturnVoid/ReturnArrayLike share one scheduler.
- end-of-stream: deduplicate the immediate-result callback invocation.
- node-stream.test.js: remove a byte-identical duplicate of the
  "node v26 stream semantics" describe block.
- flake.nix: fix stale Node 24 comment.
…tput [build images]

xwin's splat phase moves unpacked SDK files into place with rename(2). With
the cache under the download dir on tmpfs and /opt/winsysroot on the root
filesystem, every move fails with EXDEV ("Cross-device link"), killing the
linux image bakes. Cache next to the output instead and clean it up after.
The previous head commit came from the formatting bot, whose message lacks
the [build images] tag — the pipeline then asked for published v35 images
that don't exist yet.
…ild images]

The private S3 mirror has no v26.3.0 musl tarballs, which 403'd the Alpine
image bakes. nodejs/unofficial-builds publishes both linux-x64-musl and
linux-arm64-musl for current releases (the mirror predates arm64-musl being
available there), so pull straight from it and skip the manual re-upload on
every Node bump.
…ild images]

- process.config.variables: add enable_thin_lto=false and lto_jobs="" to
  match Node 26. Its common.gypi evaluates these in MSVS settings conditions
  and gyp hard-fails on undefined variables, breaking every node-gyp build on
  Windows (v8/napi/dlopen suites).
- The cipher streaming tests assumed read() with no size returns the whole
  buffered ciphertext; since Node 26 it returns one chunk at a time, so the
  fixtures truncated the ciphertext and OpenSSL reported BAD_DECRYPT — real
  Node 26 fails the old fixtures identically. Drain read() in a loop and add
  the second cipher data event Node 26 emits (output verified against Node
  26.3.0 byte-for-byte modulo quote style); drop the stale TODO claiming the
  readable/prefinish order was a bug.
…tions [build images]

Deliberate divergence from Node 26 (nodejs/node#62557 made pause()/resume()
early-return on destroyed streams). Legacy Readable subclasses like fd-slicer
assign `this.destroyed = true` — which hits the prototype setter on modern
streams — right before push(null). With the upstream guard, a piped
destination's 'drain' can no longer resume the source, so the last buffered
chunk is silently dropped and the pipeline never finishes. In practice that
breaks yauzl → extract-zip → puppeteer browser installs and other zip/tar
tooling, and the same hang reproduces on Node 26.3.0 itself.

Keep the Node 24 behavior of letting such streams flush their buffer, and
replace the no-op assertion test with a regression test for the fd-slicer
pattern (verified: extract-zip now extracts a full chrome-headless-shell
archive instead of stopping after the first entries).
…ed [build images]

The dev-server-puppeteer launcher already prefers a system Chromium when one
is installed (CI bootstraps one on every Linux flavor), and several CI
platforms have no Chrome for Testing build at all (linux-arm64,
windows-arm64). The unconditional download during bun install wasted ~150MB
per run and, worse, a half-finished download left in the shared agent cache
by an earlier failed run makes @puppeteer/browsers refuse every later
install ("browser folder exists but the executable is missing"). Set
PUPPETEER_SKIP_DOWNLOAD for the install step when a system browser exists or
the platform can't run the downloaded one.
…, goaway semantics [build images]

- v8: reserve the EscapableHandleScope escape slot at construction (like real
  V8) instead of allocating it inside Escape(). Allocated at Escape() time it
  sits above any in-scope inline handle grants, so the inline ~HandleScope's
  DeleteExtensions swept the just-escaped handle together with the scope. The
  reservation lives in a side registry keyed by the scope's address because
  the V8 ABI leaves exactly one usable word in the object; entries are purged
  when their buffer clears, and a reused stack address overwrites its stale
  entry. New fixture test creates inline Local copies inside the scope before
  escaping. Also: createRawHandleSlot now does its two bookkeeping steps under
  one lock.
- _http_outgoing: the headers getter/setter, appendHeader and
  getRawHeaderNames now agree with the kEmptySetCookie marker — the getter
  mirrors getHeaders(), replacing the header bag or appending a cookie drops
  the marker, and getRawHeaderNames can't report set-cookie twice.
- FetchHeaders.getRawKeys: the HTTPHeaderMap iterator only walks the common
  and uncommon segments while size() also counts set-cookie values, so the
  returned array had trailing holes whenever set-cookie was present; size for
  unique names and append "Set-Cookie" explicitly.
- http2: receiving GOAWAY now mirrors Node — NGHTTP2_NO_ERROR begins a
  graceful close() (request() then throws ERR_HTTP2_GOAWAY_SESSION), any
  other code destroys the session with ERR_HTTP2_SESSION_ERROR. The
  rejected-stream test asserted the old behavior; its expectation now matches
  what a real Node 26 client reports for the same goaway.
- tests: hoist the _http_common require to module scope, drop a redundant
  in-test require, probe the prototype's outputData directly instead of
  spreading the prototype (which invoked unrelated deprecated getters), and
  share the puppeteer skip-download env via a harness helper.
…uest [build images]

Two fallouts from mirroring Node's GOAWAY handling surfaced by CI:

- After a graceful close(), the socket dying left active streams hanging
  forever: #onClose ran stream.close(NGHTTP2_CANCEL) and then detached the
  parser, so the writable side could never finish and 'close' never fired
  (test-http2-server-shutdown-options-errors timed out). Mirror Node's
  closeSession: hard-destroy every stream that survives the cancel pass, on
  both client and server sessions.
- A request created with an already-aborted signal went through the native
  parser, which RST'd a stream the peer never saw — a connection error that
  makes conforming servers reply GOAWAY(INTERNAL_ERROR). The old code
  swallowed that session error; now that goaway destroys the session with
  ERR_HTTP2_SESSION_ERROR like Node, the noise became visible. Like Node,
  never touch the wire: create an id-less stream and destroy it with an
  AbortError on the next tick (the stream reports aborted=true and rstCode
  CANCEL but does not emit 'aborted' since the request never started).

Also for CI on Windows: pass -Denable_lto=false -Denable_thin_lto=false
-Dlto_jobs= to node-gyp in the v8/napi/dlopen test builds. A clang-cl-built
Node 26 reports thin-LTO build flags in process.config, node-gyp copies them
into addon builds, and MSVC's link.exe hard-fails on /opt:lldltojobs. The
defines are no-ops elsewhere. And when puppeteer's browser download can't be
skipped, install into a fresh per-run PUPPETEER_CACHE_DIR (forwarded to the
launcher) so a half-extracted download left in the shared agent cache by an
earlier failed run can't block the install.
The exported ~HandleScope popped the Bun scope stack and cleared its buffer
but left the isolate's HandleScopeData alone. If Extend had granted slots
from that buffer while the scope was current (an addon's inline Local::New
inside e.g. an Array::Iterate callback), next/limit kept pointing into the
cleared buffer; the next inline v8::HandleScope then snapshotted that stale
limit as prev_limit_, and its DeleteExtensions ran deleteGrantsBack against
the enclosing buffer where no grant matches a foreign address — popping every
grant, including handles of still-open outer scopes. Snapshot next/limit at
scope push (stored in the scope's buffer cell — the 3-word V8 ABI leaves no
room in the frame itself) and write them back on pop. New fixture test drives
exactly this through Array::Iterate.

Also gate the duckdb smoke test on NODE_MODULE_VERSION <= 137: the deprecated
duckdb package publishes no prebuilts past ABI 137 (checked npm.duckdb.org
for 1.3.1-1.4.1), and node-pre-gyp's fallback source build is not viable in
CI. The gate can go away if the test migrates to the N-API based
@duckdb/node-api.
Matches node's ordering and semantics (verified against v26.3.0):
validateAbortSignal accepts any object with an 'aborted' property, so a
duck-typed { aborted: true } takes the abort fast path — but objects
without 'aborted' and non-objects throw ERR_INVALID_ARG_TYPE
synchronously, before the fast path can read .aborted.
Azure repeatedly failed to allocate Standard_D4ds_v6 in-region
(AllocationFailed) during image bakes; D4as_v7 is one of the alternatives
Azure's allocation guidance suggests. Build-only VM — CI runner sizes are
unchanged in ci.mjs.
A build without the [build images] tag fell back to published v35 images,
and stale v35 alpine images exist from before the Node 26.3.0 install landed
(bootstrap already said Version: 35 while still installing 24.3.0) — agents
came up with Node 24, failing the version check and building ABI-137 addons
that an ABI-147 runtime rightly refuses to load. Move to v36/v21 so the
poisoned tags can never be selected again.
…r bun [build images]

- Http2Session.close() (client and server) now sends GOAWAY immediately like
  Node (core.js marks the session closed and calls goaway() right away), so
  peers stop routing new work to a draining session; the client goaway()
  gained Node's default arguments. The grpc-js unbind test additionally
  accepts CANCELLED: a call that wins the race onto the draining session
  before the GOAWAY/socket-close is processed is torn down with
  NGHTTP2_CANCEL — the same in-flight teardown Node performs at socket close;
  which of the three statuses you get is pick-timing.
- Run every test-suite node-gyp build under bun (--bun): the gyp -D defines
  could not neutralize the thin-LTO flags a clang-cl-built Node 26 carries in
  process.config.target_defaults (node-gyp copies them into config.gypi and
  MSVC's link.exe fails on /opt:lldltojobs), and the system Node's ABI may
  not match ours at all. Bun reports the same ABI (147) with clean
  target_defaults, so the produced modules load in node 26 unchanged. The
  dlopen fixtures invoke the bun under test explicitly and their beforeAll
  hooks get a real timeout — the builds legitimately exceed the 5s default
  under a debug/ASAN binary.
Same rationale as the previous commit's dlopen/napi change — this invocation
was missed: gyp -D defines can't override the thin-LTO flags a clang-cl-built
Node carries in process.config.target_defaults, so build the node-side module
with bun (--bun) as well; it targets the same ABI (147) and loads in node 26
unchanged. Also drop the now-redundant defines from the bun-side invocation.
…ness [build images]

common.gypi only applies CLANG_CXX_LANGUAGE_STANDARD (gnu++20) to addon
builds when the clang variable is 1, which real Node reports on macOS. With
bun running node-gyp and reporting clang=0, Apple clang compiled addons at
its ancient default standard and every node-gyp build on darwin failed on
<type_traits> (std::is_enum_v) in the Node 26 headers.

The v8 harness now includes the child's exit code in crash messages — the
Windows fixture crashes currently report nothing, and the code
distinguishes an access violation from a DLL load failure.
sharp has no win32-arm64 prebuilt, so its install script falls back to a
node-gyp source build under the system Node — which the clang-cl-built
Node 26 breaks (process.config leaks thin-LTO flags that MSVC's link.exe
rejects with LNK1117). The test exercises package-lock.json migration, not
lifecycle scripts, so pass --ignore-scripts there.
…[build images]

The puppeteer-based next-pages tests still hit flaky per-run Chrome for
Testing downloads on ubuntu/debian x64 (the only CI platforms with neither a
system browser nor a skip gate). Bake google-chrome-stable into those images
and teach the browser lookups about its binary names — with a system browser
present the tests skip the ~300MB download entirely and launch the system
binary, like they already do on alpine.
These were in symbols.txt/.dyn (so Linux and macOS resolved them) but never
in the Windows .def. Node addons delay-load the host's exports, so the gap
only surfaces at the first call: the v8 fixture tests that exercise these
type checks died with the delay-load "procedure not found" SEH status
(0xC06D007F, reported as exit code 127) after printing their first lines.
Manglings verified against clang-cl and the remaining 83 v8 exports audited
for further gaps (none).
…[build images]

The exit-127 crashes on windows-aarch64 print nothing: stdout shows the test
started and stderr is empty. The exe exports every referenced v8 symbol
(checked both pre- and post-61197 artifacts against the addon's full import
set), so the delay-load theory needs narrowing. Markers go to stderr, which
checkSameOutput never compares but does include in crash messages, so the
next CI run names the exact call that dies.
Root cause of the Windows --debug fixture crashes (exit 127), found by
reproducing under wine with the CI binary and a clang-cl-built debug addon:
V8_INLINE members whose bodies live at namespace scope in the headers —
HandleScope::CreateHandle(v8::Isolate*, Address), HandleScope::Initialize,
and Value::QuickIsUndefined/QuickIsNull/QuickIsNullOrUndefined/QuickIsString
— are imported rather than emitted by MSVC debug builds of a dllimport
class. Real Node exports its entire V8 surface so this works there; Bun's
curated export list didn't include them, and the delay-load helper's
"procedure not found" SEH status (0xC06D007F) surfaces as exit code 127 at
the first call. That also explains the seemingly random crash points across
tests: each died at its own first imported-inline call.

Implementations are thin: the CreateHandle overload forwards to the
internal::Isolate* variant (same object), Initialize performs V8's inline
frame setup (snapshot next/limit, level++ — never pushing a Bun scope, like
EscapableHandleScopeBase), and the QuickIs* checks are the corresponding
JSC value checks. Exported on all platforms for symmetry; ELF builds emit
inline copies locally so only Windows strictly needs them. Verified under
wine: with the missing import resolved, test_v8_global,
test_v8_escapable_handle_scope_inline_grants and
test_v8_locals_survive_nested_call all pass with a debug-CRT fixture.
…d images]

A graceful close() sends GOAWAY through the JS default lastStreamId = 0, and
the native parser only fell back to the session's real last stream id for
undefined/null — a literal 0 went on the wire, telling the peer per RFC 9113
§6.8 that no streams were processed and everything is retriable. Node's JS
layer has the same 0 default and relies on its native correction
(lastStreamID <= 0 → nghttp2_session_get_last_proc_stream_id); mirror that
in the parser so every caller (close(), destroy(), user goaway) is covered.

Also: the server session's goaway handler now matches the client's (and
Node's onGoawayData) — NO_ERROR begins a graceful close, any other code
destroys with ERR_HTTP2_SESSION_ERROR; corrected a comment that claimed
duck-typed { aborted } signals throw (validateAbortSignal accepts any object
with an 'aborted' property); and stripped the v8 fixture's temporary stderr
step markers now that the Windows exit-127 is root-caused.
A server session that received a graceful GOAWAY marks itself closed and
defers destruction to the moment its last stream closes. If the peer's
socket disappears before that stream-close arrives (the client sends
GOAWAY and ends the socket in the same tick, and on macOS the frame and
the EOF are processed in one read batch), that moment never comes:
the socket-close handler only called close(), whose closed/destroyed
early-return now makes it a no-op, so the session — and the server's
open-connection count — stayed alive forever and server.close() never
completed. Node's socketOnClose unconditionally tears the session down
(close() + closeSession()); do the same by following close() with
destroy(), mirroring what the client session's #onClose already does.

Fixes the test-http2-compat-serverrequest-pipe.js timeout on the darwin
CI fleet (diagnosed with an instrumented build: all request/response
events completed, server.close() never fired).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/js/node/test/parallel/test-http2-compat-serverrequest-pipe.js (1)

12-120: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove debug-only instrumentation from this upstream-sync Node test before merge.

This adds non-upstream watchdog/event-trail behavior (including forced process.exit(7)), which breaks verbatim-sync expectations for test/js/node/test/parallel/* and can change test behavior independent of runtime correctness. Please revert this file to upstream content and keep diagnostics in a non-sync/debug-only path.

Based on learnings: files under test/js/node/test/parallel/ should remain byte-identical to upstream Node tests, and local debug edits in those files should not be landed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/js/node/test/parallel/test-http2-compat-serverrequest-pipe.js` around
lines 12 - 120, The test has debug-only instrumentation (t0, events, ev(), state
object, watchdog timer with process.exit, dump(), and extra ev(...) calls) that
must be removed — revert the file to the upstream version: delete the watchdog
block (watchdog, watchdog.unref(), dump and process.exit(7)), remove the events
array/t0/ev() helpers and the state object usage, and undo added logging hooks
that reference
state/serverSock/serverSession/serverReq/serverRes/serverDest/clientReq/clientStr/clientSession;
restore original test behavior and event handlers so the file is byte-identical
to upstream.

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/js/node/test/parallel/test-http2-compat-serverrequest-pipe.js`:
- Around line 12-120: The test has debug-only instrumentation (t0, events, ev(),
state object, watchdog timer with process.exit, dump(), and extra ev(...) calls)
that must be removed — revert the file to the upstream version: delete the
watchdog block (watchdog, watchdog.unref(), dump and process.exit(7)), remove
the events array/t0/ev() helpers and the state object usage, and undo added
logging hooks that reference
state/serverSock/serverSession/serverReq/serverRes/serverDest/clientReq/clientStr/clientSession;
restore original test behavior and event handlers so the file is byte-identical
to upstream.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8122a21e-75bc-4f23-878c-812f021ac8c7

📥 Commits

Reviewing files that changed from the base of the PR and between 06d2b24 and 537898b.

📒 Files selected for processing (1)
  • test/js/node/test/parallel/test-http2-compat-serverrequest-pipe.js

@cirospaciari cirospaciari force-pushed the claude/h2-pipe-debug branch from b4a7867 to 63a86ef Compare June 9, 2026 19:21
@cirospaciari

Copy link
Copy Markdown
Member Author

Diagnosis complete — the root cause (headers-only END_STREAM responses never closing the h2 stream in the frame parser) is fixed in #31991 (28c2ab4) and verified on the same darwin agents that reproduced the hang. A latent macOS kqueue FIN-redelivery quirk in uSockets observed during this investigation will be filed separately.

@cirospaciari cirospaciari deleted the claude/h2-pipe-debug branch June 10, 2026 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants