tls: enforce the SSLContextCache owning-thread invariant with debug assertions#31933
tls: enforce the SSLContextCache owning-thread invariant with debug assertions#31933robobun wants to merge 4 commits into
Conversation
|
Updated 1:24 PM PT - Jun 17th, 2026
❌ @robobun, your commit d8594ae has 3 failures in
🧪 To try this PR locally: bunx bun-pr 31933That installs a local version of the PR into your bun-31933 --bun |
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughAdds debug-only owning-thread tracking to SSLContextCache (captured at construction) and inserts debug assertions at cache entry points, the ex-data free callback, and Drop to ensure operations run on the owning JS thread. Tests added for SSL_CTX deduplication after GC and HTTP/2 session churn under GC pressure. ChangesThread ownership enforcement in SSLContextCache
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/node/tls/ssl-ctx-cache.test.ts`:
- Around line 225-237: oneSession() currently closes the HTTP/2 session only on
the success path and resolves on the "response" event (headers) rather than full
request completion; update it so that the request's "end" (or stream "end") is
awaited before resolving and ensure cleanup runs on all paths: attach error
handlers that reject the promise, move session.close() into a finally block, and
await once(session, "close") in that finally so the function only returns after
the request fully finishes or fails and the session is always closed; refer to
the oneSession function, the session and req variables,
Promise.withResolvers/resolve/reject, and the once(session, "close") usage when
making the changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b29f4092-46ad-4321-bf27-af44bf5c8ee2
📒 Files selected for processing (3)
packages/bun-usockets/src/crypto/openssl.csrc/runtime/api/bun/SSLContextCache.rstest/js/node/tls/ssl-ctx-cache.test.ts
There was a problem hiding this comment.
Both prior issues are addressed (release-arm dead_code, tombstone-adopt baseline re-capture) and the current diff looks correct — deferring since this stakes out a position against #31883's strong-ref approach and a maintainer should sign off on that call.
Extended reasoning...
Overview
This PR hardens SSLContextCache.rs with debug-only owning-thread assertions, corrects the module doc comment about what the mutex does/doesn't protect, fixes a stale "Defined in Zig" comment in openssl.c, and adds two regression tests (tombstone-adopt path, node:http2 session churn under GC pressure) to ssl-ctx-cache.test.ts.
The Rust change is: drop #[derive(Default)] for a manual impl Default that captures std::thread::current().id() into a #[cfg(debug_assertions)]-gated field, plus a debug_assert_owner_thread() helper called from get_or_create_digest, bun_ssl_ctx_cache_on_free, and Drop. In release builds the field doesn't exist and the helper is an empty #[inline] function, so release behavior is unchanged.
Security risks
None introduced. The PR doesn't touch cert validation, handshake logic, cipher config, or any release-build codepath. It's debug instrumentation + comments + tests around an existing cache.
Level of scrutiny
Moderate. The mechanical change is small and low-risk (debug-only assertions following the VirtualMachine.debug_thread_id precedent), but the PR explicitly takes a design position — assertions instead of #31883's strong-ref approach — backed by a detailed consumer audit in the description. That audit's conclusion (the resurrection race is unreachable today) is what justifies assertion-only over a behavior change, and a maintainer familiar with the SSL_CTX consumer landscape should validate it.
Other factors
- Both inline issues I raised on the prior revision are fixed in the current head: the release-build
dead_code = "deny"failure (the#[cfg(not(debug_assertions))] owner_thread: ()arm is gone entirely) and the tombstone-adopt test's stale-baseline flake (now re-capturesdrainedafter the GC loop). - CodeRabbit's
oneSession()cleanup concern is also addressed (try/finally withclosedarmed upfront, resolve onend). - CI failures on the latest build are infrastructure (Windows ARM64 agent creation, musl LTO data-layout) plus an unrelated
bunx.test.ts— none touch the changed files. - The new http2 churn test relies on GC-driven
sslCtxLiveCount()deltas with a<= 2headroom bound; that follows the same pattern as the four existing tests in this file, so flakiness risk is no worse than what's already shipping.
There was a problem hiding this comment.
Both earlier issues are fixed in the current diff — nothing further from me. Deferring to a human since this encodes a threading-model decision for the TLS SSL_CTX cache (debug assertions over the strong-ref alternative in #31883) that a maintainer should sign off on.
Extended reasoning...
Overview
Three files touched: src/runtime/api/bun/SSLContextCache.rs (adds a debug-only owner_thread: ThreadId field captured in Default::default(), a debug_assert_owner_thread() helper, and calls to it from get_or_create_digest, the bun_ssl_ctx_cache_on_free ex-data callback, and Drop; rewrites the module doc to correctly describe why the mutex cannot defend an off-thread SSL_CTX_free), packages/bun-usockets/src/crypto/openssl.c (comment-only: fixes the stale "Defined in Zig" reference and notes the Rust-side thread assertion), and test/js/node/tls/ssl-ctx-cache.test.ts (two new tests: tombstone-adopt path and a 30-session node:http2 churn under interleaved GC).
My two prior inline findings — the release-build dead_code = "deny" failure on the private ()-typed owner_thread field, and the post-GC toBe(before + 1) flake — are both fixed in the current diff (the release arm is dropped entirely so the field exists only under #[cfg(debug_assertions)]; the test re-captures drained after the GC loop and asserts toBe(drained + 1)). CodeRabbit's oneSession() cleanup concern is also applied: closed is armed up front, resolution is on req.once("end"), and session.close() runs in finally. The new entry.owner.debug_assert_owner_thread() call site uses the same BackRef Deref path the existing entry.owner.mutex.lock_guard() line already relies on (src/ptr/lib.rs:187).
Security risks
No new attack surface: release builds are byte-for-byte behaviorally unchanged (the only executable additions are gated on #[cfg(debug_assertions)]), the openssl.c change is comment-only, and the tests are additive. The PR is about a security-relevant invariant (a 0→1 SSL_CTX_up_ref resurrection UAF if a cache-managed ctx's last ref were ever dropped off-thread) but documents/asserts it rather than changing the mechanism.
Level of scrutiny
Medium-high. The diff itself is mechanically simple, but it lives in TLS/crypto infrastructure and explicitly encodes a design choice — keep the cache weak and enforce single-threaded last-ref-free via debug assertion — over the alternative proposed in #31883 (make the cache hold strong refs to close the race structurally, at the cost of pinning one SSL_CTX per distinct config for the VM lifetime). That trade-off, and the consumer audit in the PR description backing the "unreachable today" claim, is the kind of architectural call a maintainer should ratify rather than a bot.
Other factors
All reviewer threads on the PR are resolved, the bug-hunting pass on the current head found nothing, and the PR author reports the full test/js/node/tls/ and test/js/node/http2/ suites pass on a debug/ASAN build with the new assertions never firing. Given the security-adjacent file and the design-decision nature, I'm deferring rather than approving.
|
Status for reviewers: the diff is final and the branch is ready. Build 63168 on the rebased head d8594ae has one error-level failure and three warning-level failures, none of which touch this diff:
Grepping the full annotation set for All review threads are resolved; the remaining open item is the design call documented in the description (keep the weak cache and enforce the owning-thread invariant with debug assertions, versus the strong-ref approach in #31883), which needs a maintainer's sign-off. |
…ssertions The cache doc comment claimed the mutex protects against a future caller that frees a cache-managed SSL_CTX off the JS thread. It cannot: BoringSSL decrements the refcount before the CRYPTO_EX_free callback takes the lock, so a concurrent get_or_create could still see a live entry and up_ref 0 to 1, resurrecting a context whose destruction already committed. Correct the comment and assert in debug builds that get_or_create, the tombstone callback, and Drop all run on the thread that constructed the cache. Adds two churn tests: the tombstone-adopt path, and node:http2 new-session-per-request churn with interleaved GC.
The release arm's unit placeholder field is private and never read, which fails the release build under deny(dead_code). Also close the http2 churn helper's session on error paths and resolve on request end instead of response headers.
The sweeps can also reclaim stragglers from earlier tests, so asserting against the pre-loop baseline could fail when the loop lands below it.
… change User-facing tls.createSecureContext() now builds its SSL_CTX privately (bypassing the per-digest cache) so addCACert cannot leak across contexts; the internal connect paths still go through the cache. The http2 churn test already exercises the cache's tombstone/rebuild path via tls.connect.
5e8b2a9 to
d8594ae
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/bun-usockets/src/crypto/openssl.c`:
- Around line 293-304: The second call to i2d_SSL_SESSION when assigning
pending->length does not validate its return value before casting to uint32_t.
If the second i2d_SSL_SESSION call fails and returns a value less than or equal
to 0, the cast will produce an invalid unsigned length value that causes buffer
over-read when memcpy uses pending->length later. Validate the return value from
the second i2d_SSL_SESSION call, verify it is positive and matches the length
from the first call, and if validation fails, free the pending allocation and
return 0 to prevent the buffer overflow vulnerability.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 46f76570-9f30-4414-b5bd-d1cbf39de410
📒 Files selected for processing (2)
packages/bun-usockets/src/crypto/openssl.csrc/runtime/api/bun/SSLContextCache.rs
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/bun-usockets/src/crypto/openssl.c`:
- Around line 293-304: The second call to i2d_SSL_SESSION when assigning
pending->length does not validate its return value before casting to uint32_t.
If the second i2d_SSL_SESSION call fails and returns a value less than or equal
to 0, the cast will produce an invalid unsigned length value that causes buffer
over-read when memcpy uses pending->length later. Validate the return value from
the second i2d_SSL_SESSION call, verify it is positive and matches the length
from the first call, and if validation fails, free the pending allocation and
return 0 to prevent the buffer overflow vulnerability.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 46f76570-9f30-4414-b5bd-d1cbf39de410
📒 Files selected for processing (2)
packages/bun-usockets/src/crypto/openssl.csrc/runtime/api/bun/SSLContextCache.rs
🛑 Comments failed to post (1)
packages/bun-usockets/src/crypto/openssl.c (1)
293-304:
⚠️ Potential issue | 🟠 Major🧩 Analysis chain
🌐 Web query:
In BoringSSL/OpenSSL, cani2d_SSL_SESSION(session, &out)return<= 0on failure, and should callers validate the second encode result even after a prior successfuli2d_SSL_SESSION(session, NULL)size probe?💡 Result:
Yes, i2d_SSL_SESSION can return values <= 0 upon failure [1][2]. Official documentation typically specifies that the function returns the size of the ASN.1 representation in bytes, or 0 if the session is invalid [3][4][5]. However, in practice, it is possible for the function to return negative values or zero in error conditions (such as invalid sessions or internal allocation failures), and callers must account for these non-positive outcomes [1][2]. Regarding your second question: Yes, callers should always validate the result of the second i2d_SSL_SESSION call, even after a successful "probe" call (where pp=NULL) [2]. Relying on the size returned by the first call does not guarantee the second call will succeed. The session object's state could potentially change between calls, or the second operation could fail for other reasons (e.g., memory allocation errors) [1]. Robust implementations should check that the return value of the second call matches the expected size and is strictly greater than 0 before proceeding to use the buffer [2][5]. Top results: [1][2][5]
Citations:
- 1: http://mailman.nginx.org/pipermail/nginx-devel/2017-June/010163.html
- 2: nodejs/node@3cc8fca299
- 3: https://man.openbsd.org/i2d_SSL_SESSION.3
- 4: https://manpages.debian.org/unstable/libssl-doc/i2d_SSL_SESSION.3ssl.en.html
- 5: https://sortix.org/man/man3/i2d_SSL_SESSION.3.html
Validate the second
i2d_SSL_SESSIONresult before assigningpending->length.Line 303 trusts the second encode return unconditionally. If it returns
≤ 0, the cast touint32_tproduces a massive unsigned value that poisons thepending->lengthfield. Line 318'smemcpy(dest, pending->data, pending->length)then reads far beyond the allocated buffer (which holds only the size from the first call), causing buffer over-read and potential crash or information leak.Suggested fix
struct us_ssl_pending_session_t *pending = malloc(sizeof(struct us_ssl_pending_session_t) + (size_t)length); if (!pending) { return 0; } unsigned char *out = pending->data; - pending->length = (uint32_t)i2d_SSL_SESSION(session, &out); + int encoded_len = i2d_SSL_SESSION(session, &out); + if (encoded_len <= 0 || encoded_len > length) { + free(pending); + ERR_clear_error(); + return 0; + } + pending->length = (uint32_t)encoded_len; pending->next = NULL;Per OpenSSL/BoringSSL contract,
i2d_SSL_SESSION()can fail on the second call even after a successful size probe. Always validate the result before use, especially when casting to an unsigned type.🧰 Tools
🪛 Cppcheck (2.21.0)
[error] 302-302: Memory is allocated but not initialized
(uninitdata)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/bun-usockets/src/crypto/openssl.c` around lines 293 - 304, The second call to i2d_SSL_SESSION when assigning pending->length does not validate its return value before casting to uint32_t. If the second i2d_SSL_SESSION call fails and returns a value less than or equal to 0, the cast will produce an invalid unsigned length value that causes buffer over-read when memcpy uses pending->length later. Validate the return value from the second i2d_SSL_SESSION call, verify it is positive and matches the length from the first call, and if validation fails, free the pending allocation and return 0 to prevent the buffer overflow vulnerability.Source: Coding guidelines
|
On the CodeRabbit finding at |
What
Hardens the per-VM weak
SSL_CTXcache (SSLContextCache.rs, added in #30024) around the invariant its safety actually rests on, prompted by the analysis in #31883 / #31881.SSL_CTX_freedropsreferencesto 0 beforeCRYPTO_free_ex_datareachesbun_ssl_ctx_cache_on_free, so between the decrement and the tombstone write a concurrentget_or_createon the JS thread would still seeentry.ctx != nullandSSL_CTX_up_ref0 to 1, resurrecting a ctx whose destruction already committed (use-after-free). The mutex only orders the tombstone write against the load+up_ref; it cannot undo a refcount that already hit 0.get_or_create_digest,bun_ssl_ctx_cache_on_free(which runs on whichever thread drops the last ref), andDropnow assert they run on the thread that constructed the cache (init_runtime_state, i.e. the owning VM's JS thread). Same pattern asVirtualMachine.debug_thread_id/timer::All.thread_id. Release builds are unchanged.openssl.c.Is the resurrection race reachable today?
No, as far as a full consumer audit can tell, which is why this is an assertion and not a behavior change. #31883 proposes making the cache hold strong refs to close the race, but patches the dormant
SSLContextCache.zigreference file, and the premise (anode:http2teardown dropping the last ref off the JS thread) does not hold on main or on v1.3.14:Bun.connect/node:net,upgradeTLS,upgradeDuplexToTLS,SecureContext/tls.connect,defaultClientSslCtx, WebSocket client, SNIaddServerName, Valkey, Postgres, MySQL) release their ref on the VM loop thread or in a JSC finalizer (mutator thread, same thread).HTTPContext.rs), never through the per-VM cache; their ex_data slot is null so the tombstone callback no-ops.SSL_CTXunref runs in the loop drain.SSLsynchronously inus_internal_ssl_detachon the loop thread; there is no deferredSSL_CTXfree list.The assertions make that audit permanent: any future consumer that ships a cache-descended ref to another thread trips the assert in every debug/ASAN run instead of shipping a heisen-UAF. Strong refs were not adopted because they would also pin one
SSL_CTX(cert chain, CA store) per distinct config for the VM's lifetime, a real cost for per-hostname cert servers usingaddServerName, to close a window that is currently unreachable.Tests
Adds a
node:http2new-session-per-request churn test totest/js/node/tls/ssl-ctx-cache.test.ts(30 sessions, 5 concurrent) with interleavedBun.gc(true), the firebase-admin pattern from #31881, asserting every request completes and the liveSSL_CTXcount stays flat.http2.connectgoes throughtls.connect, which still takes the internalinternpath through the per-digest cache, so this covers the cache's tombstone/rebuild lifecycle and exercises the new assertions on every last-ref free in debug builds.This is a regression canary for the cache lifecycle, not a fail-before test: the assertion added here is not reachable from JS today (that is the point of the audit above), so the test passes on current release builds as well. On debug builds it additionally exercises the assertion on every tombstone.
Rebase onto #31155
#31155 routes user-facing
tls.createSecureContext()through a private, uncached path (soaddCACerton one context cannot affect another) while the internal connect paths keep using the per-digest cache. The conflict inopenssl.cwas the new session/keylog infrastructure inserted just above thebun_ssl_ctx_cache_on_freedeclaration; resolved by keeping both. A tombstone-adopt test this PR originally added viacreateSecureContextis dropped: that entry point no longer goes through the cache, so it cannot reach the tombstone path; the http2 churn test (which uses the cached internal path) covers the same lifecycle.Verification: full
test/js/node/tls/andtest/js/node/http2/suites run against this change on a debug (ASAN) build; the assertion never fires. The two failures in those suites reproduce identically on clean main: a hostname-verification test failure, and a SEGV inBIO_ctrlfrom the upgraded-duplex teardown already fixed by #31364.