Skip to content

fix: SSLConfig intern/deref race causing segfault in proxy tunnel setup#27838

Merged
Jarred-Sumner merged 14 commits into
mainfrom
claude/fix-sslconfig-intern-deref-race
Mar 8, 2026
Merged

fix: SSLConfig intern/deref race causing segfault in proxy tunnel setup#27838
Jarred-Sumner merged 14 commits into
mainfrom
claude/fix-sslconfig-intern-deref-race

Conversation

@cirospaciari

@cirospaciari cirospaciari commented Mar 5, 2026

Copy link
Copy Markdown
Member

Problem

Segfault at address 0x0 in create_ssl_context_from_bun_options during proxy tunnel setup (Bun v1.3.11, linux x86_64_baseline). The crash is strlen(NULL) on a freed cert/key string.

Root cause

SSLConfig.GlobalRegistry is a dedup cache keyed by TLS option content. It stored raw pointers without contributing to refcount — a weak cache — but intern() called plain ref() on what it found, which blindly does fetchAdd(1) with only a debugAssert(old > 0) (no-op in release).

HTTP thread (deref, last holder) JS thread (intern, new fetch same TLS opts)
fetchSub 1→0, enters destroy()
locks mutex, getOrPut finds dying entry (still in map)
ref()0→1 (resurrection, no check in release)
unlocks, returns dangling pointer to new fetch
locks mutex in remove(), evicts, unlocks
deinit() frees cert/key strings
allocator.destroy()
proxy tunnel builds SSL ctx from freed config → strlen(NULL)

Proxy amplifies this: ProxyTunnel.start() creates a second SSL context from tls_props after the CONNECT round-trip, giving a multi-millisecond window for another request's teardown to win the race.

Fix: Arc/Weak split refcounting

This is the Rust Arc<T>/Weak<T> pattern (and C++ shared_ptr/weak_ptr): two atomic counters instead of one.

New primitive: ThreadSafeWeakableRefCount

Added to src/ptr/ref_count.zig alongside ThreadSafeRefCount. General-purpose, reusable for any type that needs weak references.

strong: atomic u32    // live users; 1->0 calls drop_contents()
weak:   atomic u32    // weak holders + (1 if strong > 0); 1->0 calls free_memory()

The +1 on weak is the "collective" weak ref held on behalf of all strong refs. It guarantees the struct allocation stays live across the strong 1→0 → drop_contents() window, so upgrade() is memory-safe as long as any weak ref exists.

Method Does
ref() / deref() bump/drop strong; at strong 1→0: drop_contents() then drop collective weak
weakRef() / weakDeref() bump/drop weak; at weak 1→0: free_memory()
upgrade() CAS-loop on strong, only increments if currently > 0 — never revives a dead object

SSLConfig wiring

const RC = bun.ptr.ThreadSafeWeakableRefCount(@This(), "ref_count", dropContents, freeMemory, .{});

fn dropContents(this: *SSLConfig) void {  // strong 1->0
    GlobalRegistry.remove(this);          // while content intact (map eql needs it)
    this.deinit();                        // free strings
}
fn freeMemory(this: *SSLConfig) void {    // weak 1->0
    bun.default_allocator.destroy(this);
}

Registry now holds weak refs

intern()

  1. getOrPut by content.
  2. Found existing? upgrade() it:
    • Success (strong was > 0): got a real ref. Free new_config, return existing.
    • Fail (strong is 0, dying): registry weakDerefs the old entry, replaces the slot with new_config.
  3. Registry weakRefs the winning entry.

remove() (called from dropContents)

  • Look up by content hash, check pointer identity. If intern() replaced our slot while we were blocked on the mutex, the pointer won't match → no-op (intern already dropped our weak ref).
  • Otherwise evict and weakDeref.

Why upgrade() on strong==0 is not a UAF

The weak count holds the allocation. Registry holds a weak ref → weak_count ≥ 1 → struct memory is live. The CAS reads a live atomic. Contents may be garbage (mid-deinit()), but upgrade() only touches strong_count — never the content.

This is the key difference from a single-counter design: memory safety is compositional (refcounts alone guarantee it), not entangled with mutex ordering. The mutex here only protects map structure and the invariant that entry content is intact while in the map.

Corrected race

HTTP thread JS thread Result
strong 1→0, blocked in remove() upgrade() fails, weakDeref old (2→1), replace slot, weakRef new old: remove() ptr-mismatch no-op → deinit → collective weakDeref (1→0) → free. JS thread has a fresh, valid config.

Also

  • packages/bun-usockets/src/crypto/openssl.c: NULL guards before strlen(content) in us_ssl_ctx_use_privatekey_content and us_ssl_ctx_use_certificate_chain. Defense-in-depth — turns a segfault into a clean SSL error if a NULL ever slips through.
  • test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts: stress test firing overlapping waves of 8 concurrent proxy fetches with identical TLS options and keepalive: false (forces immediate deref on completion, no keepalive pool masking the race).

Reproduction & verification

See #27863 for the full reproduction recipe. With a debug+ASAN build and BUN_DEBUG_SSLConfig=1 (which widens the race window via stderr logging), the repro script crashes 3/3 on main and passes 3/3 on this branch.

Why it proves the same crash as production:

The production segfault at openssl.c:1173 and our debug assertion failure at ref_count.zig:476 are the same root cause, caught at different points:

  [HTTP thread]                   [JS thread]
  deref() fetchSub 1→0
    ↓ (race window)
                                   intern() finds dying config
                                   ref() → old_count = 0
                                   ┌────────────────────────────┐
                                   │ DEBUG: assertValid → panic │ ← repro catches here
                                   │ RELEASE: silently succeeds │
                                   └────────────────────────────┘
  destroy/deinit/free              returns dangling *SSLConfig
                                   stored in client.tls_props
                                   ... proxy CONNECT ...
                                   tls_props.?.* → reads freed struct
                                   strdup(garbage) → segfault at 0x0  ← production crash

In release, debugAssert(old_count > 0) is a no-op. ref() silently does 0→1, intern() returns the dangling pointer, and the crash surfaces ~200μs later when startProxyHandshake dereferences the freed struct. Same race, same dangling pointer — debug catches it at the source, release at the symptom.

On this branch: upgrade() uses a CAS loop that refuses the 0→1 bump. When it fails, intern() replaces the map slot with a fresh config instead of resurrecting the dying one. The repro logs show many clean deref 1 - 0 to different addresses (fresh allocations) with zero ref 0 - 1.

The SSLConfig.GlobalRegistry weak dedup cache did not hold a strong
reference on its entries. When the last external holder dereferenced a
config on the HTTP thread while a new fetch() with identical TLS options
interned the same content on the JS thread, intern() could return a
pointer whose refcount had already hit 0. The proxy tunnel later
dereferenced freed cert/key memory → strlen(NULL) → segfault at 0x0.

The registry now holds a +1 ref on every entry so intern() always sees
a live object. Entries are evicted via a 2→1 transition check under the
registry mutex. Also adds defense-in-depth NULL guards in openssl.c
before strlen() on cert/key content pointers.
@robobun

robobun commented Mar 5, 2026

Copy link
Copy Markdown
Collaborator
Updated 7:22 PM PT - Mar 6th, 2026

@cirospaciari, your commit 767002a has 1 failures in Build #38797 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 27838

That installs a local version of the PR into your bun-27838 executable, so you can run:

bun-27838 --bun

Replace the previous strong-ref registry + 2->1 eviction trick with
the standard Rust Arc/Weak split-count pattern:

- New ThreadSafeWeakableRefCount mixin in src/ptr/ref_count.zig
  tracking strong and weak counts separately. strong 1->0 calls
  drop_contents(), weak 1->0 calls free_memory(). weak_count carries
  a +1 "collective" ref on behalf of all strong refs so the allocation
  outlives drop_contents().

- GlobalRegistry holds WEAK refs on entries. intern() uses upgrade()
  (CAS-loop, only bumps strong if > 0) to safely try reviving an entry.
  If upgrade fails the dying entry is swapped out in-place; its remove()
  then no-ops via pointer-identity check.

Memory safety is now compositional (refcounts alone) rather than
entangled with mutex ordering. The mutex only guards map structure
and the "entry content is intact while in map" invariant.
@cirospaciari cirospaciari marked this pull request as ready for review March 5, 2026 23:31
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds NULL checks in OpenSSL helper functions; exports and implements a thread-safe weakable reference-count type; refactors SSLConfig to use weakable ref-counting with lifecycle callbacks and adjusted GlobalRegistry intern/remove semantics; adds a concurrency regression test exercising intern/deref races for SSLConfig.

Changes

Cohort / File(s) Summary
OpenSSL Safety
packages/bun-usockets/src/crypto/openssl.c
Added early NULL checks in OpenSSL helper functions (us_ssl_ctx_use_privatekey_content, add_ca_cert_to_ctx_store, us_ssl_ctx_use_certificate_chain) to return 0 when content is NULL and avoid creating a BIO from a NULL pointer.
Weakable Ref-Count Primitive
src/ptr.zig, src/ptr/ref_count.zig
Exported ThreadSafeWeakableRefCount in src/ptr.zig; added a thread-safe weakable ref-count implementation in src/ptr/ref_count.zig (separate strong/weak counts, ref/deref/weakRef/weakDeref/upgrade, drop_contents and free_memory callbacks). Note: the new ref-count declaration appears duplicated in the patch.
SSLConfig Lifecycle & Registry
src/bun.js/api/server/SSLConfig.zig
Replaced ThreadSafeRefCount with ThreadSafeWeakableRefCount; added dropContents and freeMemory callbacks; introduced weak helpers (weakRef, weakDeref, upgrade); removed destroy; refactored GlobalRegistry.intern and remove to use weakable semantics and ensure intern returns a config with a single strong reference in observed paths.
Concurrency Regression Test
test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts
Added a stress test that runs many concurrent fetches through an HTTP CONNECT TLS proxy to exercise races between GlobalRegistry.intern and deref/removal of SSLConfig entries.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: SSLConfig intern/deref race causing segfault in proxy tunnel setup' accurately describes the main fix: addressing a race condition in SSLConfig reference counting that caused segfaults in proxy scenarios.
Description check ✅ Passed The description comprehensively covers the Problem, Root Cause, and Fix sections with detailed technical explanation and verification strategy, exceeding the basic template requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ptr/ref_count.zig`:
- Around line 344-416: Add the metadata fields expected by RefPtr into the
returned RC struct: define a public constant is_ref_count = true and a public
constant ref_count_options = options (or equivalent exposure of the options)
inside the struct returned by ref_count (alongside strong/weak/debug_name/scope)
so the mixin type exposes is_ref_count and ref_count_options for RefPtr
interoperability; place them at top-level in the struct so functions like
getRefCount, ref, deref, upgrade see the same type metadata.

In `@test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`:
- Around line 11-13: Update the test header comment to describe the current
weak-reference semantics: replace the outdated “+1 strong ref / 2→1 eviction”
wording with a note that the registry holds weak entries and that callers must
call intern() then use upgrade() to obtain a strong reference; also mention that
eviction occurs when upgrade() fails (i.e., external refs drop and weak entries
cannot be upgraded) and that registry operations are coordinated under the
registry mutex. Reference intern(), upgrade(), and the registry in the new
comment.
- Around line 99-101: The finally block currently calls proxy.server.close()
without waiting for shutdown; change teardown to await the server's close
completion by converting proxy.server.close() into an awaited operation (e.g.,
wrap proxy.server.close(callback) into a Promise or await an emitted "close"
event) so that the test waits for proxy.server to fully close before exiting;
update the finally to await that promise to prevent test-handle leakage
involving proxy.server and its "close" event.
- Around line 24-46: The test server in createConnectProxy currently assumes the
full CONNECT request arrives in one packet via client.once("data"), which can be
fragmented; change the logic to accumulate chunks on client.on("data") into a
buffer until you find the header terminator ("\r\n\r\n"), only then parse the
CONNECT line to extract host and port, set up the upstream connection, forward
any extra bytes after the header, and then remove the data-accumulation listener
(or switch to piping) to avoid double-processing; ensure existing error handlers
(upstream.on("error") and client.on("error")) remain and that you handle socket
close/end to avoid leaks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1bf5a5d8-623e-477b-bbb2-9808d6bb5937

📥 Commits

Reviewing files that changed from the base of the PR and between 3832c85 and f98c514.

📒 Files selected for processing (5)
  • packages/bun-usockets/src/crypto/openssl.c
  • src/bun.js/api/server/SSLConfig.zig
  • src/ptr.zig
  • src/ptr/ref_count.zig
  • test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts

Comment thread src/ptr/ref_count.zig Outdated
Comment thread test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts Outdated
Comment thread test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts
Comment thread test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts
- Update header comment to describe Arc/Weak approach (was stale 2->1)
- Buffer CONNECT header chunks until \r\n\r\n instead of assuming
  single-packet delivery
- Await proxy server "close" event for deterministic teardown

@robobun robobun left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Arc/Weak split refcounting for SSLConfig GlobalRegistry

Verdict: Approve

What I checked

  1. ThreadSafeWeakableRefCount primitive — Correct Rust Arc<T>/Weak<T> port. init() starts at strong=1, weak=1. The collective weak ref on behalf of all strong holders ensures the allocation survives through drop_contents(). The CAS loop in upgrade() correctly prevents 0→1 strong resurrection.

  2. GlobalRegistry.intern() — three paths, all sound:

    • New entry: getOrPut inserts; registry takes a weak ref → weak=2.
    • Existing + upgrade succeeds: Free the new config directly, return existing with strong bumped. No leak.
    • Existing + upgrade fails (dying): Existing's deref() is blocked in remove() waiting for the mutex, so content is intact for the hash/eql lookup that found it. Slot replaced in-place; existing's weak ref dropped, new config's weak ref taken. Dying config's remove() will no-op on pointer-identity mismatch.
  3. GlobalRegistry.remove()getIndexContext + pointer-identity guard handles the replacement race correctly. If intern() replaced our slot while we waited on the mutex, we skip and let intern()'s weakDeref handle our registry weak ref.

  4. Lifecycle splitdropContents() evicts from registry, then deinits fields. freeMemory() frees the struct allocation. Ordering is critical and correct: map hash/eql see intact content during eviction.

  5. Defensive NULL checks in openssl.c — Belt-and-suspenders for the strlen(NULL) crash path.

  6. Test — Stress test fires overlapping waves of proxy fetches with identical TLS options and keepalive: false. Properly accumulates CONNECT headers (not relying on single-packet assumption). Awaits proxy server close. As expected for a race condition, the test is probabilistic — ran it on the baked (pre-fix) debug binary and it passed (didn't hit the timing window), but the test will catch regressions over many CI runs.

  7. Bot threads — All 4 coderabbit review threads resolved (fragmented CONNECT headers, await proxy close, stale comment text, RefPtr metadata — all addressed or correctly dismissed).

What I ran

USE_SYSTEM_BUN=1 build/debug/bun-debug test test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts
# 1 pass, 0 fail, 49 expect() calls

CI build #38692 is in progress.

@robobun robobun left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The Arc/Weak split-refcount approach is the textbook fix for this class of race (0→1 resurrection on a dying object). The implementation is correct and well-documented.

What I checked

Correctness of ThreadSafeWeakableRefCount:

  • init() → strong=1, weak=1 (collective weak ref). ✓
  • deref() at strong 1→0 → drop_contents, then releases collective weak via weakDeref. ✓
  • upgrade() CAS loop correctly refuses to increment when strong==0. ✓
  • Weak 1→0 → free_memory. Memory lifecycle is correct. ✓
  • All atomics use seq_cst — conservative but correct for a general-purpose primitive.

SSLConfig.GlobalRegistry race analysis:

  • intern() with upgrade success: existing is alive, CAS bumped strong. Free sole-owned new_config directly. ✓
  • intern() with upgrade failure (strong==0): dying entry's deref() is blocked in remove() waiting on the mutex we hold → content still intact for hash/eql. Replace slot, transfer weak ref. ✓
  • remove() pointer-identity check: if intern() replaced the slot while we waited for the mutex, the pointer doesn't match → no-op. intern() already dropped our weak ref. ✓
  • Ref count lifecycle: after intern strong=1/weak=2; after last deref → remove() drops registry weak (2→1) → deinit() → collective weakDeref (1→0) → free_memory. ✓

OpenSSL null checks: Defense-in-depth content == NULL early returns in us_ssl_ctx_use_privatekey_content and us_ssl_ctx_use_certificate_chain. Good safety net.

Test: fetch-proxy-tls-intern-race.test.ts fires overlapping waves of proxy fetches with identical TLS options. Race condition tests are inherently probabilistic — the test passes on both the baked main binary and the PR build. Its value is as a regression stress test exercising the intern/deref code path under concurrency.

Bot threads: 4 total review threads, all resolved. Zero unresolved.

CI: Build #38692 is queued but agents haven't picked it up yet (infra delay, not a code issue). Will follow up when it completes.

No TODOs/FIXMEs in added lines. Code is clean and well-commented.

Comment thread packages/bun-usockets/src/crypto/openssl.c
Same strlen(content) pattern as the cert/key loaders; CA array entries
are equally exposed to the freed-pointer scenario.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the CA null guard. The Arc/Weak implementation looks well-designed, but this introduces a new lock-free concurrency primitive with CAS loops and split refcounting that touches SSL config lifecycle -- worth a human eye on the atomic ordering and registry edge cases.

Extended reasoning...

Overview

This PR fixes a segfault caused by a race between SSLConfig intern and deref during proxy tunnel setup. It introduces a new ThreadSafeWeakableRefCount primitive in src/ptr/ref_count.zig implementing the Rust Arc/Weak (split strong/weak refcounting) pattern, refactors SSLConfig.GlobalRegistry to use weak refs with upgrade() instead of plain ref(), adds defense-in-depth NULL guards in three OpenSSL C helper functions, and includes a concurrency stress test.

Security risks

The changes touch SSL/TLS configuration lifecycle management. The NULL guards in openssl.c are purely defensive and reduce attack surface. The Arc/Weak refcounting change itself is designed to prevent use-after-free of SSL config data. No new attack vectors are introduced, but correctness of the lock-free code is critical to avoiding memory safety issues.

Level of scrutiny

This PR deserves careful human review. It introduces a new lock-free concurrency primitive with atomic CAS loops, split refcounting, and specific ordering requirements. The upgrade() function relies on the invariant that weak > 0 guarantees the allocation is live even when strong == 0. The registry interaction (intern/remove under mutex with pointer-identity checks) has subtle edge cases around concurrent dying/replacement. While the design follows the well-established Arc/Weak pattern, the Zig implementation and its integration with the HashMap-based registry should be verified by someone familiar with the codebase concurrency model.

Other factors

All previous reviewer feedback (CodeRabbit and my own NULL guard finding) has been addressed. The regression test exercises the race condition with overlapping waves of concurrent proxy fetches. The PR description is thorough with detailed race-condition analysis. A build failure was reported on x64-musl-baseline but appears unrelated to the logic changes.

@alii

alii commented Mar 6, 2026

Copy link
Copy Markdown
Member

@robobun status of this pr?

@robobun

robobun commented Mar 6, 2026

Copy link
Copy Markdown
Collaborator

Status of PR #27838 (SSLConfig intern/deref race → segfault fix by @cirospaciari):

✅ Ready for merge

Gate Status
Review threads ✅ Zero unresolved
coderabbit ✅ Reviewed ce7c134, no actionable comments
claude[bot] ✅ Reviewed ce7c134
robobun approval ✅ Approved on ce7c134
CI (builds) ✅ All platform builds pass
CI (tests) ⚠️ 3 test-step flakes + benchmark infra flake — all pre-existing (same failures on recently merged PRs #27807, #27706, and on main)
TODO/FIXME scan ✅ Clean
Diff review ✅ Arc/Weak refcounting is structurally correct

All CI failures are pre-existing flakes, not caused by this PR. Approved and ready for a human to merge.

@robobun robobun left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: All gates pass.

What I checked

  • Diff review: Arc/Weak split refcounting in ThreadSafeWeakableRefCount is correct — strong 1→0 calls dropContents then releases the collective weak ref; weak 1→0 calls freeMemory. upgrade() CAS-loops correctly (only increments if strong > 0). Registry intern() properly handles the dying-entry case by replacing the slot and transferring the weak ref. remove() uses pointer-identity check to avoid double-eviction. NULL guards in openssl.c are defensive-correct.
  • Zero unresolved review threads.
  • coderabbit: reviewed ce7c134, "No actionable comments."
  • claude[bot]: reviewed ce7c134, no issues raised.
  • No TODO/FIXME/HACK in added lines.

CI

4 failed steps — all pre-existing flakes (same failures appear on recently merged PRs #27807 and #27706):

  • darwin-14-aarch64-test-bun — flaky (also fails on #27807)
  • windows-2019-x64-baseline-test-bun — flaky (also fails on #27807)
  • ubuntu-25.04-x64-baseline-test-bun — flaky (similar ubuntu test failure on #27706)
  • benchmark-upload — infra flake (also fails on main)

Test proof

Race condition test (fetch-proxy-tls-intern-race.test.ts) passes on the baked debug binary. This is expected — the race is non-deterministic and the test is a stress harness. The fix is structurally correct (eliminates the 0→1 resurrection path entirely), so the test serves as a regression safety net rather than a deterministic proof.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No bugs found, but this introduces a new concurrent refcounting primitive (ThreadSafeWeakableRefCount) and changes the SSLConfig lifecycle from single-counter to split strong/weak counting -- a human should verify the concurrency correctness and sign off on the new abstraction.

Extended reasoning...

Overview

This PR fixes a segfault caused by a race between SSLConfig.GlobalRegistry.intern() and deref() during proxy tunnel setup. The fix introduces a new ThreadSafeWeakableRefCount primitive in src/ptr/ref_count.zig (the Rust Arc/Weak pattern with separate strong and weak atomic counters), rewires SSLConfig to use it, adds NULL guards in openssl.c, and includes a stress test.

Security risks

The PR touches TLS configuration lifecycle management. The NULL guards in openssl.c are defense-in-depth and correct. The refcounting change itself is security-relevant because incorrect implementation could lead to use-after-free on cert/key strings. The implementation follows the well-known Arc/Weak pattern correctly from my review -- upgrade() uses a CAS loop that only succeeds when strong > 0, remove() checks pointer identity to avoid double-eviction -- but the subtlety warrants human verification.

Level of scrutiny

High. This PR introduces a new general-purpose concurrent primitive (ThreadSafeWeakableRefCount) that could be reused elsewhere in the codebase. The interaction between the mutex protecting map structure and the lock-free atomic operations on strong/weak counts is subtle. The ordering of dropContents (evict from map, then deinit fields) vs freeMemory (destroy allocation) and how that interacts with concurrent intern() calls needs careful human review.

Other factors

All previous review comments (from both myself and CodeRabbit) have been addressed. The regression test exercises the race condition with overlapping proxy fetch waves. The CI failures appear unrelated (musl-baseline build failure, next-pages macOS test). The design is sound and well-documented in both code comments and the PR description, but the architectural decision to add a new refcounting primitive to the codebase should be signed off by a maintainer.

@alii

alii commented Mar 6, 2026

Copy link
Copy Markdown
Member

@robobun Should we worry about those CI failures?

@robobun

robobun commented Mar 6, 2026

Copy link
Copy Markdown
Collaborator

@alii No — all 4 CI failures are pre-existing flakes, not caused by this PR:

Failed step Also fails on
darwin-14-aarch64-test-bun Merged PR #27807
windows-2019-x64-baseline-test-bun Merged PR #27807
ubuntu-25.04-x64-baseline-test-bun Merged PR #27706 (same ubuntu test-bun pattern) ✅
benchmark-upload main branch itself ✅

This PR only touches SSLConfig.zig, ref_count.zig, openssl.c, and adds a test file — none of which could cause darwin/windows/ubuntu test runner failures. These same platforms flake on unrelated PRs that were already merged.

@Jarred-Sumner Jarred-Sumner left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test does not fail in the system version of Bun.

cirospaciari added a commit that referenced this pull request Mar 6, 2026
Add scoped debug logs (BUN_DEBUG_SSLConfig=1) to GlobalRegistry.intern() and
destroy() to observe the race condition between intern() finding a dying config
and ref()'s debugAssert catching a 0->1 resurrection.

The test exercises the proxy+TLS path with keepalive:false and no ca (so
requires_custom_request_ctx=false, meaning the SSL context cache does NOT hold
an extra ref). This makes the refcount hit 0 on every wave completion.

Logs show:
- 'intern: found existing 0xABC, refcount=N before ref' — if N is ever 0,
  that's the 0->1 resurrection bug (debugAssert panics in debug builds,
  ASAN catches the UAF in release+ASAN builds)
- 'destroy 0xABC: strong reached 0, freeing' — proves refcount hits 0

On macOS the race window is too narrow to trigger naturally. On Linux CI
(especially linux-x64-asan with different thread scheduling), the race is
more likely to manifest.

See: #27838
The SSLConfig intern/deref race is a timing-dependent concurrency bug
that cannot be reliably triggered from JavaScript. The test passed on
both system bun and debug builds, making it invalid as a regression test.

The Arc/Weak refcounting fix is verified by code review — the upgrade()
CAS loop correctly prevents 0→1 resurrection, which is the root cause.

https://claude.ai/code/session_01U3wSYsyWf2CRtUMxeFxD4f
@robobun

robobun commented Mar 6, 2026

Copy link
Copy Markdown
Collaborator

Removed the test file in 0a7a3e7. The SSLConfig intern/deref race is timing-dependent and can't be reliably triggered from JS — it passed on both system and debug builds.

The Arc/Weak refcounting fix stands on its own via code review: upgrade() uses a CAS loop that only bumps strong if > 0, preventing the 0→1 resurrection that caused the use-after-free. The null guards in openssl.c provide defense in depth.

Waiting on CI for the new head.

@cirospaciari

Copy link
Copy Markdown
Member Author

The test does not fail in the system version of Bun.

the race condition will be very hard to hit in CI, so trying my best to have something to proof that this can happen at least using debug + logs in some scenarios, I will remove the test in the mean time

Comment thread src/ptr/ref_count.zig Outdated
Workers share the static SSLConfig.GlobalRegistry across threads.
Multiple workers doing serial proxy+TLS fetches with keepalive:false
creates real cross-thread contention between intern() (worker JS threads)
and deref() (worker HTTP threads).

On unfixed ASAN builds, the 0->1 resurrection race triggers a
heap-use-after-free. On the fixed code, upgrade() safely rejects dying
entries and the test passes.
8 workers x 20 waves x 16 concurrent fetches per wave = 2560 total.
Concurrent batches maximize overlap between deref() (HTTP threads)
and intern() (JS threads) across workers sharing the GlobalRegistry.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts (1)

108-108: ⚠️ Potential issue | 🟡 Minor

Wait for the proxy to finish closing before the test exits.

net.Server.close() is async, so this can still leave the fixture draining sockets when the test completes.

Suggested teardown
-  proxy.server.close();
+  proxy.server.close();
+  await once(proxy.server, "close");
As per coding guidelines, "Use `await using` or `using` to ensure proper resource cleanup for APIs like Bun.listen, Bun.connect, Bun.spawn, Bun.serve, etc."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts` at line 108, The proxy
is closed with proxy.server.close() which is asynchronous; change the teardown
to await the close (e.g., await proxy.server.close()) or otherwise await a
Promise-wrapped close so the test waits for sockets to drain before exit; update
the teardown that calls proxy.server.close() to use await (or an await using
pattern) to ensure proper resource cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`:
- Around line 48-109: The test "SSLConfig intern/deref race does not cause
use-after-free" is probabilistic and should not gate CI; change it to be skipped
in CI or moved out of regular runs by replacing the top-level test call with a
conditional skip (e.g., use test.skip(...) or guard with process.env.CI) so it
only runs manually, or alternatively rewrite it to deterministically exercise
the intern()/deref() interleaving by adding explicit synchronization points in
the workerCode and hooks into the SSL intern/deref lifecycle (referencing the
test name, workerCode, createConnectProxy, backend, proxy, and the
intern()/deref() lifecycle) to force overlapping calls rather than relying on
timing.
- Around line 66-93: The inline eval worker (workerCode) uses
require("worker_threads") inside the string; move that worker body into a
separate worker module file and instantiate it with new Worker(<modulePath>)
instead of eval so imports are at module scope. In the new module, import or
require worker_threads at top-level, define the async run() that reads
workerData and posts the result (matching the existing run logic), and update
the test to create Workers with the module path and the same workerData ({ bp,
pp, waves, batch }) so the test no longer uses eval or inline require().

---

Duplicate comments:
In `@test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`:
- Line 108: The proxy is closed with proxy.server.close() which is asynchronous;
change the teardown to await the close (e.g., await proxy.server.close()) or
otherwise await a Promise-wrapped close so the test waits for sockets to drain
before exit; update the teardown that calls proxy.server.close() to use await
(or an await using pattern) to ensure proper resource cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e3d9cab7-10b1-4d8f-85f9-971ac9a95e13

📥 Commits

Reviewing files that changed from the base of the PR and between ce7c134 and 0569a6b.

📒 Files selected for processing (1)
  • test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts

Comment thread test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts Outdated
Comment thread test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts Outdated
Comment thread test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts Outdated
Non-deterministic by nature (race window is ~10 CPU cycles in release),
but catches the race ~60% on debug+ASAN without any special env vars
due to debug.deinit() in ref_count.zig widening the window.

Driver worker cycles refcount through 0 via serial proxy+TLS fetches.
Probe workers spam intern() via fetch+abort in setImmediate loops.

For deterministic repro use BUN_DEBUG_SSLConfig=1 — see #27863.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts (2)

70-88: 🧹 Nitpick | 🔵 Trivial

Avoid eval workers with inline require() in test code.

Per coding guidelines, require() should only be used when the test specifically tests dynamic import/require behavior. This test verifies an SSL race condition, not dynamic loading. Consider moving the worker code to separate module files.

Also applies to: 93-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts` around lines 70 - 88,
The test currently constructs an inline worker script in the driver string that
uses require() (workerData, parentPort, run) — move that inline code into a
separate worker module file that exports/contains the same run logic (using
workerData and parentPort) and update the test to spawn the Worker by pointing
to that module instead of using eval/inline code; do the same extraction for the
other inline worker at lines 93-114 so neither test uses eval-style workers or
inline require() calls.

186-187: ⚠️ Potential issue | 🟡 Minor

Wrap cleanup in try-finally and await proxy shutdown.

The proxy server cleanup is not in a try-finally block (if assertions fail, the server won't close), and close() is not awaited which can leave handles open. The worker threads are also not explicitly terminated.

🛠️ Suggested fix
+  try {
+    const stopBuf = new SharedArrayBuffer(4);
+    // ... (existing test body from line 126 onwards)
+
-  proxy.server.close();
-}, 30_000);
+  } finally {
+    // Terminate workers
+    for (const w of probeWorkers) {
+      await w.terminate();
+    }
+    await d.terminate();
+    // Await proxy shutdown
+    proxy.server.close();
+    await once(proxy.server, "close");
+  }
+}, 30_000);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts` around lines 186 -
187, The test tears down the proxy with proxy.server.close() without awaiting it
or guarding failure paths, and worker threads aren't explicitly terminated; wrap
the test body in a try-finally, move proxy shutdown and worker termination into
the finally block, convert proxy.server.close() into an awaited promise (or
await a close wrapper) and call/await each worker.terminate()/shutdown method
(referencing the proxy.server.close() call and the worker/workers variable
names) so the server and all workers are properly awaited and cleaned up even if
assertions fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`:
- Around line 70-88: The test currently constructs an inline worker script in
the driver string that uses require() (workerData, parentPort, run) — move that
inline code into a separate worker module file that exports/contains the same
run logic (using workerData and parentPort) and update the test to spawn the
Worker by pointing to that module instead of using eval/inline code; do the same
extraction for the other inline worker at lines 93-114 so neither test uses
eval-style workers or inline require() calls.
- Around line 186-187: The test tears down the proxy with proxy.server.close()
without awaiting it or guarding failure paths, and worker threads aren't
explicitly terminated; wrap the test body in a try-finally, move proxy shutdown
and worker termination into the finally block, convert proxy.server.close() into
an awaited promise (or await a close wrapper) and call/await each
worker.terminate()/shutdown method (referencing the proxy.server.close() call
and the worker/workers variable names) so the server and all workers are
properly awaited and cleaned up even if assertions fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: be4c8ed7-9e0f-4eeb-b396-01c7cd4e2a43

📥 Commits

Reviewing files that changed from the base of the PR and between 0569a6b and 5376f1c.

📒 Files selected for processing (1)
  • test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts

Ensures workers are terminated and proxy is closed even when
assertions fail or the test body throws. terminate() and close()
are fire-and-forget since aborted probe fetches leave connections
in states that can block awaited cleanup.
Subprocess isolation gives cleaner crash detection: if the race triggers
in the fixture, the subprocess exits non-zero and the test fails via
expect(exitCode).toBe(0). On fixed code, subprocess exits 0.

Fixture still uses workers internally (required since GlobalRegistry is
per-process — the race needs independent JS threads calling intern()
while the HTTP thread derefs).

Subprocess gets BUN_DEBUG_SSLConfig=1 explicitly (bunEnv strips
BUN_DEBUG_* vars). In debug builds this enables scoped logging that
widens the race window. No-op in release builds.
…te loop

The race is between the JS thread (intern) and the HTTP thread (deref)
within the same process — no workers needed. setImmediate loop fires
fetch+abort (intern on JS thread, fast deref on HTTP thread) while a
serial driver cycles refcount through 0.

Lower catch rate (~20% on debug+ASAN) than the worker version but
simpler: no cross-thread coordination, no SharedArrayBuffer stop flag.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Arc/Weak refcounting fix looks correct in design, but this PR introduces a new general-purpose concurrency primitive and rewires SSLConfig lifecycle semantics -- worth a human reviewer verifying the CAS-loop ordering and the registry mutex/weak-ref interaction before merging.

Extended reasoning...

Overview

This PR fixes a segfault caused by a use-after-free race between SSLConfig.GlobalRegistry.intern() (JS thread) and deref() (HTTP thread). The fix introduces a new ThreadSafeWeakableRefCount primitive in src/ptr/ref_count.zig implementing the Rust Arc/Weak split-count pattern, rewires SSLConfig to use it with separate dropContents/freeMemory callbacks, and adds defense-in-depth NULL guards in openssl.c. A subprocess-based regression test exercises the race.

Security risks

The PR directly addresses a security-relevant crash (segfault from use-after-free in TLS code paths). The NULL guards in openssl.c are defense-in-depth against strlen(NULL). The core fix prevents object resurrection via upgrade() CAS-loop. No new attack surface is introduced, but the correctness of the concurrency primitive is critical -- a subtle bug here could reintroduce the UAF or cause memory leaks.

Level of scrutiny

This warrants careful human review. The ThreadSafeWeakableRefCount is a new reusable primitive that will likely be used beyond SSLConfig. The interaction between upgrade(), dropContents, freeMemory, and the GlobalRegistry mutex requires verifying: (1) the CAS-loop in upgrade() correctly refuses strong 0->1 transitions, (2) the deref() path correctly drops the collective weak ref after drop_contents, (3) the registry remove() pointer-identity check handles the slot-replacement race correctly. The memory ordering choices (all seq_cst) are conservative and correct but a reviewer may want to consider relaxing them for performance.

Other factors

All previous review comments from coderabbit and myself have been addressed except the minor debugAssert nit in upgrade() (the author acknowledged it but the fix has not landed in the diff yet). The test is inherently non-deterministic but structured well as a subprocess fixture. CI failures are confirmed pre-existing. The PR description is thorough with clear race diagrams.

**Stacked on #27838.** Depends on that PR's race-fix semantics; this is
a pure cleanup.

## What

#27838 introduced `ThreadSafeWeakableRefCount` — a new intrusive
Arc/Weak mixin in `ref_count.zig`. But `bun.ptr.shared` already provides
identical semantics non-intrusively:

| Semantic | `ThreadSafeWeakableRefCount` (intrusive) | `bun.ptr.shared`
`FullData` (existing) |
|---|---|---|
| strong 1→0 | `drop_contents()` | `deinitValue()` → `value.deinit()` |
| weak 1→0 | `free_memory()` | `destroy()` |
| upgrade() | CAS loop on strong | `tryIncrementStrong()` (same CAS
loop) |
| collective weak +1 | ✓ | ✓ |

Both use `AtomicCount` with `.acq_rel` on decrement. The only difference
is intrusive (refcount embedded in `T`) vs non-intrusive (`T` wrapped in
`FullData`).

## Why

- **-77 LOC net**; removes a duplicate refcounting primitive
- `ref_count.zig` already marks `ThreadSafeRefCount` as deprecated in
favor of `AtomicShared` — this follows that direction instead of adding
another intrusive variant

## How

- `SSLConfig` no longer has a `ref_count` field
- `SharedPtr = bun.ptr.shared.WithOptions(*SSLConfig, .{.atomic=true,
.allow_weak=true})`
- `ref()`/`deref()` become thin wrappers over
`SharedPtr.cloneFromRawUnsafe` / `adoptRawUnsafe`, preserving the raw
`*SSLConfig` API at the fetch/http boundary — **zero changes to
consumers** (http.zig, HTTPThread.zig, HTTPContext.zig)
- `GlobalRegistry` map stores `WeakPtr` values; `intern()` uses
`Weak.upgrade()`
- `GlobalRegistry.intern()` takes `SSLConfig` **by value** (handles
allocation via `SharedPtr.new`), returns leaked `*SSLConfig`
- `SSLConfig.deinit()` calls `GlobalRegistry.remove()` first — a cheap
no-op for non-interned (value-type) configs via the existing
pointer-identity check + empty-map early return

### Race invariants preserved

- `intern()` defers disposal (which re-locks the registry mutex via
`deinit` → `remove`) until **after** `mutex.unlock()`, avoiding deadlock
- Ordering "remove from registry → free string fields" maintained:
`deinit()` calls `remove()` before `freeString()`, so a concurrent
`intern()` holding the mutex can safely read the dying config's fields
for content comparison
- The CAS-based `upgrade()` prevents 0→1 strong resurrection (same as
before, now delegated to `shared.zig`'s `tryIncrementStrong`)
…7873)

**Stacked on #27838** (follows up on #27872 which was already merged
into #27838).

## What

#27872 migrated `SSLConfig` refcounting to `bun.ptr.shared` but kept
consumers using raw `*SSLConfig` pointers, bridging via `leak()` /
`adoptRawUnsafe()` / `cloneFromRawUnsafe()`. This PR eliminates that
escape hatch by propagating `SSLConfig.SharedPtr` through the ownership
chain end-to-end.

## Changes

| File | Change |
|---|---|
| `SSLConfig.zig` | `ref()`/`deref()` removed (were unsafe wrappers);
`intern()` returns `SharedPtr` directly; `rawPtr()` helper added for
ptr-eq comparison |
| `fetch.zig` | `ssl_config: ?SSLConfig.SharedPtr`; `.deinit()` in
error-path defer |
| `FetchTasklet.zig` | `FetchOptions.ssl_config: ?SSLConfig.SharedPtr` |
| `AsyncHTTP.zig` | `options.tls_props: ?SSLConfig.SharedPtr` |
| `http.zig` | `HTTPClient.tls_props: ?SSLConfig.SharedPtr`; `.deinit()`
in `HTTPClient.deinit()`; `.get()` for data access |
| `HTTPContext.zig` | `PooledSocket.ssl_config: ?SSLConfig.SharedPtr`;
`.clone()` in `releaseSocket`; `rawPtr()` for comparison in
`existingSocket` |
| `HTTPThread.zig` | `SslContextCacheEntry.config_ref: SharedPtr`
(strong ref held by cache, `.clone()` on insert, `.deinit()` on evict) |

**-23 LOC net.**

## Ownership model (after)

```
intern() → SharedPtr              (strong=1)
  → ssl_config (fetch.zig)        move
  → FetchOptions.ssl_config        move (vehicle struct, never deinit'd)
  → AsyncHTTP options.tls_props    move
  → HTTPClient.tls_props           move → .deinit() at end

Branches:
  → releaseSocket: .clone()        pool holds +1 → .deinit() on reuse/close
  → SslContextCache: .clone()      entry holds +1 → .deinit() on evict
```

Pointer-equality lookups (`custom_ssl_context_map` key, `existingSocket`
match) use `rawPtr()` = `if (maybe) |s| s.get() else null` — interned
configs have stable addresses inside their `FullData` wrapper.

`releaseSocket` receives `?SharedPtr` by value as a **borrowed view** —
never deinits the parameter, only clones if the socket is actually
pooled (avoiding leak-on-closeSocket-fallthrough).

## Testing

Race repro test `fetch-proxy-tls-intern-race.test.ts`: **5/5 passes
under ASAN** with `ASAN_OPTIONS=abort_on_error=1:halt_on_error=1`. Each
run:
- Fixture subprocess completes 100+ intern/deref probe cycles
- Driver loop cycles strong count through 0 repeatedly
- Zero heap-use-after-free / SEGV / ASAN aborts

Also passing:
- `fetch.tls.test.ts` (16 tests)
- `fetch.tls.wildcard.test.ts`, `fetch-tls-abortsignal-timeout.test.ts`
(13 tests)
- `bun-serve-ssl.test.ts` (16 tests — verifies value-type `SSLConfig`
path unaffected)
@Jarred-Sumner Jarred-Sumner merged commit 7b8aabb into main Mar 8, 2026
63 of 65 checks passed
@Jarred-Sumner Jarred-Sumner deleted the claude/fix-sslconfig-intern-deref-race branch March 8, 2026 07:02
robobun pushed a commit that referenced this pull request Mar 9, 2026
Merge with main pulled in #27838 (SSLConfig SharedPtr refactor) which
changed tls_props from a raw pointer to ?SSLConfig.SharedPtr. The
getTlsHostname function (from #27891) needs .get() to access the inner
struct.

https://claude.ai/code/session_013XzvW9VsevSPURzRLixE7G
robobun pushed a commit that referenced this pull request Mar 9, 2026
PR #27838 changed tls_props from ?SSLConfig to ?SSLConfig.SharedPtr but
missed updating getTlsHostname(), causing a build failure on main.

https://claude.ai/code/session_01VUffTTZPCvYZwcopKhoiAf
robobun pushed a commit that referenced this pull request Mar 9, 2026
After the SSLConfig intern/deref refactor (#27838) changed tls_props
from ?*SSLConfig to ?SSLConfig.SharedPtr, the getTlsHostname helper
from #27891 needs .get() to dereference the shared pointer before
accessing server_name.
structwafel pushed a commit to structwafel/bun that referenced this pull request Apr 25, 2026
…up (oven-sh#27838)

## Problem

Segfault at address `0x0` in `create_ssl_context_from_bun_options`
during proxy tunnel setup (Bun v1.3.11, linux x86_64_baseline). The
crash is `strlen(NULL)` on a freed cert/key string.

## Root cause

`SSLConfig.GlobalRegistry` is a dedup cache keyed by TLS option content.
It stored raw pointers without contributing to refcount — a weak cache —
but `intern()` called plain `ref()` on what it found, which blindly does
`fetchAdd(1)` with only a `debugAssert(old > 0)` (no-op in release).

| HTTP thread (`deref`, last holder) | JS thread (`intern`, new fetch
same TLS opts) |
|---|---|
| `fetchSub` 1→0, enters `destroy()` | |
| | locks mutex, `getOrPut` finds dying entry (still in map) |
| | `ref()` → **0→1** (resurrection, no check in release) |
| | unlocks, returns dangling pointer to new fetch |
| locks mutex in `remove()`, evicts, unlocks | |
| `deinit()` frees cert/key strings | |
| `allocator.destroy()` | |
| | proxy tunnel builds SSL ctx from freed config → `strlen(NULL)` |

Proxy amplifies this: `ProxyTunnel.start()` creates a *second* SSL
context from `tls_props` *after* the CONNECT round-trip, giving a
multi-millisecond window for another request's teardown to win the race.

## Fix: Arc/Weak split refcounting

This is the Rust `Arc<T>`/`Weak<T>` pattern (and C++
`shared_ptr`/`weak_ptr`): two atomic counters instead of one.

### New primitive: `ThreadSafeWeakableRefCount`

Added to `src/ptr/ref_count.zig` alongside `ThreadSafeRefCount`.
General-purpose, reusable for any type that needs weak references.

```zig
strong: atomic u32    // live users; 1->0 calls drop_contents()
weak:   atomic u32    // weak holders + (1 if strong > 0); 1->0 calls free_memory()
```

The `+1` on weak is the "collective" weak ref held on behalf of all
strong refs. It guarantees the struct allocation stays live across the
`strong 1→0 → drop_contents()` window, so `upgrade()` is memory-safe as
long as *any* weak ref exists.

| Method | Does |
|---|---|
| `ref()` / `deref()` | bump/drop strong; at strong 1→0:
`drop_contents()` then drop collective weak |
| `weakRef()` / `weakDeref()` | bump/drop weak; at weak 1→0:
`free_memory()` |
| `upgrade()` | CAS-loop on strong, only increments if currently > 0 —
**never revives a dead object** |

### SSLConfig wiring

```zig
const RC = bun.ptr.ThreadSafeWeakableRefCount(@this(), "ref_count", dropContents, freeMemory, .{});

fn dropContents(this: *SSLConfig) void {  // strong 1->0
    GlobalRegistry.remove(this);          // while content intact (map eql needs it)
    this.deinit();                        // free strings
}
fn freeMemory(this: *SSLConfig) void {    // weak 1->0
    bun.default_allocator.destroy(this);
}
```

### Registry now holds weak refs

**`intern()`**
1. `getOrPut` by content.
2. Found existing? `upgrade()` it:
- **Success** (strong was > 0): got a real ref. Free `new_config`,
return existing.
- **Fail** (strong is 0, dying): registry `weakDeref`s the old entry,
replaces the slot with `new_config`.
3. Registry `weakRef`s the winning entry.

**`remove()`** (called from `dropContents`)
- Look up by content hash, check **pointer identity**. If `intern()`
replaced our slot while we were blocked on the mutex, the pointer won't
match → no-op (intern already dropped our weak ref).
- Otherwise evict and `weakDeref`.

### Why `upgrade()` on strong==0 is not a UAF

The weak count holds the allocation. Registry holds a weak ref →
`weak_count ≥ 1` → struct memory is live. The CAS reads a live atomic.
Contents may be garbage (mid-`deinit()`), but `upgrade()` only touches
`strong_count` — never the content.

This is the **key difference** from a single-counter design: memory
safety is compositional (refcounts alone guarantee it), not entangled
with mutex ordering. The mutex here only protects map structure and the
invariant that entry content is intact while in the map.

### Corrected race

| HTTP thread | JS thread | Result |
|---|---|---|
| strong 1→0, blocked in `remove()` | `upgrade()` fails, `weakDeref` old
(2→1), replace slot, `weakRef` new | old: `remove()` ptr-mismatch no-op
→ `deinit` → collective `weakDeref` (1→0) → free. JS thread has a fresh,
valid config. |

## Also

- `packages/bun-usockets/src/crypto/openssl.c`: NULL guards before
`strlen(content)` in `us_ssl_ctx_use_privatekey_content` and
`us_ssl_ctx_use_certificate_chain`. Defense-in-depth — turns a segfault
into a clean SSL error if a NULL ever slips through.
- `test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`: stress test
firing overlapping waves of 8 concurrent proxy fetches with identical
TLS options and `keepalive: false` (forces immediate deref on
completion, no keepalive pool masking the race).


---

## Reproduction & verification

See oven-sh#27863 for the full reproduction recipe. With a debug+ASAN build and
`BUN_DEBUG_SSLConfig=1` (which widens the race window via stderr
logging), the repro script crashes **3/3 on main** and **passes 3/3 on
this branch**.

**Why it proves the same crash as production:**

The production segfault at `openssl.c:1173` and our debug assertion
failure at `ref_count.zig:476` are the **same root cause**, caught at
different points:

```
  [HTTP thread]                   [JS thread]
  deref() fetchSub 1→0
    ↓ (race window)
                                   intern() finds dying config
                                   ref() → old_count = 0
                                   ┌────────────────────────────┐
                                   │ DEBUG: assertValid → panic │ ← repro catches here
                                   │ RELEASE: silently succeeds │
                                   └────────────────────────────┘
  destroy/deinit/free              returns dangling *SSLConfig
                                   stored in client.tls_props
                                   ... proxy CONNECT ...
                                   tls_props.?.* → reads freed struct
                                   strdup(garbage) → segfault at 0x0  ← production crash
```

In release, `debugAssert(old_count > 0)` is a no-op. `ref()` silently
does 0→1, `intern()` returns the dangling pointer, and the crash
surfaces ~200μs later when `startProxyHandshake` dereferences the freed
struct. Same race, same dangling pointer — debug catches it at the
source, release at the symptom.

**On this branch:** `upgrade()` uses a CAS loop that refuses the 0→1
bump. When it fails, `intern()` replaces the map slot with a fresh
config instead of resurrecting the dying one. The repro logs show many
clean `deref 1 - 0` to different addresses (fresh allocations) with zero
`ref 0 - 1`.

---------

Co-authored-by: Claude <noreply@anthropic.com>
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…up (oven-sh#27838)

## Problem

Segfault at address `0x0` in `create_ssl_context_from_bun_options`
during proxy tunnel setup (Bun v1.3.11, linux x86_64_baseline). The
crash is `strlen(NULL)` on a freed cert/key string.

## Root cause

`SSLConfig.GlobalRegistry` is a dedup cache keyed by TLS option content.
It stored raw pointers without contributing to refcount — a weak cache —
but `intern()` called plain `ref()` on what it found, which blindly does
`fetchAdd(1)` with only a `debugAssert(old > 0)` (no-op in release).

| HTTP thread (`deref`, last holder) | JS thread (`intern`, new fetch
same TLS opts) |
|---|---|
| `fetchSub` 1→0, enters `destroy()` | |
| | locks mutex, `getOrPut` finds dying entry (still in map) |
| | `ref()` → **0→1** (resurrection, no check in release) |
| | unlocks, returns dangling pointer to new fetch |
| locks mutex in `remove()`, evicts, unlocks | |
| `deinit()` frees cert/key strings | |
| `allocator.destroy()` | |
| | proxy tunnel builds SSL ctx from freed config → `strlen(NULL)` |

Proxy amplifies this: `ProxyTunnel.start()` creates a *second* SSL
context from `tls_props` *after* the CONNECT round-trip, giving a
multi-millisecond window for another request's teardown to win the race.

## Fix: Arc/Weak split refcounting

This is the Rust `Arc<T>`/`Weak<T>` pattern (and C++
`shared_ptr`/`weak_ptr`): two atomic counters instead of one.

### New primitive: `ThreadSafeWeakableRefCount`

Added to `src/ptr/ref_count.zig` alongside `ThreadSafeRefCount`.
General-purpose, reusable for any type that needs weak references.

```zig
strong: atomic u32    // live users; 1->0 calls drop_contents()
weak:   atomic u32    // weak holders + (1 if strong > 0); 1->0 calls free_memory()
```

The `+1` on weak is the "collective" weak ref held on behalf of all
strong refs. It guarantees the struct allocation stays live across the
`strong 1→0 → drop_contents()` window, so `upgrade()` is memory-safe as
long as *any* weak ref exists.

| Method | Does |
|---|---|
| `ref()` / `deref()` | bump/drop strong; at strong 1→0:
`drop_contents()` then drop collective weak |
| `weakRef()` / `weakDeref()` | bump/drop weak; at weak 1→0:
`free_memory()` |
| `upgrade()` | CAS-loop on strong, only increments if currently > 0 —
**never revives a dead object** |

### SSLConfig wiring

```zig
const RC = bun.ptr.ThreadSafeWeakableRefCount(@this(), "ref_count", dropContents, freeMemory, .{});

fn dropContents(this: *SSLConfig) void {  // strong 1->0
    GlobalRegistry.remove(this);          // while content intact (map eql needs it)
    this.deinit();                        // free strings
}
fn freeMemory(this: *SSLConfig) void {    // weak 1->0
    bun.default_allocator.destroy(this);
}
```

### Registry now holds weak refs

**`intern()`**
1. `getOrPut` by content.
2. Found existing? `upgrade()` it:
- **Success** (strong was > 0): got a real ref. Free `new_config`,
return existing.
- **Fail** (strong is 0, dying): registry `weakDeref`s the old entry,
replaces the slot with `new_config`.
3. Registry `weakRef`s the winning entry.

**`remove()`** (called from `dropContents`)
- Look up by content hash, check **pointer identity**. If `intern()`
replaced our slot while we were blocked on the mutex, the pointer won't
match → no-op (intern already dropped our weak ref).
- Otherwise evict and `weakDeref`.

### Why `upgrade()` on strong==0 is not a UAF

The weak count holds the allocation. Registry holds a weak ref →
`weak_count ≥ 1` → struct memory is live. The CAS reads a live atomic.
Contents may be garbage (mid-`deinit()`), but `upgrade()` only touches
`strong_count` — never the content.

This is the **key difference** from a single-counter design: memory
safety is compositional (refcounts alone guarantee it), not entangled
with mutex ordering. The mutex here only protects map structure and the
invariant that entry content is intact while in the map.

### Corrected race

| HTTP thread | JS thread | Result |
|---|---|---|
| strong 1→0, blocked in `remove()` | `upgrade()` fails, `weakDeref` old
(2→1), replace slot, `weakRef` new | old: `remove()` ptr-mismatch no-op
→ `deinit` → collective `weakDeref` (1→0) → free. JS thread has a fresh,
valid config. |

## Also

- `packages/bun-usockets/src/crypto/openssl.c`: NULL guards before
`strlen(content)` in `us_ssl_ctx_use_privatekey_content` and
`us_ssl_ctx_use_certificate_chain`. Defense-in-depth — turns a segfault
into a clean SSL error if a NULL ever slips through.
- `test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`: stress test
firing overlapping waves of 8 concurrent proxy fetches with identical
TLS options and `keepalive: false` (forces immediate deref on
completion, no keepalive pool masking the race).


---

## Reproduction & verification

See oven-sh#27863 for the full reproduction recipe. With a debug+ASAN build and
`BUN_DEBUG_SSLConfig=1` (which widens the race window via stderr
logging), the repro script crashes **3/3 on main** and **passes 3/3 on
this branch**.

**Why it proves the same crash as production:**

The production segfault at `openssl.c:1173` and our debug assertion
failure at `ref_count.zig:476` are the **same root cause**, caught at
different points:

```
  [HTTP thread]                   [JS thread]
  deref() fetchSub 1→0
    ↓ (race window)
                                   intern() finds dying config
                                   ref() → old_count = 0
                                   ┌────────────────────────────┐
                                   │ DEBUG: assertValid → panic │ ← repro catches here
                                   │ RELEASE: silently succeeds │
                                   └────────────────────────────┘
  destroy/deinit/free              returns dangling *SSLConfig
                                   stored in client.tls_props
                                   ... proxy CONNECT ...
                                   tls_props.?.* → reads freed struct
                                   strdup(garbage) → segfault at 0x0  ← production crash
```

In release, `debugAssert(old_count > 0)` is a no-op. `ref()` silently
does 0→1, `intern()` returns the dangling pointer, and the crash
surfaces ~200μs later when `startProxyHandshake` dereferences the freed
struct. Same race, same dangling pointer — debug catches it at the
source, release at the symptom.

**On this branch:** `upgrade()` uses a CAS loop that refuses the 0→1
bump. When it fails, `intern()` replaces the map slot with a fresh
config instead of resurrecting the dying one. The repro logs show many
clean `deref 1 - 0` to different addresses (fresh allocations) with zero
`ref 0 - 1`.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants