Skip to content

prove: SSLConfig intern/deref race observability (do NOT merge)#27863

Closed
cirospaciari wants to merge 8 commits into
mainfrom
claude/prove-sslconfig-uaf-race
Closed

prove: SSLConfig intern/deref race observability (do NOT merge)#27863
cirospaciari wants to merge 8 commits into
mainfrom
claude/prove-sslconfig-uaf-race

Conversation

@cirospaciari

@cirospaciari cirospaciari commented Mar 6, 2026

Copy link
Copy Markdown
Member

Proof-of-concept branch demonstrating the SSLConfig intern()/deref() race fixed in #27838. Reliably reproduces the crash 3/3 on debug builds.

The race

SSLConfig.GlobalRegistry is a weak-dedup cache shared across threads. deref() (HTTP thread) and intern() (JS thread) can race:

HTTP thread (last ref holder) JS thread (new fetch)
deref()fetchSub makes count 1→0
enters destroy() (race window open) intern()mutex.lock() (wins mutex)
getOrPut finds dying config in map
existing.ref()count 0→1 (UAF)
returns dangling pointer
remove()mutex.lock() (now wins)
deinit() → frees fields, destroy(this)
(memory freed) ...proxy CONNECT completes... this.tls_props.?.* → segfault reading freed memory

debugAssert(old_count > 0) at ref_count.zig:246 is a no-op in release builds — the bump 0→1 succeeds silently.

Crash stack from production (v1.3.11 linux x86_64_baseline):

openssl.c:1173 — create_ssl_context_from_bun_options → strlen(NULL)
SocketContext.zig:245 — SSLWrapper init
ProxyTunnel.zig:277 — startProxyHandshake
http.zig:1609 — handleOnDataHeaders

Reproduction recipe (macOS/Linux)

1. Build bun with ASAN (debug)

cmake -B build/asan -G Ninja -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON
ninja -C build/asan bun-debug

This downloads the prebuilt ASAN WebKit (bun-webkit-{os}-{arch}-debug-asan.tar.gz) — available for macOS arm64 and Linux x64/arm64.

2. Run the repro script

BUN_DEBUG_QUIET_LOGS=1 BUN_DEBUG_SSLConfig=1 ./build/asan/bun-debug run test/js/web/fetch/sslconfig-race-repro.ts

Expected output (on this branch, unfixed code)

[repro] backend=... proxy=...
[sslconfig] 0x304020380 deref 1 - 0
[sslconfig] destroy 0x304020380: strong reached 0, freeing
[sslconfig] 0x304020380   ref 0 - 1                          ← smoking gun
panic: reached unreachable code
ref_count.zig:476 — bun.assert(debug.magic == .valid) in assertValid
ref_count.zig:237 — count.debug.assertValid() in ref
SSLConfig.zig:313 — existing.ref() in intern
fetch.zig:471    — GlobalRegistry.intern

Exit code 133 (SIGTRAP from assert). Reproduced 3/3 runs.

Expected output on #27838 (fixed)

[repro] backend=... proxy=...
[repro] results: [ 2000, 500000+, 500000+ ] (if you see this, race didn't trigger)

Exit code 0. No ref 0 - 1 anywhere in the log. upgrade() refuses the 0→1 CAS and intern replaces the slot with a fresh config instead.


Why BUN_DEBUG_SSLConfig=1 is required

In release builds, the race window between fetchSub(1→0) and mutex.lock() is ~5-10 CPU cycles — too narrow to hit deterministically.

In debug builds with scoped logging enabled, the deref() path does:

  1. fetchSub(1) → count goes to 0
  2. scope.log("0x{x} deref 1 - 0")stderr write, ~100μs (ref_count.zig:253-258)
  3. count.debug.deinit() → frees tracking hashmaps (ref_count.zig:262-264)
  4. destroy(self)log("destroy: strong reached 0")another stderr write (SSLConfig.zig:275)
  5. GlobalRegistry.remove()mutex.lock()

Steps 2-4 widen the window to ~100-200μs. That's enough for a probe worker's setImmediate-driven intern loop to slip in.


Changes in this branch

  • src/bun.js/api/server/SSLConfig.zig — added debug logs (observability only, no behavior change)
  • test/js/web/fetch/sslconfig-race-repro.ts — standalone reproduction script

Why the debug repro proves the same crash as production

Production crash (release v1.3.11, linux x86_64_baseline):

Segmentation fault at address 0x00000000
openssl.c:1173       — create_ssl_context_from_bun_options → strdup(options.passphrase)
SocketContext.zig:245 — SSLWrapper init
ProxyTunnel.zig:277  — startProxyHandshake
http.zig:1609        — handleOnDataHeaders

Caught on the HTTP thread, when the dangling tls_props is dereferenced.

Our debug repro:

panic: reached unreachable code
ref_count.zig:476 — bun.assert(debug.magic == .valid) in assertValid
ref_count.zig:237 — count.debug.assertValid() in ref
SSLConfig.zig:313 — existing.ref() in intern
fetch.zig:471    — GlobalRegistry.intern

Caught on a worker JS thread, at the moment ref() is called on a dying config.

Same causal chain, different catch point:

  [HTTP thread]                   [JS thread]
  deref() fetchSub 1→0
    ↓ (race window open)
                                   intern() finds dying config in map
                                   existing.ref() → old_count = 0
                                   ┌────────────────────────────────────┐
                                   │ DEBUG: debugAssert + assertValid   │
                                   │        → panic  ← OUR REPRO        │
                                   ├────────────────────────────────────┤
                                   │ RELEASE: both are no-ops           │
                                   │        → ref() 0→1 silently succeeds│
                                   └────────────────────────────────────┘
  destroy() → deinit()             intern() returns dangling *SSLConfig
    frees passphrase/cert/key      stored in client.tls_props
  allocator.destroy(this)
    ↓                              ... proxy CONNECT round-trip ...
  (memory freed/reused)
                                   startProxyHandshake (http.zig:1457):
                                     ssl_options = this.tls_props.?.*
                                     ← READS FREED STRUCT
                                     ↓
                                   SSLWrapper.init → create_ssl_context_from_bun_options
                                     strdup(options.passphrase)
                                     ← passphrase slice is garbage/NULL
                                     → segfault at 0x0  ← PRODUCTION CRASH

In release builds, debugAssert(old_count > 0) (ref_count.zig:246) is a no-op, and the debug.magic check doesn't exist (enable_debug = false). So ref() silently bumps 0→1 and intern() returns the dangling pointer. The crash surfaces much later when startProxyHandshake dereferences this.tls_props.?.* — freed memory that mimalloc may have reused, giving garbage/NULL pointers to strdup.

In debug builds, debug.deinit() (ref_count.zig:518) sets debug.magic = undefined right after fetchSub(1→0). The next ref()'s assertValid() catches it. We crash at the source of the bug, not the symptom.

The same dangling pointer, same config, same race. Debug has better instrumentation that catches it at the moment the invariant ("refs only on live objects") is violated, ~200μs before the production symptom would appear.

Add scoped debug logs (BUN_DEBUG_SSLConfig=1) to GlobalRegistry.intern() and
destroy() to observe the race condition between intern() finding a dying config
and ref()'s debugAssert catching a 0->1 resurrection.

The test exercises the proxy+TLS path with keepalive:false and no ca (so
requires_custom_request_ctx=false, meaning the SSL context cache does NOT hold
an extra ref). This makes the refcount hit 0 on every wave completion.

Logs show:
- 'intern: found existing 0xABC, refcount=N before ref' — if N is ever 0,
  that's the 0->1 resurrection bug (debugAssert panics in debug builds,
  ASAN catches the UAF in release+ASAN builds)
- 'destroy 0xABC: strong reached 0, freeing' — proves refcount hits 0

On macOS the race window is too narrow to trigger naturally. On Linux CI
(especially linux-x64-asan with different thread scheduling), the race is
more likely to manifest.

See: #27838
@robobun

robobun commented Mar 6, 2026

Copy link
Copy Markdown
Collaborator
Updated 5:06 PM PT - Mar 6th, 2026

@cirospaciari, your commit f42fa72 has 2 failures in Build #38772 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 27863

That installs a local version of the PR into your bun-27863 executable, so you can run:

bun-27863 --bun

autofix-ci Bot and others added 6 commits March 6, 2026 20:16
Standalone C program that reproduces the exact same race pattern as
SSLConfig.GlobalRegistry — shared map protected by mutex, atomic refcount,
deref→destroy without checking if intern() just resurrected the entry.

Build and run with ASAN:
  cc -fsanitize=address -pthread -O1 -o race_repro sslconfig-race-repro.c
  ./race_repro

ASAN reports heap-use-after-free on EVERY run:
  Thread T1: READ of interned->data (like ProxyTunnel reading tls_props)
  freed by Thread T2: config_destroy (like SSLConfig.destroy)

The race: thread A does fetchSub(1)→0→destroy()→free, while thread B
does intern()→find existing→ref() 0→1→returns dangling pointer.
This is the exact pattern causing the segfault at 0x0 in production.
Workers share the static SSLConfig.GlobalRegistry across threads.
Multiple workers doing serial proxy+TLS fetches with keepalive:false
creates real cross-thread contention between intern() (worker JS threads)
and deref() (worker HTTP threads).

On ASAN builds, if the 0->1 resurrection race triggers, ASAN catches
the heap-use-after-free — same pattern proven by sslconfig-race-repro.c.
8 workers x 20 waves x 16 concurrent fetches per wave = 2560 total.
Concurrent batches maximize overlap between deref() (HTTP threads)
and intern() (JS threads) across workers sharing the GlobalRegistry.
The CI test never triggered the race (window too narrow, workers
naturally synchronize). This debug-only script DOES reproduce it 3/3.

Key insight: BUN_DEBUG_SSLConfig=1 enables scoped logging in deref()
and destroy(), widening the race window from ~10 CPU cycles to 100μs+
via stderr writes. Driver worker cycles refcount through 0; probe
workers spam intern via fetch+abort in setImmediate loops.

See script header comment for full repro steps.
cirospaciari added a commit that referenced this pull request Mar 7, 2026
Non-deterministic by nature (race window is ~10 CPU cycles in release),
but catches the race ~60% on debug+ASAN without any special env vars
due to debug.deinit() in ref_count.zig widening the window.

Driver worker cycles refcount through 0 via serial proxy+TLS fetches.
Probe workers spam intern() via fetch+abort in setImmediate loops.

For deterministic repro use BUN_DEBUG_SSLConfig=1 — see #27863.
Jarred-Sumner pushed a commit that referenced this pull request Mar 8, 2026
…up (#27838)

## Problem

Segfault at address `0x0` in `create_ssl_context_from_bun_options`
during proxy tunnel setup (Bun v1.3.11, linux x86_64_baseline). The
crash is `strlen(NULL)` on a freed cert/key string.

## Root cause

`SSLConfig.GlobalRegistry` is a dedup cache keyed by TLS option content.
It stored raw pointers without contributing to refcount — a weak cache —
but `intern()` called plain `ref()` on what it found, which blindly does
`fetchAdd(1)` with only a `debugAssert(old > 0)` (no-op in release).

| HTTP thread (`deref`, last holder) | JS thread (`intern`, new fetch
same TLS opts) |
|---|---|
| `fetchSub` 1→0, enters `destroy()` | |
| | locks mutex, `getOrPut` finds dying entry (still in map) |
| | `ref()` → **0→1** (resurrection, no check in release) |
| | unlocks, returns dangling pointer to new fetch |
| locks mutex in `remove()`, evicts, unlocks | |
| `deinit()` frees cert/key strings | |
| `allocator.destroy()` | |
| | proxy tunnel builds SSL ctx from freed config → `strlen(NULL)` |

Proxy amplifies this: `ProxyTunnel.start()` creates a *second* SSL
context from `tls_props` *after* the CONNECT round-trip, giving a
multi-millisecond window for another request's teardown to win the race.

## Fix: Arc/Weak split refcounting

This is the Rust `Arc<T>`/`Weak<T>` pattern (and C++
`shared_ptr`/`weak_ptr`): two atomic counters instead of one.

### New primitive: `ThreadSafeWeakableRefCount`

Added to `src/ptr/ref_count.zig` alongside `ThreadSafeRefCount`.
General-purpose, reusable for any type that needs weak references.

```zig
strong: atomic u32    // live users; 1->0 calls drop_contents()
weak:   atomic u32    // weak holders + (1 if strong > 0); 1->0 calls free_memory()
```

The `+1` on weak is the "collective" weak ref held on behalf of all
strong refs. It guarantees the struct allocation stays live across the
`strong 1→0 → drop_contents()` window, so `upgrade()` is memory-safe as
long as *any* weak ref exists.

| Method | Does |
|---|---|
| `ref()` / `deref()` | bump/drop strong; at strong 1→0:
`drop_contents()` then drop collective weak |
| `weakRef()` / `weakDeref()` | bump/drop weak; at weak 1→0:
`free_memory()` |
| `upgrade()` | CAS-loop on strong, only increments if currently > 0 —
**never revives a dead object** |

### SSLConfig wiring

```zig
const RC = bun.ptr.ThreadSafeWeakableRefCount(@this(), "ref_count", dropContents, freeMemory, .{});

fn dropContents(this: *SSLConfig) void {  // strong 1->0
    GlobalRegistry.remove(this);          // while content intact (map eql needs it)
    this.deinit();                        // free strings
}
fn freeMemory(this: *SSLConfig) void {    // weak 1->0
    bun.default_allocator.destroy(this);
}
```

### Registry now holds weak refs

**`intern()`**
1. `getOrPut` by content.
2. Found existing? `upgrade()` it:
- **Success** (strong was > 0): got a real ref. Free `new_config`,
return existing.
- **Fail** (strong is 0, dying): registry `weakDeref`s the old entry,
replaces the slot with `new_config`.
3. Registry `weakRef`s the winning entry.

**`remove()`** (called from `dropContents`)
- Look up by content hash, check **pointer identity**. If `intern()`
replaced our slot while we were blocked on the mutex, the pointer won't
match → no-op (intern already dropped our weak ref).
- Otherwise evict and `weakDeref`.

### Why `upgrade()` on strong==0 is not a UAF

The weak count holds the allocation. Registry holds a weak ref →
`weak_count ≥ 1` → struct memory is live. The CAS reads a live atomic.
Contents may be garbage (mid-`deinit()`), but `upgrade()` only touches
`strong_count` — never the content.

This is the **key difference** from a single-counter design: memory
safety is compositional (refcounts alone guarantee it), not entangled
with mutex ordering. The mutex here only protects map structure and the
invariant that entry content is intact while in the map.

### Corrected race

| HTTP thread | JS thread | Result |
|---|---|---|
| strong 1→0, blocked in `remove()` | `upgrade()` fails, `weakDeref` old
(2→1), replace slot, `weakRef` new | old: `remove()` ptr-mismatch no-op
→ `deinit` → collective `weakDeref` (1→0) → free. JS thread has a fresh,
valid config. |

## Also

- `packages/bun-usockets/src/crypto/openssl.c`: NULL guards before
`strlen(content)` in `us_ssl_ctx_use_privatekey_content` and
`us_ssl_ctx_use_certificate_chain`. Defense-in-depth — turns a segfault
into a clean SSL error if a NULL ever slips through.
- `test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`: stress test
firing overlapping waves of 8 concurrent proxy fetches with identical
TLS options and `keepalive: false` (forces immediate deref on
completion, no keepalive pool masking the race).


---

## Reproduction & verification

See #27863 for the full reproduction recipe. With a debug+ASAN build and
`BUN_DEBUG_SSLConfig=1` (which widens the race window via stderr
logging), the repro script crashes **3/3 on main** and **passes 3/3 on
this branch**.

**Why it proves the same crash as production:**

The production segfault at `openssl.c:1173` and our debug assertion
failure at `ref_count.zig:476` are the **same root cause**, caught at
different points:

```
  [HTTP thread]                   [JS thread]
  deref() fetchSub 1→0
    ↓ (race window)
                                   intern() finds dying config
                                   ref() → old_count = 0
                                   ┌────────────────────────────┐
                                   │ DEBUG: assertValid → panic │ ← repro catches here
                                   │ RELEASE: silently succeeds │
                                   └────────────────────────────┘
  destroy/deinit/free              returns dangling *SSLConfig
                                   stored in client.tls_props
                                   ... proxy CONNECT ...
                                   tls_props.?.* → reads freed struct
                                   strdup(garbage) → segfault at 0x0  ← production crash
```

In release, `debugAssert(old_count > 0)` is a no-op. `ref()` silently
does 0→1, `intern()` returns the dangling pointer, and the crash
surfaces ~200μs later when `startProxyHandshake` dereferences the freed
struct. Same race, same dangling pointer — debug catches it at the
source, release at the symptom.

**On this branch:** `upgrade()` uses a CAS loop that refuses the 0→1
bump. When it fails, `intern()` replaces the map slot with a fresh
config instead of resurrecting the dying one. The repro logs show many
clean `deref 1 - 0` to different addresses (fresh allocations) with zero
`ref 0 - 1`.

---------

Co-authored-by: Claude <noreply@anthropic.com>
@cirospaciari cirospaciari deleted the claude/prove-sslconfig-uaf-race branch March 27, 2026 23:12
structwafel pushed a commit to structwafel/bun that referenced this pull request Apr 25, 2026
…up (oven-sh#27838)

## Problem

Segfault at address `0x0` in `create_ssl_context_from_bun_options`
during proxy tunnel setup (Bun v1.3.11, linux x86_64_baseline). The
crash is `strlen(NULL)` on a freed cert/key string.

## Root cause

`SSLConfig.GlobalRegistry` is a dedup cache keyed by TLS option content.
It stored raw pointers without contributing to refcount — a weak cache —
but `intern()` called plain `ref()` on what it found, which blindly does
`fetchAdd(1)` with only a `debugAssert(old > 0)` (no-op in release).

| HTTP thread (`deref`, last holder) | JS thread (`intern`, new fetch
same TLS opts) |
|---|---|
| `fetchSub` 1→0, enters `destroy()` | |
| | locks mutex, `getOrPut` finds dying entry (still in map) |
| | `ref()` → **0→1** (resurrection, no check in release) |
| | unlocks, returns dangling pointer to new fetch |
| locks mutex in `remove()`, evicts, unlocks | |
| `deinit()` frees cert/key strings | |
| `allocator.destroy()` | |
| | proxy tunnel builds SSL ctx from freed config → `strlen(NULL)` |

Proxy amplifies this: `ProxyTunnel.start()` creates a *second* SSL
context from `tls_props` *after* the CONNECT round-trip, giving a
multi-millisecond window for another request's teardown to win the race.

## Fix: Arc/Weak split refcounting

This is the Rust `Arc<T>`/`Weak<T>` pattern (and C++
`shared_ptr`/`weak_ptr`): two atomic counters instead of one.

### New primitive: `ThreadSafeWeakableRefCount`

Added to `src/ptr/ref_count.zig` alongside `ThreadSafeRefCount`.
General-purpose, reusable for any type that needs weak references.

```zig
strong: atomic u32    // live users; 1->0 calls drop_contents()
weak:   atomic u32    // weak holders + (1 if strong > 0); 1->0 calls free_memory()
```

The `+1` on weak is the "collective" weak ref held on behalf of all
strong refs. It guarantees the struct allocation stays live across the
`strong 1→0 → drop_contents()` window, so `upgrade()` is memory-safe as
long as *any* weak ref exists.

| Method | Does |
|---|---|
| `ref()` / `deref()` | bump/drop strong; at strong 1→0:
`drop_contents()` then drop collective weak |
| `weakRef()` / `weakDeref()` | bump/drop weak; at weak 1→0:
`free_memory()` |
| `upgrade()` | CAS-loop on strong, only increments if currently > 0 —
**never revives a dead object** |

### SSLConfig wiring

```zig
const RC = bun.ptr.ThreadSafeWeakableRefCount(@this(), "ref_count", dropContents, freeMemory, .{});

fn dropContents(this: *SSLConfig) void {  // strong 1->0
    GlobalRegistry.remove(this);          // while content intact (map eql needs it)
    this.deinit();                        // free strings
}
fn freeMemory(this: *SSLConfig) void {    // weak 1->0
    bun.default_allocator.destroy(this);
}
```

### Registry now holds weak refs

**`intern()`**
1. `getOrPut` by content.
2. Found existing? `upgrade()` it:
- **Success** (strong was > 0): got a real ref. Free `new_config`,
return existing.
- **Fail** (strong is 0, dying): registry `weakDeref`s the old entry,
replaces the slot with `new_config`.
3. Registry `weakRef`s the winning entry.

**`remove()`** (called from `dropContents`)
- Look up by content hash, check **pointer identity**. If `intern()`
replaced our slot while we were blocked on the mutex, the pointer won't
match → no-op (intern already dropped our weak ref).
- Otherwise evict and `weakDeref`.

### Why `upgrade()` on strong==0 is not a UAF

The weak count holds the allocation. Registry holds a weak ref →
`weak_count ≥ 1` → struct memory is live. The CAS reads a live atomic.
Contents may be garbage (mid-`deinit()`), but `upgrade()` only touches
`strong_count` — never the content.

This is the **key difference** from a single-counter design: memory
safety is compositional (refcounts alone guarantee it), not entangled
with mutex ordering. The mutex here only protects map structure and the
invariant that entry content is intact while in the map.

### Corrected race

| HTTP thread | JS thread | Result |
|---|---|---|
| strong 1→0, blocked in `remove()` | `upgrade()` fails, `weakDeref` old
(2→1), replace slot, `weakRef` new | old: `remove()` ptr-mismatch no-op
→ `deinit` → collective `weakDeref` (1→0) → free. JS thread has a fresh,
valid config. |

## Also

- `packages/bun-usockets/src/crypto/openssl.c`: NULL guards before
`strlen(content)` in `us_ssl_ctx_use_privatekey_content` and
`us_ssl_ctx_use_certificate_chain`. Defense-in-depth — turns a segfault
into a clean SSL error if a NULL ever slips through.
- `test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`: stress test
firing overlapping waves of 8 concurrent proxy fetches with identical
TLS options and `keepalive: false` (forces immediate deref on
completion, no keepalive pool masking the race).


---

## Reproduction & verification

See oven-sh#27863 for the full reproduction recipe. With a debug+ASAN build and
`BUN_DEBUG_SSLConfig=1` (which widens the race window via stderr
logging), the repro script crashes **3/3 on main** and **passes 3/3 on
this branch**.

**Why it proves the same crash as production:**

The production segfault at `openssl.c:1173` and our debug assertion
failure at `ref_count.zig:476` are the **same root cause**, caught at
different points:

```
  [HTTP thread]                   [JS thread]
  deref() fetchSub 1→0
    ↓ (race window)
                                   intern() finds dying config
                                   ref() → old_count = 0
                                   ┌────────────────────────────┐
                                   │ DEBUG: assertValid → panic │ ← repro catches here
                                   │ RELEASE: silently succeeds │
                                   └────────────────────────────┘
  destroy/deinit/free              returns dangling *SSLConfig
                                   stored in client.tls_props
                                   ... proxy CONNECT ...
                                   tls_props.?.* → reads freed struct
                                   strdup(garbage) → segfault at 0x0  ← production crash
```

In release, `debugAssert(old_count > 0)` is a no-op. `ref()` silently
does 0→1, `intern()` returns the dangling pointer, and the crash
surfaces ~200μs later when `startProxyHandshake` dereferences the freed
struct. Same race, same dangling pointer — debug catches it at the
source, release at the symptom.

**On this branch:** `upgrade()` uses a CAS loop that refuses the 0→1
bump. When it fails, `intern()` replaces the map slot with a fresh
config instead of resurrecting the dying one. The repro logs show many
clean `deref 1 - 0` to different addresses (fresh allocations) with zero
`ref 0 - 1`.

---------

Co-authored-by: Claude <noreply@anthropic.com>
xhjkl pushed a commit to xhjkl/bun that referenced this pull request May 14, 2026
…up (oven-sh#27838)

## Problem

Segfault at address `0x0` in `create_ssl_context_from_bun_options`
during proxy tunnel setup (Bun v1.3.11, linux x86_64_baseline). The
crash is `strlen(NULL)` on a freed cert/key string.

## Root cause

`SSLConfig.GlobalRegistry` is a dedup cache keyed by TLS option content.
It stored raw pointers without contributing to refcount — a weak cache —
but `intern()` called plain `ref()` on what it found, which blindly does
`fetchAdd(1)` with only a `debugAssert(old > 0)` (no-op in release).

| HTTP thread (`deref`, last holder) | JS thread (`intern`, new fetch
same TLS opts) |
|---|---|
| `fetchSub` 1→0, enters `destroy()` | |
| | locks mutex, `getOrPut` finds dying entry (still in map) |
| | `ref()` → **0→1** (resurrection, no check in release) |
| | unlocks, returns dangling pointer to new fetch |
| locks mutex in `remove()`, evicts, unlocks | |
| `deinit()` frees cert/key strings | |
| `allocator.destroy()` | |
| | proxy tunnel builds SSL ctx from freed config → `strlen(NULL)` |

Proxy amplifies this: `ProxyTunnel.start()` creates a *second* SSL
context from `tls_props` *after* the CONNECT round-trip, giving a
multi-millisecond window for another request's teardown to win the race.

## Fix: Arc/Weak split refcounting

This is the Rust `Arc<T>`/`Weak<T>` pattern (and C++
`shared_ptr`/`weak_ptr`): two atomic counters instead of one.

### New primitive: `ThreadSafeWeakableRefCount`

Added to `src/ptr/ref_count.zig` alongside `ThreadSafeRefCount`.
General-purpose, reusable for any type that needs weak references.

```zig
strong: atomic u32    // live users; 1->0 calls drop_contents()
weak:   atomic u32    // weak holders + (1 if strong > 0); 1->0 calls free_memory()
```

The `+1` on weak is the "collective" weak ref held on behalf of all
strong refs. It guarantees the struct allocation stays live across the
`strong 1→0 → drop_contents()` window, so `upgrade()` is memory-safe as
long as *any* weak ref exists.

| Method | Does |
|---|---|
| `ref()` / `deref()` | bump/drop strong; at strong 1→0:
`drop_contents()` then drop collective weak |
| `weakRef()` / `weakDeref()` | bump/drop weak; at weak 1→0:
`free_memory()` |
| `upgrade()` | CAS-loop on strong, only increments if currently > 0 —
**never revives a dead object** |

### SSLConfig wiring

```zig
const RC = bun.ptr.ThreadSafeWeakableRefCount(@this(), "ref_count", dropContents, freeMemory, .{});

fn dropContents(this: *SSLConfig) void {  // strong 1->0
    GlobalRegistry.remove(this);          // while content intact (map eql needs it)
    this.deinit();                        // free strings
}
fn freeMemory(this: *SSLConfig) void {    // weak 1->0
    bun.default_allocator.destroy(this);
}
```

### Registry now holds weak refs

**`intern()`**
1. `getOrPut` by content.
2. Found existing? `upgrade()` it:
- **Success** (strong was > 0): got a real ref. Free `new_config`,
return existing.
- **Fail** (strong is 0, dying): registry `weakDeref`s the old entry,
replaces the slot with `new_config`.
3. Registry `weakRef`s the winning entry.

**`remove()`** (called from `dropContents`)
- Look up by content hash, check **pointer identity**. If `intern()`
replaced our slot while we were blocked on the mutex, the pointer won't
match → no-op (intern already dropped our weak ref).
- Otherwise evict and `weakDeref`.

### Why `upgrade()` on strong==0 is not a UAF

The weak count holds the allocation. Registry holds a weak ref →
`weak_count ≥ 1` → struct memory is live. The CAS reads a live atomic.
Contents may be garbage (mid-`deinit()`), but `upgrade()` only touches
`strong_count` — never the content.

This is the **key difference** from a single-counter design: memory
safety is compositional (refcounts alone guarantee it), not entangled
with mutex ordering. The mutex here only protects map structure and the
invariant that entry content is intact while in the map.

### Corrected race

| HTTP thread | JS thread | Result |
|---|---|---|
| strong 1→0, blocked in `remove()` | `upgrade()` fails, `weakDeref` old
(2→1), replace slot, `weakRef` new | old: `remove()` ptr-mismatch no-op
→ `deinit` → collective `weakDeref` (1→0) → free. JS thread has a fresh,
valid config. |

## Also

- `packages/bun-usockets/src/crypto/openssl.c`: NULL guards before
`strlen(content)` in `us_ssl_ctx_use_privatekey_content` and
`us_ssl_ctx_use_certificate_chain`. Defense-in-depth — turns a segfault
into a clean SSL error if a NULL ever slips through.
- `test/js/web/fetch/fetch-proxy-tls-intern-race.test.ts`: stress test
firing overlapping waves of 8 concurrent proxy fetches with identical
TLS options and `keepalive: false` (forces immediate deref on
completion, no keepalive pool masking the race).


---

## Reproduction & verification

See oven-sh#27863 for the full reproduction recipe. With a debug+ASAN build and
`BUN_DEBUG_SSLConfig=1` (which widens the race window via stderr
logging), the repro script crashes **3/3 on main** and **passes 3/3 on
this branch**.

**Why it proves the same crash as production:**

The production segfault at `openssl.c:1173` and our debug assertion
failure at `ref_count.zig:476` are the **same root cause**, caught at
different points:

```
  [HTTP thread]                   [JS thread]
  deref() fetchSub 1→0
    ↓ (race window)
                                   intern() finds dying config
                                   ref() → old_count = 0
                                   ┌────────────────────────────┐
                                   │ DEBUG: assertValid → panic │ ← repro catches here
                                   │ RELEASE: silently succeeds │
                                   └────────────────────────────┘
  destroy/deinit/free              returns dangling *SSLConfig
                                   stored in client.tls_props
                                   ... proxy CONNECT ...
                                   tls_props.?.* → reads freed struct
                                   strdup(garbage) → segfault at 0x0  ← production crash
```

In release, `debugAssert(old_count > 0)` is a no-op. `ref()` silently
does 0→1, `intern()` returns the dangling pointer, and the crash
surfaces ~200μs later when `startProxyHandshake` dereferences the freed
struct. Same race, same dangling pointer — debug catches it at the
source, release at the symptom.

**On this branch:** `upgrade()` uses a CAS loop that refuses the 0→1
bump. When it fails, `intern()` replaces the map slot with a fresh
config instead of resurrecting the dying one. The repro logs show many
clean `deref 1 - 0` to different addresses (fresh allocations) with zero
`ref 0 - 1`.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants