CachedTypes: stop encoding source text into CachedSourceCodeKey#186

Merged
Jarred-Sumner merged 2 commits into
mainfrom
claude/bytecode-cache-skip-source-string
Apr 21, 2026

@dylan-conway dylan-conway commented Apr 20, 2026

Member

Summary

CachedStringSourceProvider::encode serializes the full source text into the bytecode cache (it appears verbatim at byte offset 272 of every .jsc blob), and CachedStringSourceProvider::decode heap-allocates a fresh copy via CachedString::decode → AtomStringImpl::add. That copy is then pinned by a CachedRefPtr finalizer for the lifetime of the Decoder, which every UnlinkedFunctionExecutable with a lazy body keeps alive.

Under BUN_JSC_ADDITIONS, SourceCodeKey::operator== already skips string() == other.string(), so neither the on-disk copy nor the heap copy is ever read. The source text ends up resident three times at startup: the StandaloneModuleGraph mmap, the .jsc mmap, and a heap AtomStringImpl of the same bytes.

This change, gated on BUN_JSC_ADDITIONS:

  • encode: stores source().length() instead of the source bytes
  • decode: reuses decoder.provider() when sourceType and length match, instead of allocating a new StringSourceProvider; otherwise builds an empty provider so the key comparison rejects the entry conservatively
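
The two bullets above can be sketched as a small standalone model. This is not the code in CachedTypes.cpp: `Provider`, `CachedProvider`, `encodeProvider`, and `decodeProvider` are hypothetical stand-ins, and `std::shared_ptr` is used in place of JSC's reference counting. It only illustrates the idea of storing a length at encode time and reusing the runtime provider at decode time.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical stand-ins for JSC types; names are illustrative only.
enum class SourceType : uint8_t { Program, Module };

struct Provider {
    SourceType type;
    std::string text;
};

// What the cache stores for the key's provider: a length, not the bytes.
struct CachedProvider {
    SourceType type;
    uint32_t sourceLength;
};

// Encode: record only the source type and the source length.
CachedProvider encodeProvider(const Provider& provider)
{
    return { provider.type, static_cast<uint32_t>(provider.text.size()) };
}

// Decode: reuse the runtime provider when type and length match. Otherwise
// return an empty provider, so a later length comparison rejects the entry.
std::shared_ptr<const Provider> decodeProvider(const CachedProvider& cached,
                                               std::shared_ptr<const Provider> runtime)
{
    if (runtime && runtime->type == cached.type
        && runtime->text.size() == cached.sourceLength)
        return runtime;
    return std::make_shared<const Provider>(Provider { cached.type, std::string() });
}
```

In this model a matching runtime provider is returned as-is (no new allocation of source bytes), while a missing or mismatched provider yields a zero-length one, which is the conservative rejection path described above.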

Why this is safe — where errors/stacks/toString actually get source

The bytecode never carried source for runtime use. There are two independent data flows that meet at ScriptExecutable:

StandaloneModuleGraph (mmap'd __BUN section)
   │
   ├──► source bytes ──► ModuleLoader.zig:1213 .source_code = file.toWTFString()
   │                             │
   │                             ▼
   │                    ZigSourceProvider (m_source holds the bytes)      ◄── COPY #1
   │                             │
   │                             ▼
   │                    SourceCode { provider = ZigSourceProvider, start, end }
   │                             │
   │                             ▼
   │               ┌───────────────────────────────────┐
   │               │  ScriptExecutable                 │
   │               │   ├─ m_source: SourceCode  ◄──────┼── set in ctor (ScriptExecutable.cpp:52)
   │               │   └─ m_unlinkedCodeBlock  ◄──┐    │
   │               └──────────────────────────────┼────┘
   │                       │                      │
   │     toString/stack/   │                      │
   │     reparse all read  ▼                      │
   │     m_source ────► CodeBlock::source() = m_ownerExecutable->source()  (CodeBlock.h:432)
   │
   └──► bytecode bytes ──► ModuleLoader.zig:1217 .bytecode_cache = file.bytecode.ptr
                                 │
                                 ▼
                        ZigSourceProvider::m_cachedBytecode  (ZigSourceProvider.cpp:136)
                                 │
                                 ▼
                        CodeCache::fetchFromDiskImpl → decodeCodeBlock
                                 │
                  ┌──────────────┼─────────────────────────────────────────┐
                  │ decodeCodeBlockImpl                                    │
                  │   Decoder created with m_provider = &runtime provider  │
                  │   entry.first  = decoded SourceCodeKey                 │
                  │     └─ m_sourceCode.m_provider = (was: NEW heap copy ◄─┼─ COPY #3, never read
                  │                                   now: reuse runtime)  │
                  │   entry.second = UnlinkedCodeBlock  (no provider field)│
                  │   if (entry.first != key) return null;   ◄─ only use   │
                  │   return entry.second  ────────────────────────────────┼──┐
                  └────────────────────────────────────────────────────────┘  │
                                                                     ─────────┘

UnlinkedCodeBlock has no SourceProvider field — only offsets. Those offsets are resolved against ScriptExecutable::m_source, which is the runtime SourceCode set before decodeCodeBlock is called:

  • Stack traces / errors → CodeBlock::source() → m_ownerExecutable->source() (CodeBlock.h:432)
  • Function.prototype.toString → FunctionExecutable::source() → ScriptExecutable::m_source
  • Re-parse fallback → UnlinkedFunctionExecutable::linkedSourceCode(parentSource) → SourceCode(parentSource.provider(), startOffset, ...) (UnlinkedFunctionExecutable.cpp:191)

The CachedSourceCodeKey's provider (entry.first.m_sourceCode.m_provider) is a separate object that exists for ~10 stack frames inside decodeCodeBlockImpl, is read once by operator== (which under BUN_JSC_ADDITIONS ignores its source bytes), and is destroyed when entry.first goes out of scope. It was never wired into any executable.
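
To make the "ignores its source bytes" point concrete, here is a reduced model of a key comparison that checks length and host but never the text. `KeyModel` and `keysMatch` are hypothetical; the real SourceCodeKey::operator== likely compares further fields that are not modeled here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, reduced model of a cache key. Only the fields this PR
// discusses are modeled; the real SourceCodeKey carries more state.
struct KeyModel {
    std::string host;   // stands in for sourceOrigin().url().host()
    uint32_t length;    // stands in for length()
    std::string text;   // source bytes: present, but not compared
};

// Equality that skips the byte comparison, as described for BUN_JSC_ADDITIONS.
bool keysMatch(const KeyModel& a, const KeyModel& b)
{
    return a.host == b.host && a.length == b.length;
}
```

Under this model a decoded key with the right length matches even though its text is empty, and a key built from a zero-length fallback provider does not match any non-empty source.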

Format compatibility

m_sourceLength (4 bytes) replaces CachedString m_source in the on-disk layout. GenericCacheEntry::decode checks isUpToDate() (the m_cacheVersion uint32 at offset 0) before touching m_key, so an old-format .jsc is rejected before the layout change is read. m_cacheVersion = hash(BUN_WEBKIT_VERSION) changes when Bun bumps its WebKit pin to pick up this PR.

Verified empirically: old .jsc (system bun) + new decoder → [Disk Cache] Cache miss → falls back to parse, output correct. New .jsc + new decoder → [Disk Cache] Cache hit → output correct.
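
The version gate described above can be sketched as follows. The layout here (a 32-bit version at offset 0 followed directly by a 32-bit source length) is a simplifying assumption for illustration, and `decodeIfUpToDate`, `makeBlob`, and `DecodedHeader` are hypothetical names rather than JSC API. The point is only the ordering: the version is checked before any later bytes are interpreted.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical simplified layout: version at offset 0, then a source length.
struct DecodedHeader {
    uint32_t sourceLength;
};

// Build a sample blob in the hypothetical layout.
std::vector<uint8_t> makeBlob(uint32_t version, uint32_t sourceLength)
{
    std::vector<uint8_t> blob(sizeof(version) + sizeof(sourceLength));
    std::memcpy(blob.data(), &version, sizeof(version));
    std::memcpy(blob.data() + sizeof(version), &sourceLength, sizeof(sourceLength));
    return blob;
}

// Check the version first; only a matching blob has its remaining bytes read.
std::optional<DecodedHeader> decodeIfUpToDate(const std::vector<uint8_t>& blob,
                                              uint32_t expectedVersion)
{
    uint32_t version = 0;
    if (blob.size() < sizeof(version) + sizeof(uint32_t))
        return std::nullopt;
    std::memcpy(&version, blob.data(), sizeof(version));
    if (version != expectedVersion)
        return std::nullopt; // stale format: never interpret the rest
    DecodedHeader header {};
    std::memcpy(&header.sourceLength, blob.data() + sizeof(version),
                sizeof(header.sourceLength));
    return header;
}
```

A caller that gets `std::nullopt` back would treat it as a cache miss and fall back to parsing, which mirrors the old-encoder/new-decoder behaviour reported above.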

Effect (release build, macOS arm64)

| App | .jsc size | top-level decode footprint Δ |
| --- | --- | --- |
| 4.9 MB bundle, before | 31.25 MB | +4.92 MB |
| 4.9 MB bundle, after | 26.37 MB | +16 KB |
| 17.8 MB bundle, before | 97.32 MB | +17.84 MB |
| 17.8 MB bundle, after | 79.53 MB | +32 KB |

BUN_JSC_verboseDiskCache=1 confirms [Disk Cache] Cache hit after the change.

Test plan

  • bun build --bytecode --compile output runs and produces identical stdout
  • BUN_JSC_verboseDiskCache=1 shows Cache hit for sourceCode (not falling back to reparse)
  • Debug-local build (assertions on): Function.prototype.toString, Error.stack, nested-lazy-decode, async, eval, new Function all pass with cache hit
  • Cross-version .jsc (old encoder + new decoder) cleanly rejected via m_cacheVersion, falls back to parse
  • Bun test/bundler/bundler_compile.test.ts — 54/54 pass (release)
  • Bun test/bundler/bundler_banner.test.ts — 11/11 pass
  • Bun test/bundler/bun-build-api.test.ts -t bytecode — 1/1 pass
  • Bun CI with bumped WebKit pin

CachedStringSourceProvider::encode serializes the full source text into
the bytecode cache (it lands at byte offset 272 of every .jsc blob), and
CachedStringSourceProvider::decode reconstructs it via
CachedString::decode → AtomStringImpl::add, heap-allocating a fresh
copy. The result is held by a CachedRefPtr finalizer for the lifetime of
the Decoder, which in turn is kept alive by every UnlinkedFunctionExecutable
with a lazy body.

Under BUN_JSC_ADDITIONS, SourceCodeKey::operator== already skips
string() == other.string(), so neither the on-disk copy nor the heap
copy is ever read. They are pure overhead: ~source_size bytes of .jsc
bloat plus ~source_size bytes of dirty footprint at startup, with the
source text resident three times (StandaloneModuleGraph mmap, .jsc mmap,
heap AtomStringImpl).

Encode source().length() instead of the bytes, and at decode reuse the
SourceProvider the Decoder was constructed with (decoder.provider())
when its sourceType and length match. The fallback path constructs an
empty provider so the key length comparison rejects the entry.

coderabbitai Bot commented Apr 20, 2026


Walkthrough

Under USE(BUN_JSC_ADDITIONS), CachedStringSourceProvider::encode now records only the source length; decode tries to reuse an existing runtime SourceProvider when type and length match, otherwise it skips cached source bytes and synthesizes an empty StringSourceProvider. Non-BUN behavior is unchanged.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CachedStringSourceProvider Optimization: Source/JavaScriptCore/runtime/CachedTypes.cpp | Under USE(BUN_JSC_ADDITIONS), encode stores m_sourceLength instead of serializing full source bytes. decode may return the runtime's existing SourceProvider if sourceType and provider->source().length() match m_sourceLength; otherwise it skips decoding the cached string bytes (leaving decodedSource empty) and constructs a StringSourceProvider from the synthesized empty source. Non-BUN encode/decode continue to serialize/deserialize the full source. |
🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description deviates significantly from the required WebKit template structure. It lacks a Bugzilla bug reference, formal commit message format, and the required template sections (bug title, Bugzilla link, Reviewed by, etc.), instead providing a detailed technical explanation in a different format. | Reformat the PR description to follow the WebKit template: include Bugzilla bug ID, 'Reviewed by NOBODY (OOPS!)', explanation section, and list of changed paths with function/method names as specified in the template. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: stopping the encoding of source text into CachedSourceCodeKey, which aligns with the primary motivation documented in the description. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Source/JavaScriptCore/runtime/CachedTypes.cpp`:
- Around line 1665-1671: The code unsafely reinterpret_casts a SourceProvider to
StringSourceProvider after matching sourceType and mutates the reused provider
via Base::decode(decoder, *provider), which can cause UB or race conditions; add
a runtime type check (e.g., virtual asStringSourceProvider() or dynamic_cast
when RTTI is available) on the object returned by decoder.provider() before
casting and only cast when that check confirms a StringSourceProvider, and avoid
mutating a shared provider by either decoding into a fresh copy or using a
non-mutating decode API/clone before calling Base::decode so that
sourceURLDirective, sourceMappingURLDirective, and sourceTaintedOrigin are not
changed on a provider that may be shared.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between c0eac76 and d3a0896.

📒 Files selected for processing (1)
  • Source/JavaScriptCore/runtime/CachedTypes.cpp

Comment thread Source/JavaScriptCore/runtime/CachedTypes.cpp

github-actions Bot commented Apr 20, 2026


Preview Builds

| Commit | Release | Date |
| --- | --- | --- |
| 60f29941 | autobuild-preview-pr-186-60f29941 | 2026-04-20 23:53:52 UTC |
| d3a08965 | autobuild-preview-pr-186-d3a08965 | 2026-04-20 08:33:03 UTC |

dylan-conway added a commit to oven-sh/bun that referenced this pull request Apr 20, 2026
Pulls in the CachedTypes change that stops encoding the bundled source
text into CachedSourceCodeKey. Under BUN_JSC_ADDITIONS the
SourceCodeKey::operator== string comparison is already skipped, so the
encoded source bytes were never read — but they were still written into
every .jsc blob and heap-allocated as an AtomStringImpl during
decodeCodeBlockImpl, then pinned by a Decoder finalizer for the lifetime
of every UnlinkedFunctionExecutable with a lazy body.

Preview pin so CI can run the full test suite against the WebKit change
before it merges; will be replaced with the real commit hash once #186
lands.

Addresses review feedback on the reuse path:

- Change the BUN_JSC_ADDITIONS decode() return type to SourceProvider*.
  The only caller (CachedSourceProvider::decode) already returns
  SourceProvider*, so the prior reinterpret_cast through the
  StringSourceProvider sibling was unnecessary type-punning. The runtime
  provider's leakRef() now upcasts naturally.

- Drop Base::decode(decoder, *provider) for the reuse path. It mutated
  the runtime provider's sourceURLDirective / sourceMappingURLDirective /
  sourceTaintedOrigin with values encoded from the compile-time provider.
  The runtime provider already carries the correct values; the decoded
  key needs only sourceOrigin().url().host() and length() for the
  equality check, neither of which Base::decode supplies.
  CachedSourceProviderShape fields are offset-addressed (not
  stream-positioned), so leaving them undecoded does not affect
  subsequent reads.

No on-disk format change vs the previous commit.
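
The "offset-addressed (not stream-positioned)" point can be illustrated with a tiny model: when each field is read from a fixed offset, leaving one field undecoded has no effect on reading another. The field names and offsets below are hypothetical and do not reflect the real CachedSourceProviderShape layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed layout: each field lives at a known byte offset.
constexpr std::size_t kUrlDirectiveOffset = 0;
constexpr std::size_t kTaintedOriginOffset = 4;
constexpr std::size_t kSourceLengthOffset = 8;

// Read one 32-bit field directly from its offset; no cursor is advanced.
// The caller must ensure offset + 4 <= record.size().
uint32_t readFieldAt(const std::vector<uint8_t>& record, std::size_t offset)
{
    uint32_t value = 0;
    std::memcpy(&value, record.data() + offset, sizeof(value));
    return value;
}

// Write one 32-bit field at its offset (used here only to build a sample record).
void writeFieldAt(std::vector<uint8_t>& record, std::size_t offset, uint32_t value)
{
    std::memcpy(record.data() + offset, &value, sizeof(value));
}
```

With a stream-positioned format, skipping a field would leave the read cursor in the wrong place; with fixed offsets, the length field can be read without ever touching the two fields before it.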

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Source/JavaScriptCore/runtime/CachedTypes.cpp`:
- Around line 1682-1686: The fallback in decodeSourceCodeKey() creates an empty
StringSourceProvider (String decodedSource) so provider()->source().length()
becomes 0, causing decodedKey == key to always miss when decoder.provider() is
absent (affecting isCachedBytecodeStillValid() and decodeSourceCodeKey());
instead either thread the original key.source().provider() into the Decoder
constructed in the validity-check path (use key.source().provider() when
decoder.provider() is missing) or, if synthesising a provider, preserve the
original source length by setting m_sourceLength to key.source().length() rather
than hard-coding 0; update decodeSourceCodeKey(), the Decoder construction sites
used by isCachedBytecodeStillValid(), and any StringSourceProvider fallback
logic to use key.source().provider() or preserve m_sourceLength accordingly.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between d3a0896 and 60f2994.

📒 Files selected for processing (1)
  • Source/JavaScriptCore/runtime/CachedTypes.cpp

Comment on lines +1682 to +1686
// Fallback for callers that did not supply a provider: decode without source
// bytes. SourceCodeKey::operator== ignores string(), but length() is compared,
// so synthesize a provider whose source() is empty — length() will mismatch
// and the cache entry will be rejected, which is the conservative behaviour.
String decodedSource;


⚠️ Potential issue | 🟠 Major

Don't make key-only decode paths synthesize a zero-length provider.

When decoder.provider() is absent, Lines 1682-1686 fall back to an empty StringSourceProvider, so source().length() becomes 0. That breaks local no-provider key comparisons under USE(BUN_JSC_ADDITIONS): isCachedBytecodeStillValid() still constructs its Decoder without a provider on Line 2779, so decodedKey == key now becomes an unconditional cache miss for every non-empty source. decodeSourceCodeKey() on Line 2740 is lossy for the same reason.

At minimum, thread key.source().provider() into the validity-check path; otherwise this fallback needs to preserve m_sourceLength instead of hard-coding 0.

🔧 Minimal fix for the validity-check path
 bool isCachedBytecodeStillValid(VM& vm, Ref<CachedBytecode> cachedBytecode, const SourceCodeKey& key, SourceCodeType type)
 {
     auto span = cachedBytecode->span();
     if (span.empty())
         return false;
     auto* cachedEntry = std::bit_cast<const GenericCacheEntry*>(span.data());
-    Ref decoder = Decoder::create(vm, WTF::move(cachedBytecode));
+    Ref decoder = Decoder::create(vm, WTF::move(cachedBytecode), &key.source().provider());
     return cachedEntry->isStillValid(decoder.get(), key, tagFromSourceCodeType(type));
 }

@Jarred-Sumner Jarred-Sumner merged commit 4b07413 into main Apr 21, 2026
35 checks passed
@dylan-conway dylan-conway deleted the claude/bytecode-cache-skip-source-string branch April 21, 2026 00:46
dylan-conway added a commit to oven-sh/bun that referenced this pull request Apr 21, 2026
Stops encoding source text into CachedSourceCodeKey under
BUN_JSC_ADDITIONS — the decoded key only feeds SourceCodeKey::operator==
which already skips the byte comparison there. Removes a per-chunk
~source_size AtomStringImpl heap allocation at decode and the matching
bytes from each .jsc on disk.