
perf(core): Pre-load BPE data on main thread to eliminate redundant per-worker file I/O#1434

Closed
yamadashy wants to merge 1 commit into main from perf/bpe-preload-main-thread

Conversation

@yamadashy
Owner

@yamadashy yamadashy commented Apr 9, 2026

Summary

  • Pre-load gpt-tokenizer BPE rank data once on main thread instead of each worker loading independently from disk
  • Serialize to JSON string (~1.6MB) and pass to workers via warmup task
  • Workers deserialize and build encoder instead of reading from disk

Cherry-picked from fd6b625 (PR #1428)
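The main-thread side of this change can be sketched roughly as follows. This is an illustrative stand-in, not the PR's exact code: `loadBpeRanksSomehow` and the payload shape are assumptions, and the placeholder ranks replace the real ~200K-entry table.

```typescript
// Hypothetical sketch of the main-thread preload path described above.
type BpeRanks = Record<string, number>;

// Stand-in for loading gpt-tokenizer's BPE rank data from disk (assumption;
// the real loader and its data format live in the PR, not here).
async function loadBpeRanksSomehow(encoding: string): Promise<BpeRanks> {
  return { ab: 0, cd: 1 }; // tiny placeholder instead of ~200K entries
}

// Serialize once on the main thread; if preloading fails, omit the field so
// workers fall back to loading from disk themselves.
async function buildWarmupPayload(
  encoding: string,
): Promise<{ content: string; encoding: string; bpeRanksJson?: string }> {
  let bpeRanksJson: string | undefined;
  try {
    bpeRanksJson = JSON.stringify(await loadBpeRanksSomehow(encoding));
  } catch {
    bpeRanksJson = undefined;
  }
  return { content: '', encoding, ...(bpeRanksJson != null && { bpeRanksJson }) };
}
```

The conditional spread mirrors the pattern visible in the reviewed diff: the `bpeRanksJson` field is only present when serialization succeeded.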

Test plan

  • All tests passing
  • Build clean


…er-worker file I/O

Each metrics worker thread independently loaded gpt-tokenizer's BPE rank data
(~3.6MB, 200K entries) from disk. Now the BPE data is loaded once on the main
thread, serialized to a JSON string (~1.6MB), and passed to each worker via the
warmup task. Workers deserialize and build the encoder instead of reading from disk.

Cherry-picked from fd6b625 (PR #1428)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

This change introduces BPE rank preloading capabilities across the metrics system. New APIs enable main-thread preloading of encoding data, serialization to JSON, and transmission to workers, which deserialize and initialize token counters from the preloaded ranks, falling back to disk-based initialization if deserialization fails.

Changes

Cohort / File(s) Summary
BPE Preloading Infrastructure
src/core/metrics/TokenCounter.ts, src/core/metrics/tokenCounterFactory.ts
Added BpeRanks type, loadBpeRanks() async function, TokenCounter.initFromBpeRanks() method for preloaded initialization, and initTokenCounterFromBpeRanks() factory function; shared helper createEncoderFromBpeRanks() eliminates duplicated encoder construction logic.
Metrics Warmup Integration
src/core/metrics/calculateMetrics.ts
Modified createMetricsTaskRunner to preload BPE ranks on main thread, serialize to JSON string, and conditionally include bpeRanksJson in warmup task payloads; falls back to omitting the field if preloading fails.
Worker Task Protocol
src/core/metrics/workers/calculateMetricsWorker.ts
Extended TokenCountTask interface with optional bpeRanksJson field; added deserialization logic in countTokens to detect, parse, and initialize from preloaded BPE data with fallback to disk-backed initialization.
Test Updates
tests/core/metrics/calculateMetrics.test.ts
Updated mocks and assertions to expect bpeRanksJson field in warmup task payloads.
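Based on the table above, the new API surface might look roughly like this. All signatures are inferred from the summary, not copied from the PR; the `Encoder` interface is a simplification of what gpt-tokenizer actually builds.

```typescript
// Inferred shapes of the APIs listed in the changes table (assumptions).
type BpeRanks = Record<string, number>;

interface Encoder {
  rankOf(token: string): number | undefined;
}

// Shared helper: both the disk path and the preloaded path funnel through one
// encoder constructor, which is what removes the duplicated construction logic.
function createEncoderFromBpeRanks(ranks: BpeRanks): Encoder {
  const table = new Map(Object.entries(ranks));
  return { rankOf: (token) => table.get(token) };
}

class TokenCounter {
  private encoder: Encoder | null = null;

  // New method for initialization from preloaded ranks (per the table).
  initFromBpeRanks(ranks: BpeRanks): void {
    this.encoder = createEncoderFromBpeRanks(ranks);
  }

  hasEncoder(): boolean {
    return this.encoder !== null;
  }
}
```

Funneling both initialization paths through `createEncoderFromBpeRanks` means a fix to encoder construction only needs to land in one place.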

Sequence Diagram

sequenceDiagram
    participant MT as Main Thread
    participant TC as TokenCounter
    participant WM as Warmup Manager
    participant W as Worker Thread

    MT->>TC: loadBpeRanks(encoding)
    TC-->>MT: BpeRanks (Promise)
    activate MT
    MT->>MT: JSON.stringify(bpeRanks)
    deactivate MT
    
    MT->>WM: createMetricsTaskRunner(numTasks, encoding)
    WM->>W: warmup with {content, encoding, bpeRanksJson}
    
    W->>W: JSON.parse(bpeRanksJson)
    W->>TC: initTokenCounterFromBpeRanks(encoding, bpeRanks)
    TC-->>W: TokenCounter initialized
    W->>W: countTokens(task)
    W-->>WM: token count result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Description check — ❓ Inconclusive
Explanation: The description provides a clear summary of changes and test plan, but is missing the checklist items specified in the repository's description template.
Resolution: Include the checklist items from the template (npm run test, npm run lint) and mark them as completed to fully conform to the repository standard.
✅ Passed checks (2 passed)
Title check — ✅ Passed
Explanation: The title clearly and specifically describes the main performance optimization: pre-loading BPE data on the main thread to eliminate redundant per-worker file I/O.
Docstring Coverage — ✅ Passed
Explanation: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@github-actions
Contributor

github-actions bot commented Apr 9, 2026

⚡ Performance Benchmark

Latest commit: 6a5a710 perf(core): Pre-load BPE data on main thread to eliminate redundant per-worker file I/O
Status: ✅ Benchmark complete!
Ubuntu: 1.52s (±0.04s) → 1.58s (±0.04s) · +0.06s (+3.9%)
macOS: 1.45s (±0.43s) → 1.59s (±0.49s) · +0.14s (+9.3%)
Windows: 1.81s (±0.03s) → 1.86s (±0.04s) · +0.05s (+2.6%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 46.66667% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.03%. Comparing base (eafa70a) to head (6a5a710).
⚠️ Report is 8 commits behind head on main.

Files with missing lines | Patch % | Missing lines
src/core/metrics/TokenCounter.ts | 50.00% | 7 ⚠️
src/core/metrics/tokenCounterFactory.ts | 16.66% | 5 ⚠️
src/core/metrics/workers/calculateMetricsWorker.ts | 25.00% | 3 ⚠️
src/core/metrics/calculateMetrics.ts | 83.33% | 1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1434      +/-   ##
==========================================
- Coverage   87.32%   87.03%   -0.29%     
==========================================
  Files         117      117              
  Lines        4426     4451      +25     
  Branches     1022     1025       +3     
==========================================
+ Hits         3865     3874       +9     
- Misses        561      577      +16     



Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/core/metrics/workers/calculateMetricsWorker.ts (1)

44-55: Consider logging parse failures at debug level.

The silent catch provides a graceful fallback to disk loading, but swallowing all errors makes debugging difficult if deserialization fails unexpectedly (e.g., corrupted data, memory issues).

💡 Optional: Add debug logging for parse failures
     if (task.bpeRanksJson) {
       try {
         const bpeRanks = JSON.parse(task.bpeRanksJson);
         initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
-      } catch {
+      } catch (error) {
+        logger.debug('Failed to parse pre-loaded BPE data, falling back to disk:', error);
         // Fall through to getTokenCounter which loads from disk
       }
     }


📥 Commits

Reviewing files that changed from the base of the PR and between eafa70a and 6a5a710.

📒 Files selected for processing (5)
  • src/core/metrics/TokenCounter.ts
  • src/core/metrics/calculateMetrics.ts
  • src/core/metrics/tokenCounterFactory.ts
  • src/core/metrics/workers/calculateMetricsWorker.ts
  • tests/core/metrics/calculateMetrics.test.ts

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes the initialization of the TokenCounter by pre-loading BPE rank data on the main thread and passing it to worker threads via IPC, significantly reducing per-worker file I/O overhead. The review feedback highlights a potential race condition regarding task queuing and suggests adding a guard function to prevent redundant JSON parsing in worker threads. I have included the suggested improvements to the worker initialization logic to ensure efficiency.

Comment on lines +64 to 70
const warmupPromise = bpeRanksJsonPromise.then((bpeRanksJson) =>
Promise.all(
Array.from({ length: maxThreads }, () =>
taskRunner.run({ content: '', encoding, ...(bpeRanksJson != null && { bpeRanksJson }) }).catch(() => 0),
),
),
);

high

There is a potential race condition in how the warmup tasks are queued. createMetricsTaskRunner returns the taskRunner immediately, but the warmup tasks are only queued after the BPE data is loaded and serialized (which is an async operation). If the caller submits real tasks to the runner immediately after calling createMetricsTaskRunner, those tasks will likely be queued before the warmup tasks, causing the workers to perform the expensive disk I/O anyway.

To ensure the optimization is effective, the caller must await the warmupPromise before submitting real tasks, or the initialization logic should be adjusted to ensure warmup tasks are prioritized at the head of the queue.
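The ordering fix this comment proposes can be shown with a toy example. The runner below is a single-process stand-in, not the PR's actual worker-pool runner; the point is only that callers await `warmupPromise` before enqueuing real work.

```typescript
// Toy illustration of the proposed fix: await warmup before real tasks, so
// workers receive the preloaded BPE payload before any real work arrives.
const executionOrder: string[] = [];

// Stand-in for taskRunner.run (assumption; records order instead of counting tokens).
async function runTask(name: string): Promise<void> {
  executionOrder.push(name);
}

// Stand-in for createMetricsTaskRunner: kicks off warmup and exposes its promise.
function createRunnerAndWarmup() {
  const warmupPromise = runTask('warmup');
  return { run: runTask, warmupPromise };
}

async function caller(): Promise<string[]> {
  const runner = createRunnerAndWarmup();
  await runner.warmupPromise; // key line: do not enqueue real work earlier
  await runner.run('real-task');
  return executionOrder;
}
```

Exposing the warmup promise (rather than fire-and-forgetting it) is what gives callers the ability to enforce this ordering at all.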

import { logger, setLogLevelByWorkerData } from '../../../shared/logger.js';
import type { TokenEncoding } from '../TokenCounter.js';
import { freeTokenCounters, getTokenCounter } from '../tokenCounterFactory.js';
import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks } from '../tokenCounterFactory.js';

medium

Import the new guard function to check for initialization status.

Suggested change
import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks } from '../tokenCounterFactory.js';
import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks, isTokenCounterInitialized } from '../tokenCounterFactory.js';

Comment on lines +48 to +55
if (task.bpeRanksJson) {
try {
const bpeRanks = JSON.parse(task.bpeRanksJson);
initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
} catch {
// Fall through to getTokenCounter which loads from disk
}
}

medium

The worker currently parses the bpeRanksJson string every time it's provided in a task. While it's primarily intended for warmup tasks, adding a guard check here prevents expensive and redundant JSON parsing (of a ~1.6MB string) if multiple tasks with BPE data are received by the same worker before it has finished initializing.

Suggested change
if (task.bpeRanksJson) {
try {
const bpeRanks = JSON.parse(task.bpeRanksJson);
initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
} catch {
// Fall through to getTokenCounter which loads from disk
}
}
if (task.bpeRanksJson && !isTokenCounterInitialized(task.encoding)) {
try {
const bpeRanks = JSON.parse(task.bpeRanksJson);
initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
} catch {
// Fall through to getTokenCounter which loads from disk
}
}

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.


@yamadashy yamadashy closed this Apr 11, 2026