
perf(core): Pre-load BPE data on main thread to eliminate redundant per-worker file I/O#1434

Closed
yamadashy wants to merge 1 commit into main from perf/bpe-preload-main-thread

Conversation

@yamadashy
Owner

@yamadashy yamadashy commented Apr 9, 2026

Summary

  • Pre-load gpt-tokenizer BPE rank data once on main thread instead of each worker loading independently from disk
  • Serialize to JSON string (~1.6MB) and pass to workers via warmup task
  • Workers deserialize and build encoder instead of reading from disk

Cherry-picked from fd6b625 (PR #1428)
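The main-thread side of this change can be sketched roughly as follows. This is an illustrative stand-in, not the PR's exact code: `loadBpeRanksSomehow` and the payload shape are assumptions, and the placeholder ranks replace the real ~200K-entry table.

```typescript
// Hypothetical sketch of the main-thread preload path described above.
type BpeRanks = Record<string, number>;

// Stand-in for loading gpt-tokenizer's BPE rank data from disk (assumption;
// the real loader and its data format live in the PR, not here).
async function loadBpeRanksSomehow(encoding: string): Promise<BpeRanks> {
  return { ab: 0, cd: 1 }; // tiny placeholder instead of ~200K entries
}

// Serialize once on the main thread; if preloading fails, omit the field so
// workers fall back to loading from disk themselves.
async function buildWarmupPayload(
  encoding: string,
): Promise<{ content: string; encoding: string; bpeRanksJson?: string }> {
  let bpeRanksJson: string | undefined;
  try {
    bpeRanksJson = JSON.stringify(await loadBpeRanksSomehow(encoding));
  } catch {
    bpeRanksJson = undefined;
  }
  return { content: '', encoding, ...(bpeRanksJson != null && { bpeRanksJson }) };
}
```

The conditional spread mirrors the pattern visible in the reviewed diff: the `bpeRanksJson` field is only present when serialization succeeded.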

Test plan

  • All tests passing
  • Build clean


…er-worker file I/O

Each metrics worker thread independently loaded gpt-tokenizer's BPE rank data
(~3.6MB, 200K entries) from disk. Now the BPE data is loaded once on the main
thread, serialized to a JSON string (~1.6MB), and passed to each worker via the
warmup task. Workers deserialize and build the encoder instead of reading from disk.

Cherry-picked from fd6b625 (PR #1428)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

This change introduces BPE rank preloading capabilities across the metrics system. New APIs enable main-thread preloading of encoding data, serialization to JSON, and transmission to workers, which deserialize and initialize token counters from the preloaded ranks, falling back to disk-based initialization if deserialization fails.

Changes

Cohort / File(s) Summary
BPE Preloading Infrastructure
src/core/metrics/TokenCounter.ts, src/core/metrics/tokenCounterFactory.ts
Added BpeRanks type, loadBpeRanks() async function, TokenCounter.initFromBpeRanks() method for preloaded initialization, and initTokenCounterFromBpeRanks() factory function; shared helper createEncoderFromBpeRanks() eliminates duplicated encoder construction logic.
Metrics Warmup Integration
src/core/metrics/calculateMetrics.ts
Modified createMetricsTaskRunner to preload BPE ranks on main thread, serialize to JSON string, and conditionally include bpeRanksJson in warmup task payloads; falls back to omitting the field if preloading fails.
Worker Task Protocol
src/core/metrics/workers/calculateMetricsWorker.ts
Extended TokenCountTask interface with optional bpeRanksJson field; added deserialization logic in countTokens to detect, parse, and initialize from preloaded BPE data with fallback to disk-backed initialization.
Test Updates
tests/core/metrics/calculateMetrics.test.ts
Updated mocks and assertions to expect bpeRanksJson field in warmup task payloads.
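Based on the table above, the new API surface might look roughly like this. All signatures are inferred from the summary, not copied from the PR; the `Encoder` interface is a simplification of what gpt-tokenizer actually builds.

```typescript
// Inferred shapes of the APIs listed in the changes table (assumptions).
type BpeRanks = Record<string, number>;

interface Encoder {
  rankOf(token: string): number | undefined;
}

// Shared helper: both the disk path and the preloaded path funnel through one
// encoder constructor, which is what removes the duplicated construction logic.
function createEncoderFromBpeRanks(ranks: BpeRanks): Encoder {
  const table = new Map(Object.entries(ranks));
  return { rankOf: (token) => table.get(token) };
}

class TokenCounter {
  private encoder: Encoder | null = null;

  // New method for initialization from preloaded ranks (per the table).
  initFromBpeRanks(ranks: BpeRanks): void {
    this.encoder = createEncoderFromBpeRanks(ranks);
  }

  hasEncoder(): boolean {
    return this.encoder !== null;
  }
}
```

Funneling both initialization paths through `createEncoderFromBpeRanks` means a fix to encoder construction only needs to land in one place.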

Sequence Diagram

sequenceDiagram
    participant MT as Main Thread
    participant TC as TokenCounter
    participant WM as Warmup Manager
    participant W as Worker Thread

    MT->>TC: loadBpeRanks(encoding)
    TC-->>MT: BpeRanks (Promise)
    activate MT
    MT->>MT: JSON.stringify(bpeRanks)
    deactivate MT
    
    MT->>WM: createMetricsTaskRunner(numTasks, encoding)
    WM->>W: warmup with {content, encoding, bpeRanksJson}
    
    W->>W: JSON.parse(bpeRanksJson)
    W->>TC: initTokenCounterFromBpeRanks(encoding, bpeRanks)
    TC-->>W: TokenCounter initialized
    W->>W: countTokens(task)
    W-->>WM: token count result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Description check — ❓ Inconclusive
Explanation: The description provides a clear summary of changes and test plan, but is missing the checklist items specified in the repository's description template.
Resolution: Include the checklist items from the template (npm run test, npm run lint) and mark them as completed to fully conform to the repository standard.
✅ Passed checks (2 passed)
Title check — ✅ Passed
Explanation: The title clearly and specifically describes the main performance optimization: pre-loading BPE data on the main thread to eliminate redundant per-worker file I/O.
Docstring Coverage — ✅ Passed
Explanation: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@github-actions
Contributor

github-actions bot commented Apr 9, 2026

⚡ Performance Benchmark

Latest commit: 6a5a710 perf(core): Pre-load BPE data on main thread to eliminate redundant per-worker file I/O
Status: ✅ Benchmark complete!
Ubuntu: 1.52s (±0.04s) → 1.58s (±0.04s) · +0.06s (+3.9%)
macOS: 1.45s (±0.43s) → 1.59s (±0.49s) · +0.14s (+9.3%)
Windows: 1.81s (±0.03s) → 1.86s (±0.04s) · +0.05s (+2.6%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 46.66667% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.03%. Comparing base (eafa70a) to head (6a5a710).
⚠️ Report is 8 commits behind head on main.

Files with missing lines | Patch % | Missing lines
src/core/metrics/TokenCounter.ts | 50.00% | 7 ⚠️
src/core/metrics/tokenCounterFactory.ts | 16.66% | 5 ⚠️
src/core/metrics/workers/calculateMetricsWorker.ts | 25.00% | 3 ⚠️
src/core/metrics/calculateMetrics.ts | 83.33% | 1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1434      +/-   ##
==========================================
- Coverage   87.32%   87.03%   -0.29%     
==========================================
  Files         117      117              
  Lines        4426     4451      +25     
  Branches     1022     1025       +3     
==========================================
+ Hits         3865     3874       +9     
- Misses        561      577      +16     



Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/core/metrics/workers/calculateMetricsWorker.ts (1)

44-55: Consider logging parse failures at debug level.

The silent catch provides a graceful fallback to disk loading, but swallowing all errors makes debugging difficult if deserialization fails unexpectedly (e.g., corrupted data, memory issues).

💡 Optional: Add debug logging for parse failures
     if (task.bpeRanksJson) {
       try {
         const bpeRanks = JSON.parse(task.bpeRanksJson);
         initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
-      } catch {
+      } catch (error) {
+        logger.debug('Failed to parse pre-loaded BPE data, falling back to disk:', error);
         // Fall through to getTokenCounter which loads from disk
       }
     }


📥 Commits

Reviewing files that changed from the base of the PR and between eafa70a and 6a5a710.

📒 Files selected for processing (5)
  • src/core/metrics/TokenCounter.ts
  • src/core/metrics/calculateMetrics.ts
  • src/core/metrics/tokenCounterFactory.ts
  • src/core/metrics/workers/calculateMetricsWorker.ts
  • tests/core/metrics/calculateMetrics.test.ts

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes the initialization of the TokenCounter by pre-loading BPE rank data on the main thread and passing it to worker threads via IPC, significantly reducing per-worker file I/O overhead. The review feedback highlights a potential race condition regarding task queuing and suggests adding a guard function to prevent redundant JSON parsing in worker threads. I have included the suggested improvements to the worker initialization logic to ensure efficiency.

Comment on lines +64 to 70
const warmupPromise = bpeRanksJsonPromise.then((bpeRanksJson) =>
Promise.all(
Array.from({ length: maxThreads }, () =>
taskRunner.run({ content: '', encoding, ...(bpeRanksJson != null && { bpeRanksJson }) }).catch(() => 0),
),
),
);

high

There is a potential race condition in how the warmup tasks are queued. createMetricsTaskRunner returns the taskRunner immediately, but the warmup tasks are only queued after the BPE data is loaded and serialized (which is an async operation). If the caller submits real tasks to the runner immediately after calling createMetricsTaskRunner, those tasks will likely be queued before the warmup tasks, causing the workers to perform the expensive disk I/O anyway.

To ensure the optimization is effective, the caller must await the warmupPromise before submitting real tasks, or the initialization logic should be adjusted to ensure warmup tasks are prioritized at the head of the queue.
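The ordering fix this comment proposes can be shown with a toy example. The runner below is a single-process stand-in, not the PR's actual worker-pool runner; the point is only that callers await `warmupPromise` before enqueuing real work.

```typescript
// Toy illustration of the proposed fix: await warmup before real tasks, so
// workers receive the preloaded BPE payload before any real work arrives.
const executionOrder: string[] = [];

// Stand-in for taskRunner.run (assumption; records order instead of counting tokens).
async function runTask(name: string): Promise<void> {
  executionOrder.push(name);
}

// Stand-in for createMetricsTaskRunner: kicks off warmup and exposes its promise.
function createRunnerAndWarmup() {
  const warmupPromise = runTask('warmup');
  return { run: runTask, warmupPromise };
}

async function caller(): Promise<string[]> {
  const runner = createRunnerAndWarmup();
  await runner.warmupPromise; // key line: do not enqueue real work earlier
  await runner.run('real-task');
  return executionOrder;
}
```

Exposing the warmup promise (rather than fire-and-forgetting it) is what gives callers the ability to enforce this ordering at all.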

import { logger, setLogLevelByWorkerData } from '../../../shared/logger.js';
import type { TokenEncoding } from '../TokenCounter.js';
import { freeTokenCounters, getTokenCounter } from '../tokenCounterFactory.js';
import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks } from '../tokenCounterFactory.js';

medium

Import the new guard function to check for initialization status.

Suggested change
import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks } from '../tokenCounterFactory.js';
import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks, isTokenCounterInitialized } from '../tokenCounterFactory.js';

Comment on lines +48 to +55
if (task.bpeRanksJson) {
try {
const bpeRanks = JSON.parse(task.bpeRanksJson);
initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
} catch {
// Fall through to getTokenCounter which loads from disk
}
}

medium

The worker currently parses the bpeRanksJson string every time it's provided in a task. While it's primarily intended for warmup tasks, adding a guard check here prevents expensive and redundant JSON parsing (of a ~1.6MB string) if multiple tasks with BPE data are received by the same worker before it has finished initializing.

Suggested change
if (task.bpeRanksJson) {
try {
const bpeRanks = JSON.parse(task.bpeRanksJson);
initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
} catch {
// Fall through to getTokenCounter which loads from disk
}
}
if (task.bpeRanksJson && !isTokenCounterInitialized(task.encoding)) {
try {
const bpeRanks = JSON.parse(task.bpeRanksJson);
initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
} catch {
// Fall through to getTokenCounter which loads from disk
}
}

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.


@yamadashy yamadashy closed this Apr 11, 2026