perf(core): Pre-load BPE data on main thread to eliminate redundant per-worker file I/O#1434
Conversation
Each metrics worker thread independently loaded gpt-tokenizer's BPE rank data (~3.6MB, 200K entries) from disk. Now the BPE data is loaded once on the main thread, serialized to a JSON string (~1.6MB), and passed to each worker via the warmup task. Workers deserialize the string and build the encoder instead of reading from disk.

Cherry-picked from fd6b625 (PR #1428)

Co-Authored-By: Claude <noreply@anthropic.com>
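The load-once, share-the-string pattern described above can be sketched as follows. This is a minimal illustration: `loadBpeRanks`, `createWarmupPayloads`, and the payload shape are assumptions for demonstration, not the exact repomix or gpt-tokenizer API, and the rank data is a two-entry stand-in for the real ~3.6MB file.

```typescript
// Minimal sketch of the pre-load pattern. All names are illustrative.
type BpeRanks = Record<string, number>;

// Hypothetical stand-in for the disk read that previously ran once per worker.
async function loadBpeRanks(_encoding: string): Promise<BpeRanks> {
  return { 'aGVsbG8=': 0, 'd29ybGQ=': 1 }; // two entries instead of ~200K
}

// Disk I/O and JSON.stringify happen exactly once on the main thread;
// every worker's warmup task then receives the same serialized string.
async function createWarmupPayloads(encoding: string, numWorkers: number) {
  const ranks = await loadBpeRanks(encoding);
  const bpeRanksJson = JSON.stringify(ranks);
  return Array.from({ length: numWorkers }, () => ({
    content: '',
    encoding,
    bpeRanksJson,
  }));
}
```

On the worker side, the string is parsed once and the encoder is built in memory, which is what replaces the per-worker file read.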
📝 Walkthrough

This change introduces BPE rank preloading across the metrics system. New APIs enable main-thread preloading of encoding data, serialization to JSON, and transmission to workers, which deserialize the payload and initialize token counters from the preloaded ranks. If deserialization fails, workers fall back to disk-based initialization.
Sequence Diagram

```mermaid
sequenceDiagram
    participant MT as Main Thread
    participant TC as TokenCounter
    participant WM as Warmup Manager
    participant W as Worker Thread
    MT->>TC: loadBpeRanks(encoding)
    TC-->>MT: BpeRanks (Promise)
    activate MT
    MT->>MT: JSON.stringify(bpeRanks)
    deactivate MT
    MT->>WM: createMetricsTaskRunner(numTasks, encoding)
    WM->>W: warmup with {content, encoding, bpeRanksJson}
    W->>W: JSON.parse(bpeRanksJson)
    W->>TC: initTokenCounterFromBpeRanks(encoding, bpeRanks)
    TC-->>W: TokenCounter initialized
    W->>W: countTokens(task)
    W-->>WM: token count result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Codecov Report
❌ Patch coverage is … Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1434      +/-   ##
==========================================
- Coverage   87.32%   87.03%   -0.29%
==========================================
  Files         117      117
  Lines        4426     4451      +25
  Branches     1022     1025       +3
==========================================
+ Hits         3865     3874       +9
- Misses        561      577      +16
```

☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (1)
src/core/metrics/workers/calculateMetricsWorker.ts (1)
44-55: Consider logging parse failures at debug level. The silent catch provides a graceful fallback to disk loading, but swallowing all errors makes debugging difficult if deserialization fails unexpectedly (e.g., corrupted data, memory issues).
💡 Optional: Add debug logging for parse failures
```diff
 if (task.bpeRanksJson) {
   try {
     const bpeRanks = JSON.parse(task.bpeRanksJson);
     initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
-  } catch {
+  } catch (error) {
+    logger.debug('Failed to parse pre-loaded BPE data, falling back to disk:', error);
     // Fall through to getTokenCounter which loads from disk
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/metrics/workers/calculateMetricsWorker.ts` around lines 44 - 55, The JSON.parse swallow in the bpe warmup path hides useful failure details; update the catch in the block that checks task.bpeRanksJson to catch the error (catch (err)) and emit a debug-level log including the error and contextual info (e.g., task.encoding and that bpeRanksJson failed) before falling back to getTokenCounter; use the module's logger (e.g., processLogger.debug or the existing debug logger) and keep calling initTokenCounterFromBpeRanks on success.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f3b01924-3306-4c50-b52f-a2380a70d68e
📒 Files selected for processing (5)
- src/core/metrics/TokenCounter.ts
- src/core/metrics/calculateMetrics.ts
- src/core/metrics/tokenCounterFactory.ts
- src/core/metrics/workers/calculateMetricsWorker.ts
- tests/core/metrics/calculateMetrics.test.ts
Code Review
This pull request optimizes the initialization of the TokenCounter by pre-loading BPE rank data on the main thread and passing it to worker threads via IPC, significantly reducing per-worker file I/O overhead. The review feedback highlights a potential race condition regarding task queuing and suggests adding a guard function to prevent redundant JSON parsing in worker threads. I have included the suggested improvements to the worker initialization logic to ensure efficiency.
```typescript
const warmupPromise = bpeRanksJsonPromise.then((bpeRanksJson) =>
  Promise.all(
    Array.from({ length: maxThreads }, () =>
      taskRunner.run({ content: '', encoding, ...(bpeRanksJson != null && { bpeRanksJson }) }).catch(() => 0),
    ),
  ),
);
```
There is a potential race condition in how the warmup tasks are queued. createMetricsTaskRunner returns the taskRunner immediately, but the warmup tasks are only queued after the BPE data is loaded and serialized (which is an async operation). If the caller submits real tasks to the runner immediately after calling createMetricsTaskRunner, those tasks will likely be queued before the warmup tasks, causing the workers to perform the expensive disk I/O anyway.
To ensure the optimization is effective, the caller must await the warmupPromise before submitting real tasks, or the initialization logic should be adjusted to ensure warmup tasks are prioritized at the head of the queue.
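One way for a caller to avoid this ordering hazard is to await the warmup before queueing real work. The sketch below assumes the warmup promise is exposed to the caller alongside the runner; the interfaces `MetricsTask` and `TaskRunner` are illustrative shapes inferred from the snippet above, not the actual repomix API.

```typescript
// Hypothetical shapes inferred from the reviewed snippet.
interface MetricsTask { content: string; encoding: string; bpeRanksJson?: string }
interface TaskRunner { run(task: MetricsTask): Promise<number> }

// Await the warmup promise before queueing real work, so every worker has
// already received its BPE payload and never falls back to disk I/O.
async function countAllTokens(
  runner: TaskRunner,
  warmupPromise: Promise<unknown>,
  contents: string[],
  encoding: string,
): Promise<number[]> {
  await warmupPromise;
  return Promise.all(contents.map((content) => runner.run({ content, encoding })));
}
```

The alternative mentioned above, prioritizing warmup tasks at the head of the queue, would keep the runner's public API unchanged but requires the task queue to support priority insertion.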
```diff
 import { logger, setLogLevelByWorkerData } from '../../../shared/logger.js';
 import type { TokenEncoding } from '../TokenCounter.js';
-import { freeTokenCounters, getTokenCounter } from '../tokenCounterFactory.js';
+import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks } from '../tokenCounterFactory.js';
```
Import the new guard function to check for initialization status.
```diff
-import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks } from '../tokenCounterFactory.js';
+import { freeTokenCounters, getTokenCounter, initTokenCounterFromBpeRanks, isTokenCounterInitialized } from '../tokenCounterFactory.js';
```
```typescript
if (task.bpeRanksJson) {
  try {
    const bpeRanks = JSON.parse(task.bpeRanksJson);
    initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
  } catch {
    // Fall through to getTokenCounter which loads from disk
  }
}
```
The worker currently parses the bpeRanksJson string every time it's provided in a task. While it's primarily intended for warmup tasks, adding a guard check here prevents expensive and redundant JSON parsing (of a ~1.6MB string) if multiple tasks with BPE data are received by the same worker before it has finished initializing.
```diff
-if (task.bpeRanksJson) {
+if (task.bpeRanksJson && !isTokenCounterInitialized(task.encoding)) {
   try {
     const bpeRanks = JSON.parse(task.bpeRanksJson);
     initTokenCounterFromBpeRanks(task.encoding, bpeRanks);
   } catch {
     // Fall through to getTokenCounter which loads from disk
   }
 }
```
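A module-level guard like the suggested `isTokenCounterInitialized` could look like this inside `tokenCounterFactory.ts`. This is a sketch only: the factory's real internals are not shown in this diff, and the encoder-construction step is elided.

```typescript
// Sketch of a per-encoding initialization guard. Only the guard behavior
// suggested above is shown; encoder construction is elided.
type BpeRanks = Record<string, number>;

const initializedEncodings = new Set<string>();

export function isTokenCounterInitialized(encoding: string): boolean {
  return initializedEncodings.has(encoding);
}

export function initTokenCounterFromBpeRanks(encoding: string, _bpeRanks: BpeRanks): void {
  if (initializedEncodings.has(encoding)) return; // skip redundant ~1.6MB rebuilds
  // ... build and cache the encoder from _bpeRanks here ...
  initializedEncodings.add(encoding);
}
```

With both the guard in the `if` condition (skipping `JSON.parse`) and the early return inside `init` (skipping encoder construction), a worker pays the deserialization cost at most once per encoding.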
Summary
Cherry-picked from fd6b625 (PR #1428)
Test plan