perf(core): Replace tiktoken WASM with gpt-tokenizer#1343
perf(core): Replace tiktoken WASM with gpt-tokenizer#1343
Conversation
⚡ Performance Benchmark
Details
History
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1343 +/- ##
==========================================
- Coverage 87.13% 86.75% -0.39%
==========================================
Files 116 115 -1
Lines 4393 4386 -7
Branches 1020 1023 +3
==========================================
- Hits 3828 3805 -23
- Misses 565 581 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Deploying repomix with
|
| Latest commit: |
00e2c9f
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://7c163fe5.repomix.pages.dev |
| Branch Preview URL: | https://perf-replace-tiktoken-with-g.repomix.pages.dev |
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughReplaced Changes
Sequence DiagramsequenceDiagram
participant Packager as Packager
participant CalcMetrics as calculateMetrics
participant CalcFiles as calculateSelectiveFileMetrics
participant CalcGit as calculateGitDiffMetrics<br/>(and calculateGitLogMetrics)
participant CalcOutput as calculateOutputMetrics
participant Factory as getTokenCounter Factory
participant Counter as TokenCounter
Packager->>CalcMetrics: calculateMetrics()
CalcMetrics->>CalcFiles: calculateSelectiveFileMetrics()
CalcFiles->>Factory: await getTokenCounter(encoding)
Factory->>Counter: new TokenCounter(encoding)
Counter->>Counter: await init() [lazy-load gpt-tokenizer module]
Factory-->>CalcFiles: initialized TokenCounter
CalcFiles->>Counter: countTokens(content)
Counter-->>CalcFiles: token count
CalcFiles-->>CalcMetrics: fileTokenCounts
par Parallel Execution
CalcMetrics->>CalcOutput: calculateOutputMetrics()
CalcOutput->>Factory: await getTokenCounter(encoding)
Factory-->>CalcOutput: TokenCounter
CalcOutput->>Counter: countTokens(content)
Counter-->>CalcOutput: token count
CalcMetrics->>CalcGit: calculateGitDiffMetrics()
CalcGit->>Factory: await getTokenCounter(encoding)
Factory-->>CalcGit: TokenCounter
CalcGit->>Counter: countTokens(content)
Counter-->>CalcGit: token count
end
CalcMetrics-->>Packager: metrics result (main thread, no workers)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
b138e40 to
a0610c9
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
🔍 Code Review —
|
Code Review —
|
…rics pipeline Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token counting. This eliminates ~200ms WASM initialization overhead and removes the need for a dedicated worker pool for metrics calculation. - Swap tiktoken dependency for gpt-tokenizer in package.json - Rewrite TokenCounter to use gpt-tokenizer's encode API - Remove TaskRunner/worker pool infrastructure from calculateMetrics - Remove metricsTaskRunner pre-warming from packager pipeline - Update TokenEncoding type in config schema - Simplify all metrics calculation modules to use direct token counting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ve fallback Replace unsafe type assertion (`val as TokenEncoding`) in config schema with Zod `.enum()` validation using a shared `TOKEN_ENCODINGS` constant. This prevents arbitrary strings from reaching the dynamic import in `TokenCounter.loadEncoding()`. Also remove the ineffective fallback in `TokenCounter.countTokens()` that re-called the same cached function after it already threw, making the retry logic dead code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e condition Restore special token handling that was lost in the tiktoken-to-gpt-tokenizer migration. Files containing sequences like <|endoftext|> now correctly fall back to countTokens with allowedSpecial: 'all' instead of returning 0. Also memoize in-flight initialization promises in getTokenCounter() to prevent duplicate TokenCounter instances when called concurrently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…token counting
Replace the two-phase approach (fast path without options → fallback with
allowedSpecial: 'all') with a single call using { disallowedSpecial: new Set() }.
The previous approach caused benchmark regressions (macOS +51%, Windows +7.8%)
because gpt-tokenizer's default disallowedSpecial: 'all' throws on any text
matching special token patterns, triggering the costly fallback on many files.
Additionally, allowedSpecial: 'all' had incorrect semantics — it counted
<|endoftext|> as 1 control token instead of 7 text tokens, diverging from
the old tiktoken behavior. Using { disallowedSpecial: new Set() } treats
all content as plain text, matching tiktoken's encode(content, [], []).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid passing options to gpt-tokenizer's countTokens() on every call.
When options are provided, gpt-tokenizer calls processSpecialTokens()
each time instead of using its pre-cached defaultSpecialTokenConfig,
adding significant per-call overhead.
Use the no-options fast path by default, and fall back to
{ disallowedSpecial: new Set() } only for the rare files (~0.1%)
that contain special token sequences like <|endoftext|>.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tTokens V8 deoptimizes the entire try block when the catch block contains non-trivial logic (method calls, nested try/catch). Local benchmarks showed +25% regression (~1050ms → ~1330ms) from adding a fallback retry in the catch path. Keep countTokens() catch block minimal (log + return 0) to match the structure V8 optimizes well. Provide special token handling as a separate countTokensPlainText() method that callers can use when they know content contains special token sequences. Also revert tokenCounterFactory race condition fix — the IIFE-based Promise memoization added complexity without measurable benefit since getTokenCounter is effectively serialized by the pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
187cd03 to
8e30dab
Compare
… metrics
Add p50k_edit to TOKEN_ENCODINGS for backward compatibility with users
who had this tiktoken encoding in their config.
Add countTokensPlainText() method that uses { disallowedSpecial: new Set() }
to match tiktoken's encode(content, [], []) behavior, treating special
tokens like <|endoftext|> as ordinary text.
In the file metrics hot loop, retry with countTokensPlainText() when
countTokens() returns 0 for non-empty files, handling the rare (~0.1%)
files containing special token sequences without affecting hot path
performance.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8e30dab to
fffc7c8
Compare
…st assertions Use countTokensPlainText() for git diff and git log token counting to correctly handle special token sequences in diffs. These are cold paths (1-2 calls each) so the processSpecialTokens overhead is negligible. Strengthen test assertions from toBeGreaterThan(0) to exact token counts verified against gpt-tokenizer's o200k_base encoding. This catches encoding regressions and tokenizer implementation drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The output content often contains special token sequences like <|endoftext|> from packed source files (e.g., TokenCounter.ts comments, release notes). Using countTokens() on this content causes gpt-tokenizer to throw, silently returning 0 for the entire output — making Total Tokens show as 0 in the summary. Switch calculateOutputMetrics and calculateMetricsWorker to use countTokensPlainText() which matches tiktoken's encode(content, [], []) behavior, treating all content as ordinary text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review —
|
| export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => { | ||
| let tokenCounter = tokenCounters.get(encoding); | ||
| if (!tokenCounter) { | ||
| tokenCounter = new TokenCounter(encoding); | ||
| await tokenCounter.init(); | ||
| tokenCounters.set(encoding, tokenCounter); | ||
| } | ||
| return tokenCounter; |
There was a problem hiding this comment.
🟡 Race condition in async getTokenCounter allows duplicate TokenCounter creation
The getTokenCounter function has a classic TOCTOU (time-of-check-time-of-use) race condition. It checks the map, then awaits init(), then sets the map. If two concurrent callers request the same encoding before either completes, both see the map as empty, both create and initialize a TokenCounter, and the second one overwrites the first in the map. The first caller gets a TokenCounter instance that is not stored in the cache.
The comment in calculateMetrics.ts:58-59 acknowledges sequential ordering is needed ("File metrics must run first... then output/git metrics can run in parallel since they share the cached TokenCounter"), but if calculateSelectiveFileMetrics returns early at calculateSelectiveFileMetrics.ts:24-26 (when filesToProcess.length === 0), the counter is never cached and the subsequent Promise.all at calculateMetrics.ts:67 triggers concurrent calls to getTokenCounter from calculateOutputMetrics, calculateGitDiffMetrics, and calculateGitLogMetrics.
The fix is to cache the Promise<TokenCounter> instead of the resolved value, ensuring all concurrent callers share the same initialization promise.
| export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => { | |
| let tokenCounter = tokenCounters.get(encoding); | |
| if (!tokenCounter) { | |
| tokenCounter = new TokenCounter(encoding); | |
| await tokenCounter.init(); | |
| tokenCounters.set(encoding, tokenCounter); | |
| } | |
| return tokenCounter; | |
| export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => { | |
| let pending = tokenCounterPromises.get(encoding); | |
| if (!pending) { | |
| pending = (async () => { | |
| const tokenCounter = new TokenCounter(encoding); | |
| await tokenCounter.init(); | |
| tokenCounters.set(encoding, tokenCounter); | |
| return tokenCounter; | |
| })(); | |
| tokenCounterPromises.set(encoding, pending); | |
| } | |
| return pending; | |
| }; |
Was this helpful? React with 👍 or 👎 to provide feedback.
The calculateMetricsWorker and its 'calculateMetrics' worker type are no longer used — all token counting now runs on the main thread via gpt-tokenizer. Remove the worker file, its case branches in processConcurrency and unifiedWorker, and the associated tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the same retry pattern as file metrics: try countTokens() first (fast path, uses cached defaultSpecialTokenConfig), and only fall back to countTokensPlainText() when the result is 0 for non-empty content. This avoids processSpecialTokens() overhead for repos without special token sequences while still correctly counting tokens when they appear. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Background and Change of ApproachThis PR replaced tiktoken (WASM) with gpt-tokenizer (pure JS) and removed the worker pool to run token counting directly on the main thread. Issues Discovered
New ApproachBased on these findings, the optimal approach is to keep the worker pool and only swap the tokenizer library. Created #1350 with this approach:
Learnings from this PR (Zod enum validation, |
tiktoken uses WASM which adds ~200ms initialization overhead. gpt-tokenizer is a pure JS implementation that eliminates this cost and removes the need for a dedicated worker pool for metrics calculation. This simplifies the metrics pipeline by removing TaskRunner/worker pool infrastructure.
Benchmark improvements: Ubuntu -0.73s, Windows -0.74s, macOS -0.52s.
Changes
tiktokendependency forgpt-tokenizerin package.jsonTokenCounterto use gpt-tokenizer's encode APITaskRunner/worker pool infrastructure fromcalculateMetricsmetricsTaskRunnerpre-warming from packager pipelineTokenEncodingtype in config schemaChecklist
npm run testnpm run lint