perf(core): Replace tiktoken WASM with gpt-tokenizer#1350
History
62b7861 perf(core): Replace tiktoken WASM with gpt-tokenizer (pure JS)
b392c00 Revert "perf(core): Match output chunk count to CPU cores instead of fixed 1000"
02cc1c9 perf(core): Match output chunk count to CPU cores instead of fixed 1000
bb67792 perf(core): Replace tiktoken WASM with gpt-tokenizer (pure JS)
Code Review
This pull request replaces the tiktoken library with gpt-tokenizer for token counting, transitioning to a pure JavaScript implementation. Key changes include the introduction of an asynchronous init() method in the TokenCounter class for lazy-loading encoding modules, updates to the configuration schema, and refactoring of metrics calculation logic to support the new async initialization. Feedback was provided regarding a misleading test name in the TokenCounter test suite where the description did not match the test's behavior.
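A minimal sketch of the lazy async `init()` pattern the summary describes, with the dynamic import stubbed so the example is self-contained (`encodingCache` and `loadEncoding` are illustrative names, not the actual repomix source):

```typescript
type Encode = (text: string) => number[];

// Module-level cache so each encoding module is loaded at most once.
const encodingCache = new Map<string, Encode>();

async function loadEncoding(encoding: string): Promise<Encode> {
  const cached = encodingCache.get(encoding);
  if (cached) return cached;
  // Real code would do something like:
  //   const mod = await import(`gpt-tokenizer/encoding/${encoding}`);
  // Here a trivial per-character "tokenizer" stands in for the module.
  const encode: Encode = (text) => Array.from(text, (ch) => ch.codePointAt(0) ?? 0);
  encodingCache.set(encoding, encode);
  return encode;
}

class TokenCounter {
  private encode: Encode | undefined;
  constructor(private readonly encoding: string) {}

  async init(): Promise<void> {
    this.encode = await loadEncoding(this.encoding);
  }

  countTokens(text: string): number {
    if (!this.encode) {
      throw new Error(`TokenCounter(${this.encoding}) used before init()`);
    }
    return this.encode(text).length;
  }
}
```

As in the PR, `countTokens` throws when called before `init()` resolves, and repeated initializations for the same encoding hit the module-level cache.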
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #1350      +/-   ##
==========================================
- Coverage   87.13%   87.00%   -0.14%
==========================================
  Files         116      116
  Lines        4393     4408      +15
  Branches     1020     1022       +2
==========================================
+ Hits         3828     3835       +7
- Misses        565      573       +8
Deploying repomix with Cloudflare Pages

| Latest commit: | d62bf84 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://846075d1.repomix.pages.dev |
| Branch Preview URL: | https://perf-swap-tiktoken-to-gpt-to.repomix.pages.dev |
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/core/metrics/TokenCounter.test.ts (1)
4-4: Consider removing unused logger mock. The logger mock doesn't appear to be used for any assertions in the current test suite. If it's only present to suppress console output during tests, consider whether it's still needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/metrics/TokenCounter.test.ts` at line 4, The vi.mock('../../../src/shared/logger') call in TokenCounter.test.ts appears unused; either remove this unused mock to keep tests clean (delete the vi.mock(...) line) or, if you intended to suppress or assert logging, replace it with an explicit mock usage/assertion against the shared logger export (e.g., spy on the logger methods used by TokenCounter) so the mock is actually referenced; update TokenCounter.test.ts accordingly to remove dead setup or exercise the mock.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/core/metrics/tokenCounterFactory.ts`:
- Around line 11-16: The getTokenCounter function can create duplicate
TokenCounter instances when concurrent callers race because tokenCounters is
only set after awaiting tokenCounter.init(); fix by storing an in-flight
initialization promise keyed by encoding before awaiting so subsequent calls
reuse it: e.g., when getTokenCounter sees no entry in tokenCounters, immediately
create a Promise that constructs new TokenCounter(encoding), calls init(), and
resolves the instance, store that promise in the map (or a separate
tokenCounterInits map) under the encoding, await the promise, then replace the
stored promise with the resolved TokenCounter instance; update references to
TokenCounter, getTokenCounter, tokenCounters (or add tokenCounterInits)
accordingly.
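The fix described above can be sketched as follows. The `TokenCounter` body here is a minimal stand-in, and `tokenCounterInits` mirrors the map name proposed in the comment; the key point is that the promise is stored synchronously, before any `await`, so concurrent callers share one in-flight initialization:

```typescript
class TokenCounter {
  initialized = false;
  constructor(readonly encoding: string) {}
  async init(): Promise<void> {
    // Simulate the async encoding-module load.
    await new Promise((resolve) => setTimeout(resolve, 10));
    this.initialized = true;
  }
}

// In-flight (and completed) initializations, keyed by encoding.
const tokenCounterInits = new Map<string, Promise<TokenCounter>>();

function getTokenCounter(encoding: string): Promise<TokenCounter> {
  let pending = tokenCounterInits.get(encoding);
  if (!pending) {
    // Store the promise before awaiting anything, so a second caller
    // arriving mid-initialization reuses it instead of racing.
    pending = (async () => {
      const counter = new TokenCounter(encoding);
      await counter.init();
      return counter;
    })();
    tokenCounterInits.set(encoding, pending);
  }
  return pending;
}
```

Because `Promise.all([getTokenCounter('x'), getTokenCounter('x')])` resolves to the same instance, the duplicate-construction race disappears without any locking.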
In `@tests/core/metrics/TokenCounter.test.ts`:
- Around line 79-83: Update the test title to accurately describe the behavior
being asserted: change the description string for the test that constructs new
TokenCounter('o200k_base') without calling init() and expects
countTokens('test') to throw, e.g., "should throw when counting tokens if not
initialized" so it matches the actual assertion that TokenCounter.countTokens
throws when not initialized; locate the test referencing TokenCounter and
countTokens and replace the misleading title accordingly.
---
Nitpick comments:
In `@tests/core/metrics/TokenCounter.test.ts`:
- Line 4: The vi.mock('../../../src/shared/logger') call in TokenCounter.test.ts
appears unused; either remove this unused mock to keep tests clean (delete the
vi.mock(...) line) or, if you intended to suppress or assert logging, replace it
with an explicit mock usage/assertion against the shared logger export (e.g.,
spy on the logger methods used by TokenCounter) so the mock is actually
referenced; update TokenCounter.test.ts accordingly to remove dead setup or
exercise the mock.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f1847047-ff89-46a9-8073-50fa5ec8c41b
⛔ Files ignored due to path filters (1)
package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (11)
- package.json
- src/config/configSchema.ts
- src/core/metrics/TokenCounter.ts
- src/core/metrics/calculateOutputMetrics.ts
- src/core/metrics/calculateSelectiveFileMetrics.ts
- src/core/metrics/tokenCounterFactory.ts
- src/core/metrics/workers/calculateMetricsWorker.ts
- tests/core/metrics/TokenCounter.test.ts
- tests/core/metrics/calculateMetrics.test.ts
- tests/core/metrics/diffTokenCount.test.ts
- tests/core/packager.test.ts
Force-pushed b392c00 to bb67792 (Compare)
Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token
counting, eliminating ~200ms WASM initialization overhead while keeping
the existing worker pool infrastructure for parallel processing.
Changes:
- Swap tiktoken dependency for gpt-tokenizer in package.json
- Rewrite TokenCounter to use gpt-tokenizer's async dynamic import
with lazy-loaded encoding modules cached at module level
- Add TOKEN_ENCODINGS constant with Zod enum validation in config
schema, replacing unsafe type assertion
- Use { disallowedSpecial: new Set() } to match tiktoken's
encode(content, [], []) behavior (treat all text as plain text)
- Add p50k_edit encoding for backward compatibility
- Update worker to handle async getTokenCounter initialization
- Rewrite tests to use real gpt-tokenizer with exact token counts
The worker pool, parallel chunk processing, and pre-warming
infrastructure are preserved — only the underlying tokenizer changes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
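The `TOKEN_ENCODINGS` change above can be sketched without Zod: the essential move is a runtime membership check that narrows the string type, replacing the unchecked `val as TiktokenEncoding` cast. `parseEncoding` is a hypothetical helper, and the real constant may list more encodings than shown here:

```typescript
// Encoding names mentioned in the PR; the actual constant lives in the
// config schema and is validated there with z.enum().
const TOKEN_ENCODINGS = ['o200k_base', 'cl100k_base', 'p50k_edit'] as const;
type TokenEncoding = (typeof TOKEN_ENCODINGS)[number];

function parseEncoding(value: string): TokenEncoding {
  if ((TOKEN_ENCODINGS as readonly string[]).includes(value)) {
    return value as TokenEncoding; // safe: narrowed by the runtime check above
  }
  throw new Error(`Unsupported token encoding: ${value}`);
}
```

A `z.enum(TOKEN_ENCODINGS)` schema gives the same guarantee declaratively; either way, invalid config values fail loudly instead of flowing through a blind cast.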
Force-pushed bb67792 to 62b7861 (Compare)
Rename misleading test from 'should return 0 for errors when not initialized' to 'should throw when countTokens is called before init' to match the actual assertion (toThrow, not toBe(0)).

Add test for cl100k_base encoding to verify the dynamic import path works correctly for non-default encodings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR Review: perf(core): Replace tiktoken WASM with gpt-tokenizer

Overall this is a clean, focused PR. The migration is well-executed: the worker pool is preserved, tests are rewritten against the real tokenizer, and the type safety is improved with ….

Recommended

1. Config to Core dependency direction (…): The project guidelines say to "maintain feature-based directory structure and avoid dependencies between features." Importing … Suggestion: Extract …
2. Base schema still accepts arbitrary strings (…): The default schema (line 125) correctly uses … Suggestion: Update the base schema to …
3. Unsafe type assertion on dynamic import (…): No runtime check that … Suggestion: Add a runtime guard before the type assertion.

Not critical but worth noting

4. Race condition in loadEncoding (…)
5. Map key type (…)
6. Tests with hardcoded token counts (…): The exact count assertions will break if gpt-tokenizer adjusts tokenization in a patch release. These encodings are standardized, so it is unlikely, but worth being aware of.

Looks Good
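The runtime guard suggested in point 3 might look like the following sketch. The `EncodingModule` shape is an assumption for illustration (real gpt-tokenizer encoding modules export more than `encode`):

```typescript
interface EncodingModule {
  encode: (text: string) => number[];
}

// Assertion function: after this call, TypeScript treats `mod` as
// EncodingModule, but only because we actually checked at runtime.
function assertEncodingModule(mod: unknown): asserts mod is EncodingModule {
  const candidate = mod as Partial<EncodingModule> | null;
  if (typeof candidate?.encode !== 'function') {
    throw new Error('Dynamically imported module is not a valid encoding module');
  }
}

// Usage sketch (module path illustrative):
//   const mod: unknown = await import(`gpt-tokenizer/encoding/${encoding}`);
//   assertEncodingModule(mod); // mod is now safely typed
```

Compared with a bare `as EncodingModule` cast, a malformed or wrongly resolved import fails immediately with a clear message instead of surfacing later as a confusing `encode is not a function`.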
🤖 Generated with Claude Code
Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token counting, eliminating ~200ms WASM initialization overhead while keeping the existing worker pool infrastructure for parallel processing.
This is a minimal, focused replacement — only the tokenizer library changes. The worker pool, parallel chunk processing, and pre-warming infrastructure are all preserved.
Changes
- Swap `tiktoken` dependency for `gpt-tokenizer` in package.json
- Rewrite `TokenCounter` to use gpt-tokenizer's async dynamic import with lazy-loaded encoding modules
- Add `TOKEN_ENCODINGS` constant with `z.enum()` validation in config schema (replaces unsafe `val as TiktokenEncoding` cast)
- Use `{ disallowedSpecial: new Set() }` to match tiktoken's `encode(content, [], [])` behavior
- Add `p50k_edit` encoding for backward compatibility
- Update worker to handle async `getTokenCounter()` initialization

Checklist
- npm run test
- npm run lint