refactor(metrics): Replace tiktoken with gpt-tokenizer#1245
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1245      +/-   ##
==========================================
+ Coverage   87.18%   87.24%    +0.05%
==========================================
  Files         115      116        +1
  Lines        4324     4328        +4
  Branches     1002     1004        +2
==========================================
+ Hits         3770     3776        +6
+ Misses        554      552        -2
```
Deploying repomix with

| | |
|---|---|
| Latest commit: | 37f632b |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://92d7f332.repomix.pages.dev |
| Branch Preview URL: | https://refactor-replace-tiktoken-wi.repomix.pages.dev |
PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer

Overall this is a well-executed migration with impressive benchmark results. The pre-built encoding sharing via structured clone is a clever optimization. A few items to consider:

**Fragility of internal API access.** The pre-build path depends on gpt-tokenizer internals. Mitigations to consider:
| Scenario | Risk | Mitigation |
|---|---|---|
| gpt-tokenizer patch release changes internal property names | High — silent breakage in pre-build path | Pin exact version; add integration test for round-trip |
| Structured clone fails for RegExp or Map in certain Node versions | Low — structured clone supports these since Node 17+ | Already require Node 20+ |
| child_process runtime doesn't receive workerData | Medium — pre-built data silently unused | Already handled: falls back to slow path. But no warning is logged when preBuiltEncodingData is undefined in worker_threads mode |
| Token counts diverge from tiktoken for edge cases | Low — benchmarks show 100% match | Consider keeping a small set of golden-file token count tests |
Summary
Strong PR — the performance wins are substantial and well-documented. The main concern is the fragility of reaching into gpt-tokenizer internals for the pre-build optimization. Pinning the dependency version and adding a round-trip integration test would significantly reduce the risk of silent breakage.
PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer (Updated)

This is a clean, well-executed migration. The final state after the 5-commit iteration — removing the pre-built encoding cache and adding z.enum() validation — addresses the earlier concerns.

What looks good

Items to verify
| Scenario | Risk | Notes |
|---|---|---|
| gpt-tokenizer major version bump changes GptEncoding.getEncodingApi signature | Medium | Standard semver risk. ^3.4.0 range is fine. |
| o200k_harmony not supported → runtime crash | Medium | Only if a user explicitly configures it. Verify once. |
| Token counts diverge from tiktoken on edge cases not covered by benchmarks | Low | Benchmarks show 100% match. Consider a golden-file test with a few known inputs if paranoia warrants it. |
Summary
Clean migration with significant performance wins. The main open question is verifying o200k_harmony support. Everything else looks solid and ready to merge.
Review by Claude
PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer (Review #3)

Great iteration — the final state is clean and well-structured. Previous concerns about accessing gpt-tokenizer internals have been addressed by removing the pre-built encoding cache.

Remaining items from prior reviews
Verification results on a MacBook Pro M2

Running repomix locally (MacBook Pro M2), I confirmed that the gpt-tokenizer version is actually slower. TokenCounter initialization is faster, but overall execution time was about 200-300ms slower.

Suspected cause

In the CI environment (Linux x86), gpt-tokenizer (pure JS) was faster than tiktoken (WASM/Rust), but on Apple Silicon the relationship is reversed. This is likely because:

Making things slower in users' actual execution environments (their local machines) defeats the purpose, so I've decided it is appropriate to close this PR and proceed with keeping tiktoken (#1243).
PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer (Final)

Assessment

The code changes themselves are well-executed — clean migration, proper async factory pattern, and good test updates.

Agreement with closure decision

The CI benchmark results (4.8-14.8x faster) were misleading because they measured single-thread encoding throughput on Linux x86, not end-to-end wall time on the platforms users actually run Repomix on. tiktoken's Rust-compiled WASM benefits significantly from Apple Silicon's WASM execution engine, which pure JS BPE can't match despite V8's JIT.

Worth preserving from this PR

One improvement from this PR could be backported to the tiktoken-based code:
PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer

Status: Agree with closure decision

The maintainer's Apple Silicon benchmarks (a 200-300ms regression) are the decisive factor. CI benchmarks on Linux x86 were misleading for real-world usage — tiktoken's Rust-compiled WASM runs very efficiently on Apple Silicon's WASM engine, outperforming pure JS BPE despite V8's JIT.

Code quality (for the record)

The implementation is well-executed: clean migration with a proper async factory pattern.

Backport suggestion
```
  "fast-xml-parser": "^5.4.1",
  "git-url-parse": "^16.1.0",
  "globby": "^16.1.1",
+ "gpt-tokenizer": "^3.4.0",
```
🔴 README.md not updated after tiktoken → gpt-tokenizer migration (CONTRIBUTING.md violation)
CONTRIBUTING.md requires: "You have updated relevant documentation (especially README.md) if you've added or changed functionality." This PR replaces tiktoken with gpt-tokenizer but does not update the README.md, which still contains two now-incorrect references to tiktoken:
- README.md:1360 describes `tokenCount.encoding` as using "OpenAI's tiktoken tokenizer" and links to tiktoken's GitHub/model.py.
- README.md:1791 lists `tiktoken` as an external bundling dependency that "Loads WASM files dynamically at runtime" — but `gpt-tokenizer` is pure JavaScript and does not use WASM.
Both references are factually incorrect after this change and will mislead users.
Prompt for agents
Update README.md in two places to reflect the migration from tiktoken to gpt-tokenizer:
1. README.md line 1360: Change the tokenCount.encoding description from referencing tiktoken to referencing gpt-tokenizer. Replace the tiktoken links with appropriate gpt-tokenizer references. For example: "Token count encoding (e.g., o200k_base for GPT-4o, cl100k_base for GPT-4/3.5)."
2. README.md line 1791: Change the external bundling dependency from "tiktoken - Loads WASM files dynamically at runtime" to "gpt-tokenizer - Loads encoding data files at runtime" (since gpt-tokenizer is pure JS, not WASM-based).
…ting

Replace the WASM-based tiktoken library with gpt-tokenizer, a pure JavaScript BPE tokenizer implementation. This eliminates the native/WASM binary dependency while maintaining identical token count results across all encodings.

Key changes:
- Replace tiktoken with gpt-tokenizer in production dependencies
- Move tiktoken to devDependencies (retained for benchmark comparison)
- Introduce TokenEncoding type to replace TiktokenEncoding from tiktoken
- Simplify TokenCounter by removing explicit free() resource management (gpt-tokenizer uses standard JS garbage collection)
- Add benchmark script (npm run benchmark-tokenizer) comparing both libraries

Benchmark results show gpt-tokenizer is 4.8-14.8x faster for encoding and 2.5x faster for initialization, with 100% token count consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-build the BPE encoder Map (200K entries, ~60-90ms) once in the main thread and pass it to workers via workerData structured clone. Workers restore the encoding instance by directly assigning the pre-built data, bypassing the expensive BytePairEncodingCore constructor entirely.

- Add encodingCache.ts with preBuildEncodingData/restoreEncodingFromData
- Add extraWorkerData support to WorkerOptions/createWorkerPool
- calculateMetrics pre-builds encoding before creating the worker pool
- Workers detect and use pre-built data from workerData when available
- Add worker pool benchmark script comparing all three approaches

Worker init: 63ms → 0.037ms (~1700x faster)
E2E wall time: 20% faster than tiktoken WASM, 27% faster than scratch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
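The pre-build-and-share step described above can be sketched as follows. This is a simplified illustration with hypothetical names and toy data — not the PR's actual encodingCache.ts, which reached into gpt-tokenizer's real internal structures. It shows the property that makes the approach work: structuredClone (global since Node 17) deep-copies Maps, so a table built once in the main thread arrives intact in a worker via workerData.

```typescript
// Hypothetical shape standing in for the pre-built encoding data.
interface PreBuiltEncodingData {
  bytePairRanks: Map<string, number>;
  specialTokens: Map<string, number>;
}

function preBuildEncodingData(): PreBuiltEncodingData {
  // In the real pre-build path this Map held ~200K BPE merge ranks.
  return {
    bytePairRanks: new Map([["he", 258], ["llo", 412]]),
    specialTokens: new Map([["<|endoftext|>", 199999]]),
  };
}

// structuredClone deep-copies Map entries, so the same object shape passed
// as workerData arrives intact (as an independent copy) in a worker thread.
const built = preBuildEncodingData();
const cloned = structuredClone(built);

console.log(cloned.bytePairRanks.get("he"));               // 258
console.log(cloned.bytePairRanks === built.bytePairRanks); // false — a copy
```

A worker receiving this data can assign it directly instead of re-parsing the BPE data files, which is where the ~1700x init speedup above came from.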
…pendency

The benchmarks served their purpose for the migration decision. Remove them along with the tiktoken devDependency to keep the install lighter (no more WASM binary download).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eferences

- Fix pre-built encoding data not being read in workers: Tinypool wraps workerData as [tinypoolPrivateData, userWorkerData], so access via workerData[1] is required (matching the pattern in setLogLevelByWorkerData)
- Update website/server/Dockerfile to copy gpt-tokenizer instead of tiktoken
- Update website/server/scripts/bundle.mjs external dependency list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
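The Tinypool unwrapping described in the first bullet can be sketched like this. The tuple shape is taken from the commit message; the helper name and field names are hypothetical, and a fallback is included for plain worker_threads usage where workerData is the user object itself:

```typescript
interface UserWorkerData {
  preBuiltEncodingData?: unknown;
  logLevel?: number;
}

// Tinypool passes workerData as [tinypoolPrivateData, userWorkerData];
// a worker must read index 1 to reach the data the caller supplied.
function getUserWorkerData(workerData: unknown): UserWorkerData {
  if (Array.isArray(workerData) && workerData.length >= 2) {
    return (workerData[1] as UserWorkerData) ?? {};
  }
  // Plain worker_threads: workerData is the user object directly.
  return (workerData as UserWorkerData) ?? {};
}

console.log(getUserWorkerData([{ internal: true }, { logLevel: 3 }]).logLevel); // 3
console.log(getUserWorkerData({ logLevel: 2 }).logLevel);                       // 2
```

Without this unwrapping, the worker reads Tinypool's private data instead of the user payload, which is exactly the "pre-built data silently unused" failure mode this commit fixed.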
… validation

Remove the encodingCache.ts module that accessed gpt-tokenizer internal APIs to pre-build and share encoding data across workers. The complexity and fragility of depending on private properties is not worth the ~120ms E2E improvement — workers are pooled by Tinypool and pay the init cost once.

Also:
- Add z.enum() validation for encoding names in configSchema (replaces an unchecked string cast that could cause runtime errors)
- Export tokenEncodings array from tokenEncoding.ts for schema reuse
- Simplify freeTokenCounters to just clear the cache
- Remove extraWorkerData from WorkerOptions (no longer needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
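A dependency-free sketch of the check this commit introduces (the actual code uses zod's z.enum over an exported tokenEncodings array; the encoding list below is illustrative — only o200k_base and cl100k_base are confirmed elsewhere in this thread):

```typescript
// Single source of truth: the type and the runtime check both derive from
// this list, mirroring the exported tokenEncodings array in tokenEncoding.ts.
const tokenEncodings = ["o200k_base", "cl100k_base"] as const;
type TokenEncoding = (typeof tokenEncodings)[number];

// Runtime validation instead of an unchecked `value as TokenEncoding` cast,
// so a typo in the config fails loudly at parse time, not deep in a worker.
function parseTokenEncoding(value: unknown): TokenEncoding {
  if (typeof value === "string" && (tokenEncodings as readonly string[]).includes(value)) {
    return value as TokenEncoding;
  }
  throw new Error(`Invalid token encoding: ${String(value)}`);
}

console.log(parseTokenEncoding("o200k_base")); // "o200k_base"
```

zod's z.enum(tokenEncodings) performs the same membership check but also integrates with the rest of the config schema and produces structured error messages.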
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from resolveEncoding (synchronous, imports all 4 encoding data files, ~4.1MB) to resolveEncodingAsync (dynamic import, loads only the requested encoding). This avoids parsing unused BPE data on every worker thread.

- TokenCounter: change constructor to private, add static async create() factory that uses resolveEncodingAsync for lazy loading
- tokenCounterFactory: make getTokenCounter async
- calculateMetricsWorker: await getTokenCounter()
- Update tests for async factory pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
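The private-constructor plus static async create() pattern described above can be sketched with a stub loader standing in for resolveEncodingAsync (the types and the toy tokenizer here are hypothetical; the real loader dynamically imports the requested encoding's BPE data file):

```typescript
type Encoder = { encode: (text: string) => number[] };

class TokenCounter {
  // Private: forces callers through the async factory, since the encoder
  // can only be obtained asynchronously (dynamic import of BPE data).
  private constructor(private readonly encoder: Encoder) {}

  static async create(loadEncoding: () => Promise<Encoder>): Promise<TokenCounter> {
    const encoder = await loadEncoding(); // lazy: pay the import cost on first use only
    return new TokenCounter(encoder);
  }

  countTokens(text: string): number {
    return this.encoder.encode(text).length;
  }
}

// Usage with a stub loader; a real loader would `await import(...)` the data file.
async function demo(): Promise<number> {
  const counter = await TokenCounter.create(async () => ({
    encode: (text) => text.split(/\s+/).filter(Boolean).map((_, i) => i), // toy whitespace tokenizer
  }));
  return counter.countTokens("hello brave new world"); // 4
}
```

The payoff is that a worker thread that only ever needs o200k_base never parses the other encodings' data at all.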
Add a double-check after await to prevent duplicate TokenCounter creation if two concurrent calls interleave at the await point. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
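The double-check this commit describes can be sketched as follows (a simplified illustration with hypothetical names). Note what it guarantees: if two callers interleave at the await, both may still build, but the second result is discarded, so every caller observes the same cached instance.

```typescript
type Counter = { id: number };

let cached: Counter | undefined;
let created = 0;

async function buildSlowly(): Promise<Counter> {
  await new Promise((resolve) => setTimeout(resolve, 10)); // simulate async BPE load
  created += 1;
  return { id: created };
}

async function getCounter(): Promise<Counter> {
  if (cached) return cached;
  const fresh = await buildSlowly();
  // Double-check: another caller may have finished while we were awaiting.
  // Without this, the two callers would end up holding different instances.
  if (cached) return cached;
  cached = fresh;
  return cached;
}

(async () => {
  const [a, b] = await Promise.all([getCounter(), getCounter()]);
  console.log(a === b); // true — both callers get the same instance
})();
```

A stricter alternative, caching the in-flight Promise itself, also avoids the duplicate build work; the double-check is the smaller change to existing code.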
62c7261 to 37f632b
⚡ Performance Benchmark

Packing the repomix repository with

Details
PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer

Overall this is a clean, well-executed refactoring. The migration from tiktoken (WASM) to gpt-tokenizer (pure JS) reduces build complexity and eliminates native/WASM binary management. Up to 47 percent faster on Linux x86, with parity on macOS.

Highlights: async factory pattern with lazy BPE loading, z.enum config validation replacing an unsafe cast, thorough test updates, proper Dockerfile and bundle config changes.

Issues:
1. Race condition in tokenCounterFactory.ts — multiple concurrent calls independently call TokenCounter.create(). Consider a Promise-based cache to deduplicate in-flight creation.
2. Base config schema (pre-existing) — tokenCount.encoding is z.string().optional() in the base schema; consider z.enum for earlier validation.
3. Minor: free() no-op — since TokenCounter is internal, the no-op could be removed entirely.

Premortem: low risk. The main concern is concurrent init (item 1). Dropping gpt2 is unlikely to affect users. Dynamic imports are properly externalized.

Verdict: Approve — solid refactoring with good benchmarks and clean API design. The race condition in the factory is the only actionable item.
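The Promise-based cache suggested in item 1 could look like this sketch (hypothetical names): by caching the in-flight Promise rather than the resolved instance, the first caller starts creation and every later caller awaits the same Promise, so the expensive build runs exactly once.

```typescript
type Counter = { id: number };

let inFlight: Promise<Counter> | undefined;
let created = 0;

function getCounter(): Promise<Counter> {
  // First caller kicks off creation; concurrent callers share the same Promise,
  // so the async build body below runs exactly once.
  inFlight ??= (async () => {
    await new Promise((resolve) => setTimeout(resolve, 10)); // simulate async BPE load
    created += 1;
    return { id: created };
  })();
  return inFlight;
}

(async () => {
  const [a, b] = await Promise.all([getCounter(), getCounter()]);
  console.log(a === b, created); // true 1 — one creation, one shared instance
})();
```

Compared with a double-check after the await, this also eliminates the duplicate build work, not just the duplicate cached instance; production code would additionally want to clear inFlight on rejection so a failed load can be retried.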
Replace the WASM-based `tiktoken` library with `gpt-tokenizer`, a pure JavaScript BPE tokenizer implementation. This eliminates the native/WASM binary dependency while maintaining identical token count results across all encodings.

Changes

- `gpt-tokenizer` added as a production dependency; `tiktoken` fully removed
- `TokenEncoding` type replaces `TiktokenEncoding` from tiktoken
- `TokenCounter` changed to an async factory (`static async create()`) using `resolveEncodingAsync` to dynamically import only the needed BPE encoding data (~2.2MB for o200k_base) instead of all encodings (~4.1MB)
- `tokenCounterFactory` / `calculateMetricsWorker` updated for async initialization
- Encoding names validated with `z.enum()` instead of an unchecked string cast

End-to-End Benchmark (full repository)
Significantly faster on Linux (CI, Cloud Run); on par on macOS (local CLI).
Encoding Compatibility
The only encoding lost is `gpt2` (GPT-2 era), which is not used by any current model.

Benefits
Checklist

- `npm run test`
- `npm run lint`