refactor(metrics): Replace tiktoken with gpt-tokenizer#1245

Closed
yamadashy wants to merge 8 commits into main from refactor/replace-tiktoken-with-gpt-tokenizer

Conversation

@yamadashy (Owner) commented Mar 19, 2026

Replace the WASM-based tiktoken library with gpt-tokenizer, a pure JavaScript BPE tokenizer implementation. This eliminates the native/WASM binary dependency while maintaining identical token count results across all encodings.

Changes

  • gpt-tokenizer added as production dependency, tiktoken fully removed
  • New TokenEncoding type replaces TiktokenEncoding from tiktoken
  • TokenCounter changed to async factory (static async create()) using resolveEncodingAsync to dynamically import only the needed BPE encoding data (~2.2MB for o200k_base) instead of all encodings (~4.1MB)
  • tokenCounterFactory / calculateMetricsWorker updated for async initialization
  • Encoding validation added to config schema (z.enum() instead of unchecked string cast)
  • Updated website/server Dockerfile and bundle script for gpt-tokenizer
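
The async-factory change can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual code: `loadEncoding` is a toy stand-in for gpt-tokenizer's `resolveEncodingAsync` dynamic import, and the tokenizer logic is deliberately fake.

```typescript
type TokenEncoding = 'o200k_base' | 'cl100k_base';

interface Encoding {
  encode(text: string): number[];
}

// Hypothetical lazy loader: in the real PR this dynamically imports only the
// requested BPE encoding data instead of all encodings.
async function loadEncoding(name: TokenEncoding): Promise<Encoding> {
  return {
    // Toy tokenizer: one "token" per whitespace-separated word.
    encode: (text: string) => text.split(/\s+/).filter(Boolean).map((_, i) => i),
  };
}

class TokenCounter {
  // Private constructor forces creation through the async factory.
  private constructor(private readonly encoding: Encoding) {}

  static async create(name: TokenEncoding): Promise<TokenCounter> {
    const encoding = await loadEncoding(name);
    return new TokenCounter(encoding);
  }

  countTokens(content: string): number {
    return this.encoding.encode(content).length;
  }
}
```

Callers such as the token counter factory then `await TokenCounter.create(...)` once per encoding instead of constructing synchronously.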

End-to-End Benchmark (full repository)

Environment | tiktoken (main) | gpt-tokenizer (this PR) | Diff
Linux x86 (median, 7 runs) | 7903 ms | 4186 ms | -47%
macOS M2 (mean ± σ, hyperfine, 10 runs) | 1.353 s ± 0.069 s | 1.355 s ± 0.045 s | same (1.00x)

Substantially faster on Linux (CI, Cloud Run); on par on macOS (local CLI).

Encoding Compatibility

Encoding | tiktoken | gpt-tokenizer | Notes
o200k_base | yes | yes | Default; GPT-4o / o1 / o3
cl100k_base | yes | yes | GPT-4, GPT-3.5-turbo
p50k_base | yes | yes | Legacy
p50k_edit | yes | yes | Legacy
r50k_base | yes | yes | Legacy
o200k_harmony | no | yes | Added: open-weight models
gpt2 | yes | no | Dropped: very old, not used by modern LLMs

The only encoding lost is gpt2 (GPT-2 era), which is not used by any current model.

Benefits

  • No WASM/native binary dependency - simpler build and deployment
  • No explicit resource cleanup (free()) needed
  • Lazy loading of BPE data via dynamic import (only loads requested encoding)
  • Runtime validation of encoding names in config
  • Significant speedup on Linux x86 (Cloud Run / CI)
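
The runtime-validation point can be illustrated without zod. The PR itself uses `z.enum(tokenEncodings)`; the following dependency-free sketch shows the same idea of deriving the type from a single runtime list so the two cannot drift apart.

```typescript
// The encoding list is the single source of truth; the union type is derived
// from it, so the compile-time type and the runtime check stay in sync.
const tokenEncodings = [
  'o200k_base',
  'o200k_harmony',
  'cl100k_base',
  'p50k_base',
  'p50k_edit',
  'r50k_base',
] as const;

type TokenEncoding = (typeof tokenEncodings)[number];

// Narrow an untrusted config value to TokenEncoding, or fail with a clear message.
function parseTokenEncoding(value: string): TokenEncoding {
  if ((tokenEncodings as readonly string[]).includes(value)) {
    return value as TokenEncoding;
  }
  throw new Error(
    `Invalid token encoding "${value}". Expected one of: ${tokenEncodings.join(', ')}`,
  );
}
```

With zod, the same check collapses to `z.enum(tokenEncodings)` inside the config schema.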

Checklist

  • Run npm run test
  • Run npm run lint


@codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 87.24%. Comparing base (ad7abc9) to head (37f632b).
⚠️ Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
src/core/metrics/tokenCounterFactory.ts | 87.50% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1245      +/-   ##
==========================================
+ Coverage   87.18%   87.24%   +0.05%     
==========================================
  Files         115      116       +1     
  Lines        4324     4328       +4     
  Branches     1002     1004       +2     
==========================================
+ Hits         3770     3776       +6     
+ Misses        554      552       -2     

☔ View full report in Codecov by Sentry.


@cloudflare-workers-and-pages bot commented Mar 19, 2026

Deploying repomix with Cloudflare Pages

Latest commit: 37f632b
Status: ✅  Deploy successful!
Preview URL: https://92d7f332.repomix.pages.dev
Branch Preview URL: https://refactor-replace-tiktoken-wi.repomix.pages.dev


@devin-ai-integration bot (Contributor) left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.



@claude bot (Contributor) commented Mar 20, 2026

PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer

Overall this is a well-executed migration with impressive benchmark results. The pre-built encoding sharing via structured clone is a clever optimization. A few items to consider:

Fragility of Internal API Access

The encodingCache.ts module accesses gpt-tokenizer internals via type assertions (bytePairEncodingCoreProcessor, bytePairRankDecoder, etc.). This couples tightly to implementation details that can change without notice in a minor/patch release.

Details

Both preBuildEncodingData and restoreEncodingFromData rely on internal property names like bytePairEncodingCoreProcessor, bytePairStringRankEncoder, mergeCacheSize, etc. These are not part of gpt-tokenizer's public API.

Mitigations to consider:

  • Pin gpt-tokenizer to an exact version (e.g., "3.4.0" instead of "^3.4.0") to prevent silent breakage on update
  • Add a smoke test that validates the pre-build/restore round-trip produces correct token counts (not just mocked)
  • Add a comment documenting which gpt-tokenizer version these internals were verified against
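
The suggested non-mocked round-trip check can be sketched as a golden-count test. Everything here is illustrative: `countTokens` is a toy stand-in for the real gpt-tokenizer-backed counter, and the golden table holds counts recorded from the stand-in, not real tiktoken outputs. A real test would pin counts produced by the actual tokenizer once and assert they never change.

```typescript
// Toy stand-in tokenizer: one token per whitespace-separated word. A real
// test would call the gpt-tokenizer-backed TokenCounter here instead.
function countTokens(content: string): number {
  return content.split(/\s+/).filter(Boolean).length;
}

// Golden table: input → expected token count (recorded once, then pinned).
const goldenCounts: ReadonlyArray<[string, number]> = [
  ['hello world', 2],
  ['const x = 1;', 4],
  ['', 0],
];

// Returns a list of human-readable failures; empty means all counts match.
function runGoldenTest(): string[] {
  const failures: string[] = [];
  for (const [input, expected] of goldenCounts) {
    const actual = countTokens(input);
    if (actual !== expected) {
      failures.push(`"${input}": expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}
```

Such a test would catch silent breakage if a dependency update changed tokenization behavior.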

TokenEncoding Type Maintenance

The hand-maintained TokenEncoding union type in tokenEncoding.ts could drift from what gpt-tokenizer actually supports.

Details

If gpt-tokenizer adds or removes an encoding, this type won't reflect it. Consider deriving it from the library if possible, or adding a comment noting it must be kept in sync.

Also, o200k_harmony is listed — verify this encoding is actually supported by gpt-tokenizer v3.4.0.

extraWorkerData Spreading Behavior

In processConcurrency.ts:88, extraWorkerData is spread into workerData with ...extraWorkerData. If extraWorkerData contained a workerType or logLevel key, it would silently override the required fields.

Details

The spread is currently:

workerData: {
  workerType,
  logLevel: logger.getLogLevel(),
  ...extraWorkerData,
},

Consider spreading extraWorkerData first, or validating that it doesn't conflict:

workerData: {
  ...extraWorkerData,
  workerType,
  logLevel: logger.getLogLevel(),
},

This is low risk since the caller is internal, but defensive ordering is cheap.

freeTokenCounters Is Now Misleading

The function logs "Freed TokenCounter resources" but does nothing (since free() is a no-op). Consider simplifying to just tokenCounters.clear() without the loop/log, or updating the log message.

bytePairRankDecoder Typed as unknown

In the EncodingData interface, bytePairRankDecoder is typed unknown while all other fields have specific types. If the actual type is known, specifying it would improve type safety and catch serialization issues earlier.

CI Status

Most checks are passing. Some Windows and macOS tests are still pending — worth confirming those pass before merge, especially since this removes a native/WASM dependency which could have platform-specific behavior.

Premortem Analysis

Scenario | Risk | Mitigation
gpt-tokenizer patch release changes internal property names | High — silent breakage in pre-build path | Pin exact version; add integration test for round-trip
Structured clone fails for RegExp or Map in certain Node versions | Low — structured clone supports these since Node 17+ | Already require Node 20+
child_process runtime doesn't receive workerData | Medium — pre-built data silently unused | Already handled: falls back to slow path. But no warning is logged when preBuiltEncodingData is undefined in worker_threads mode
Token counts diverge from tiktoken for edge cases | Low — benchmarks show 100% match | Consider keeping a small set of golden-file token count tests

Summary

Strong PR — the performance wins are substantial and well-documented. The main concern is the fragility of reaching into gpt-tokenizer internals for the pre-build optimization. Pinning the dependency version and adding a round-trip integration test would significantly reduce the risk of silent breakage.

@claude bot (Contributor) commented Mar 20, 2026

PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer (Updated)

This is a clean, well-executed migration. The final state after the 5-commit iteration — removing the pre-built encoding cache and adding z.enum() validation — lands in a good place. The code is simpler and more maintainable than the intermediate states.

What looks good

  • z.enum(tokenEncodings) in configSchema.ts — addresses the earlier CodeRabbit concern about unchecked string casts. Runtime validation now catches invalid encoding names at config parse time.
  • Simplified freeTokenCounters — correctly reduced to just tokenCounters.clear() with an accurate log message. No misleading "freed resources" language.
  • Removal of encodingCache.ts — good call removing the internal API access to gpt-tokenizer. The fragility wasn't worth the ~120ms improvement given Tinypool reuses workers.
  • Tests updated properly — mocks correctly target the new gpt-tokenizer/GptEncoding and gpt-tokenizer/resolveEncoding subpath imports.

Items to verify

1. o200k_harmony encoding validity

The tokenEncodings array in tokenEncoding.ts includes o200k_harmony. This encoding isn't widely documented. Please verify it's actually supported by gpt-tokenizer@3.4.0 — if it's not, users who specify it in their config will pass Zod validation but hit a runtime error in GptEncoding.getEncodingApi().

A quick smoke test (await TokenCounter.create('o200k_harmony')) would confirm.

2. Base schema still accepts any string for encoding

The repomixConfigBaseSchema (line 70-74) still uses z.string().optional() for tokenCount.encoding, while repomixConfigDefaultSchema uses z.enum(tokenEncodings). This is likely intentional (loose validation for file configs, strict for defaults), but it means an invalid encoding in repomix.config.json could pass repomixConfigBaseSchema validation but fail when merged with defaults. Worth confirming this is the desired behavior or if the base schema should also validate.

3. new Set() created on every encode call

In TokenCounter.ts:28, { disallowedSpecial: new Set() } allocates a new Set on every countTokens call. For high-throughput paths (many files), this creates unnecessary GC pressure. Consider hoisting it to a module-level constant:

const ENCODE_OPTIONS = { disallowedSpecial: new Set<string>() };
// ...
return this.encoding.encode(content, ENCODE_OPTIONS).length;

Minor optimization, but free and consistent with the performance focus of this PR.

4. Dockerfile comment still says "WASM files"

website/server/Dockerfile:31 comment says "Copy bundled server and WASM files" but tiktoken WASM is no longer used. The comment on line 35 was correctly updated, but line 31 still references WASM.

Premortem

Scenario | Risk | Notes
gpt-tokenizer major version bump changes GptEncoding.getEncodingApi signature | Medium | Standard semver risk. ^3.4.0 range is fine.
o200k_harmony not supported → runtime crash | Medium | Only if a user explicitly configures it. Verify once.
Token counts diverge from tiktoken on edge cases not covered by benchmarks | Low | Benchmarks show 100% match. Consider a golden-file test with a few known inputs if paranoia warrants it.

Summary

Clean migration with significant performance wins. The main open question is verifying o200k_harmony support. Everything else looks solid and ready to merge.


Review by Claude

@claude bot (Contributor) commented Mar 20, 2026

PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer (Review #3)

Great iteration — the final state is clean and well-structured. Previous concerns about encodingCache.ts internal API fragility, extraWorkerData spreading, and misleading log messages have all been addressed. Here's what remains:

Remaining items from prior reviews

1. o200k_harmony — please verify it works

tokenEncoding.ts lists o200k_harmony and it passes z.enum() validation, but if gpt-tokenizer@3.4.0 doesn't actually support it, users who configure it will get a runtime error from GptEncoding.getEncodingApi(). A quick manual test (await TokenCounter.create('o200k_harmony')) would confirm.

2. Base schema accepts any string for encoding

repomixConfigBaseSchema (line 70-72) uses z.string().optional() for tokenCount.encoding, while repomixConfigDefaultSchema uses z.enum(tokenEncodings). This means an invalid encoding like "banana" passes the base schema but fails when merged. This is likely intentional (loose file parsing, strict defaults), but worth confirming the merge logic handles the mismatch gracefully with a clear error message.

3. new Set() allocated on every countTokens call

In TokenCounter.ts:28, { disallowedSpecial: new Set() } creates a new Set per call. For repos with thousands of files, this adds unnecessary GC pressure. Trivial fix — hoist to a module-level constant:

const ENCODE_OPTIONS = { disallowedSpecial: new Set<string>() } as const;

4. Dockerfile line 30 still references WASM

website/server/Dockerfile:30 comment says "Copy bundled server and WASM files" — the tiktoken WASM reference is stale. Line 35's comment was correctly updated.

New observation

5. REPOMIX_WASM_DIR env var in Dockerfile

website/server/Dockerfile:47 still sets REPOMIX_WASM_DIR=/app/dist-bundled/wasm. If this was only used by tiktoken, it's now dead config. If it's used by tree-sitter WASM, ignore this — but the name could be clarified.

Summary

This is a solid migration with clear performance wins and good test coverage. The code is simpler and more maintainable than the tiktoken version. The items above are all minor — verifying o200k_harmony support is the most important one before merge.


Review by Claude

@yamadashy (Owner, Author) commented:

Verification results on a MacBook Pro M2

Running repomix locally (MacBook Pro M2) confirmed that the gpt-tokenizer version is actually slower.

Metric | tiktoken (main) | gpt-tokenizer (this PR)
repomix run time | 1.5–1.7 s | 1.8–2.0 s
TokenCounter init | ~200 ms | ~100 ms

TokenCounter initialization is faster, but overall run time is roughly 200–300 ms slower.

Suspected cause

In the CI environment (Linux x86), gpt-tokenizer (pure JS) was faster than tiktoken (WASM/Rust), but on Apple Silicon the result reverses. Likely reasons:

  • tiktoken's WASM is compiled from Rust and runs very efficiently on the M2's high-performance WASM execution engine
  • a pure-JS BPE implementation depends on V8's JIT and cannot reach near-native WASM speed

Regressing on users' local machines would defeat the purpose, so I'm closing this PR and proceeding with keeping tiktoken (#1243).

@claude bot (Contributor) commented Mar 20, 2026

PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer (Final)

Assessment

The code changes themselves are well-executed — clean migration, proper async factory pattern, good test updates, and the z.enum() validation improvement is a genuine win. However, the maintainer's real-world benchmarking on Apple Silicon showing 200-300ms regression is the decisive factor here.

Agreement with closure decision

The CI benchmark results (4.8-14.8x faster) were misleading because they measured single-thread encoding throughput on Linux x86, not end-to-end wall time on the platforms users actually run Repomix on. tiktoken's Rust-compiled WASM benefits significantly from Apple Silicon's WASM execution engine, which pure JS BPE can't match despite V8 JIT.

Worth preserving from this PR

One improvement from this PR that could be backported to the tiktoken-based code:

  • z.enum() validation for encoding — the current main branch still uses .transform((val) => val as TiktokenEncoding) which is an unchecked cast. Adding runtime validation with z.enum() against known tiktoken encodings would catch invalid config values early. This could be a small standalone PR.

Review by Claude


@claude bot (Contributor) commented Mar 20, 2026

PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer

Status: Agree with closure decision

The maintainer's Apple Silicon benchmarks (200-300ms regression) are the decisive factor. CI benchmarks on Linux x86 were misleading for real-world usage — tiktoken's Rust-compiled WASM runs very efficiently on Apple Silicon's WASM engine, outperforming pure JS BPE despite V8 JIT.

Code quality (for the record)

The implementation is well-executed: clean async factory pattern, proper z.enum() validation, correctly updated tests and Dockerfile. No bugs or security concerns.

Backport suggestion

The z.enum(tokenEncodings) validation replacing the unchecked .transform((val) => val as TiktokenEncoding) cast is a genuine improvement worth backporting to the tiktoken codebase as a standalone PR.


Review by Claude

@devin-ai-integration bot (Contributor) left a comment

Devin Review found 1 new potential issue.

View 8 additional findings in Devin Review.

"fast-xml-parser": "^5.4.1",
"git-url-parse": "^16.1.0",
"globby": "^16.1.1",
"gpt-tokenizer": "^3.4.0",
@devin-ai-integration bot (Contributor) commented:
🔴 README.md not updated after tiktoken → gpt-tokenizer migration (CONTRIBUTING.md violation)

CONTRIBUTING.md requires: "You have updated relevant documentation (especially README.md) if you've added or changed functionality." This PR replaces tiktoken with gpt-tokenizer but does not update the README.md, which still contains two now-incorrect references to tiktoken:

  • README.md:1360 describes tokenCount.encoding as using "OpenAI's tiktoken tokenizer" and links to tiktoken's GitHub/model.py.
  • README.md:1791 lists tiktoken as an external bundling dependency that "Loads WASM files dynamically at runtime" — but gpt-tokenizer is pure JavaScript and does not use WASM.

Both references are factually incorrect after this change and will mislead users.

Prompt for agents
Update README.md in two places to reflect the migration from tiktoken to gpt-tokenizer:

1. README.md line 1360: Change the tokenCount.encoding description from referencing tiktoken to referencing gpt-tokenizer. Replace the tiktoken links with appropriate gpt-tokenizer references. For example: "Token count encoding (e.g., o200k_base for GPT-4o, cl100k_base for GPT-4/3.5)."

2. README.md line 1791: Change the external bundling dependency from "tiktoken - Loads WASM files dynamically at runtime" to "gpt-tokenizer - Loads encoding data files at runtime" (since gpt-tokenizer is pure JS, not WASM-based).

yamadashy and others added 8 commits March 21, 2026 01:03
…ting

Replace the WASM-based tiktoken library with gpt-tokenizer, a pure JavaScript
BPE tokenizer implementation. This eliminates the native/WASM binary dependency
while maintaining identical token count results across all encodings.

Key changes:
- Replace tiktoken with gpt-tokenizer in production dependencies
- Move tiktoken to devDependencies (retained for benchmark comparison)
- Introduce TokenEncoding type to replace TiktokenEncoding from tiktoken
- Simplify TokenCounter by removing explicit free() resource management
  (gpt-tokenizer uses standard JS garbage collection)
- Add benchmark script (npm run benchmark-tokenizer) comparing both libraries

Benchmark results show gpt-tokenizer is 4.8-14.8x faster for encoding and
2.5x faster for initialization, with 100% token count consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-build the BPE encoder Map (200K entries, ~60-90ms) once in the main
thread and pass it to workers via workerData structured clone. Workers
restore the encoding instance by directly assigning the pre-built data,
bypassing the expensive BytePairEncodingCore constructor entirely.

- Add encodingCache.ts with preBuildEncodingData/restoreEncodingFromData
- Add extraWorkerData support to WorkerOptions/createWorkerPool
- calculateMetrics pre-builds encoding before creating the worker pool
- Workers detect and use pre-built data from workerData when available
- Add worker pool benchmark script comparing all three approaches

Worker init: 63ms → 0.037ms (~1700x faster)
E2E wall time: 20% faster than tiktoken WASM, 27% faster than scratch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pendency

The benchmarks served their purpose for the migration decision.
Remove them along with the tiktoken devDependency to keep
the install lighter (no more WASM binary download).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eferences

- Fix pre-built encoding data not being read in workers: Tinypool wraps
  workerData as [tinypoolPrivateData, userWorkerData], so access via
  workerData[1] is required (matching the pattern in setLogLevelByWorkerData)
- Update website/server/Dockerfile to copy gpt-tokenizer instead of tiktoken
- Update website/server/scripts/bundle.mjs external dependency list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… validation

Remove the encodingCache.ts module that accessed gpt-tokenizer internal
APIs to pre-build and share encoding data across workers. The complexity
and fragility of depending on private properties is not worth the ~120ms
E2E improvement — workers are pooled by Tinypool and pay init cost once.

Also:
- Add z.enum() validation for encoding names in configSchema (replaces
  unchecked string cast that could cause runtime errors)
- Export tokenEncodings array from tokenEncoding.ts for schema reuse
- Simplify freeTokenCounters to just clear the cache
- Remove extraWorkerData from WorkerOptions (no longer needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch from resolveEncoding (synchronous, imports all 4 encoding data
files ~4.1MB) to resolveEncodingAsync (dynamic import, loads only the
requested encoding). This avoids parsing unused BPE data on every
worker thread.

- TokenCounter: change constructor to private, add static async create()
  factory that uses resolveEncodingAsync for lazy loading
- tokenCounterFactory: make getTokenCounter async
- calculateMetricsWorker: await getTokenCounter()
- Update tests for async factory pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a double-check after await to prevent duplicate TokenCounter
creation if two concurrent calls interleave at the await point.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yamadashy force-pushed the refactor/replace-tiktoken-with-gpt-tokenizer branch from 62c7261 to 37f632b on March 20, 2026 16:03
@github-actions bot (Contributor) commented Mar 20, 2026

⚡ Performance Benchmark

Packing the repomix repository with node bin/repomix.cjs

Runner | PR | main | Diff
Ubuntu | 2.42s (±0.03s) | 2.55s (±0.02s) | -0.13s (-5.2%)
macOS | 1.94s (±0.14s) | 1.85s (±0.12s) | +0.09s (+4.9%)
Windows | 2.95s (±0.05s) | 3.03s (±0.06s) | -0.08s (-2.7%)
Details
  • Warmup: 2 runs (discarded)
  • Measurement: 10 runs (median)
  • ±: IQR (interquartile range) — the spread of the middle 50% of measurements
  • Workflow run

@claude bot (Contributor) commented Mar 20, 2026

PR Review: refactor(metrics): Replace tiktoken with gpt-tokenizer

Overall this is a clean, well-executed refactoring. The migration from tiktoken (WASM) to gpt-tokenizer (pure JS) reduces build complexity and eliminates native/WASM binary management. Up to 47 percent faster on Linux x86 with parity on macOS.

Highlights: Async factory pattern with lazy BPE loading, z.enum config validation replacing unsafe cast, thorough test updates, proper Dockerfile and bundle config changes.

Issues:

1. Race condition in tokenCounterFactory.ts - Multiple concurrent calls independently call TokenCounter.create(). Consider a Promise-based cache to deduplicate in-flight creation.

2. Base config schema (pre-existing) - tokenCount.encoding is z.string().optional() in base schema; consider z.enum for earlier validation.

3. Minor: free() no-op - Since TokenCounter is internal, the no-op could be removed entirely.
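
The Promise-based cache suggested in item 1 can be sketched as follows. The names mirror the PR's tokenCounterFactory, but this is an illustrative shape under assumed types, not the actual code; the key idea is caching the in-flight Promise rather than the resolved counter, so concurrent callers share one creation.

```typescript
type Encoding = string;

class TokenCounter {
  // Counts how many times create() ran, to demonstrate deduplication.
  static created = 0;

  private constructor(readonly encoding: Encoding) {}

  static async create(encoding: Encoding): Promise<TokenCounter> {
    TokenCounter.created++;
    // Simulate the async BPE-data load with a microtask-yielding delay.
    await new Promise((resolve) => setTimeout(resolve, 0));
    return new TokenCounter(encoding);
  }
}

const counterPromises = new Map<Encoding, Promise<TokenCounter>>();

function getTokenCounter(encoding: Encoding): Promise<TokenCounter> {
  // Store the Promise, not the resolved value: calls that arrive before the
  // first create() settles all receive the same in-flight Promise, so no
  // double-check after the await point is needed.
  let promise = counterPromises.get(encoding);
  if (!promise) {
    promise = TokenCounter.create(encoding);
    counterPromises.set(encoding, promise);
  }
  return promise;
}
```

Compared with the check-await-recheck approach, this removes the interleaving window entirely because the cache is populated synchronously before any await.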

Premortem: Low risk. Main concern is concurrent init (item 1). Dropping gpt2 unlikely to affect users. Dynamic imports properly externalized.

Verdict: Approve - Solid refactoring with good benchmarks and clean API design. Race condition in factory is the only actionable item.

@yamadashy yamadashy closed this Mar 20, 2026