
perf(core): Replace tiktoken WASM with gpt-tokenizer #1350

Merged
yamadashy merged 2 commits into main from perf/swap-tiktoken-to-gpt-tokenizer on Mar 29, 2026

Conversation


@yamadashy yamadashy commented Mar 28, 2026

Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token counting, eliminating ~200ms WASM initialization overhead while keeping the existing worker pool infrastructure for parallel processing.

This is a minimal, focused replacement — only the tokenizer library changes. The worker pool, parallel chunk processing, and pre-warming infrastructure are all preserved.

Changes

  • Swap tiktoken dependency for gpt-tokenizer in package.json
  • Rewrite TokenCounter to use gpt-tokenizer's async dynamic import with lazy-loaded encoding modules
  • Add TOKEN_ENCODINGS constant with z.enum() validation in config schema (replaces unsafe val as TiktokenEncoding cast)
  • Use { disallowedSpecial: new Set() } to match tiktoken's encode(content, [], []) behavior
  • Add p50k_edit encoding for backward compatibility
  • Update worker to handle async getTokenCounter() initialization
  • Rewrite tests with exact token count assertions against real gpt-tokenizer
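The lazy-initialization flow described above can be sketched as follows. This is an illustrative sketch, not the actual TokenCounter source; the stand-in whitespace tokenizer replaces the real `await import('gpt-tokenizer/encoding/...')` call so the example stays self-contained:

```typescript
// Illustrative sketch of lazy, per-encoding tokenizer loading (assumption:
// not the real repomix TokenCounter). In the real code, loadEncoding would do
//   await import(`gpt-tokenizer/encoding/${encoding}`)
// and return the module's countTokens export; here a whitespace tokenizer
// stands in so the sketch runs without the dependency.
type CountTokensFn = (text: string) => number;

// Module-level cache: each encoding module is loaded at most once.
const encodingModules = new Map<string, CountTokensFn>();

async function loadEncoding(encoding: string): Promise<CountTokensFn> {
  const cached = encodingModules.get(encoding);
  if (cached) return cached;
  const countTokens: CountTokensFn = (text) =>
    text.split(/\s+/).filter(Boolean).length; // stand-in tokenizer
  encodingModules.set(encoding, countTokens);
  return countTokens;
}

class TokenCounter {
  private countTokensFn: CountTokensFn | undefined;

  constructor(private readonly encoding: string) {}

  // Async init replaces tiktoken's synchronous WASM setup.
  async init(): Promise<void> {
    this.countTokensFn = await loadEncoding(this.encoding);
  }

  countTokens(content: string): number {
    if (!this.countTokensFn) {
      throw new Error(`TokenCounter(${this.encoding}) not initialized; call init() first`);
    }
    return this.countTokensFn(content);
  }
}
```

Calling countTokens before init() throws, which is the behavior the rewritten tests assert against.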

Checklist

  • Run npm run test
  • Run npm run lint



github-actions bot commented Mar 28, 2026

⚡ Performance Benchmark

Latest commit: d62bf84 test(core): Fix test name and add cl100k_base encoding test
Status: ✅ Benchmark complete!
Ubuntu: 2.11s (±0.07s) → 2.04s (±0.03s) · -0.07s (-3.2%)
macOS: 1.04s (±0.05s) → 1.05s (±0.08s) · +0.00s (+0.3%)
Windows: 2.29s (±0.10s) → 2.23s (±0.11s) · -0.05s (-2.4%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run
History

62b7861 perf(core): Replace tiktoken WASM with gpt-tokenizer (pure JS)

Ubuntu: 1.97s (±0.01s) → 1.87s (±0.02s) · -0.10s (-5.0%)
macOS: 1.21s (±0.16s) → 1.22s (±0.25s) · +0.02s (+1.4%)
Windows: 2.31s (±0.21s) → 2.27s (±0.08s) · -0.04s (-1.9%)

b392c00 Revert "perf(core): Match output chunk count to CPU cores instead of fixed 1000"

Ubuntu: 2.01s (±0.01s) → 1.92s (±0.01s) · -0.09s (-4.6%)
macOS: 1.57s (±0.35s) → 1.68s (±0.34s) · +0.10s (+6.5%)
Windows: 2.25s (±0.02s) → 2.21s (±0.02s) · -0.04s (-2.0%)

02cc1c9 perf(core): Match output chunk count to CPU cores instead of fixed 1000

Ubuntu: 1.97s (±0.02s) → 1.80s (±0.03s) · -0.17s (-8.7%)
macOS: 1.06s (±0.08s) → 1.14s (±0.09s) · +0.08s (+7.9%)
Windows: 2.26s (±0.02s) → 2.22s (±0.02s) · -0.04s (-1.9%)

bb67792 perf(core): Replace tiktoken WASM with gpt-tokenizer (pure JS)

Ubuntu: 1.93s (±0.02s) → 1.80s (±0.03s) · -0.12s (-6.4%)
macOS: 1.08s (±0.08s) → 1.04s (±0.08s) · -0.04s (-3.3%)
Windows: 2.29s (±0.22s) → 2.23s (±0.04s) · -0.06s (-2.4%)

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the tiktoken library with gpt-tokenizer for token counting, transitioning to a pure JavaScript implementation. Key changes include the introduction of an asynchronous init() method in the TokenCounter class for lazy-loading encoding modules, updates to the configuration schema, and refactoring of metrics calculation logic to support the new async initialization. Feedback was provided regarding a misleading test name in the TokenCounter test suite, where the description did not match the test's behavior.


codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.00%. Comparing base (dbc7aee) to head (d62bf84).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1350      +/-   ##
==========================================
- Coverage   87.13%   87.00%   -0.14%     
==========================================
  Files         116      116              
  Lines        4393     4408      +15     
  Branches     1020     1022       +2     
==========================================
+ Hits         3828     3835       +7     
- Misses        565      573       +8     

☔ View full report in Codecov by Sentry.


cloudflare-workers-and-pages bot commented Mar 28, 2026

Deploying repomix with Cloudflare Pages

Latest commit: d62bf84
Status: ✅  Deploy successful!
Preview URL: https://846075d1.repomix.pages.dev
Branch Preview URL: https://perf-swap-tiktoken-to-gpt-to.repomix.pages.dev

View logs


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/core/metrics/TokenCounter.test.ts (1)

4-4: Consider removing unused logger mock.

The logger mock doesn't appear to be used for any assertions in the current test suite. If it's only present to suppress console output during tests, consider whether it's still needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/metrics/TokenCounter.test.ts` at line 4, The
vi.mock('../../../src/shared/logger') call in TokenCounter.test.ts appears
unused; either remove this unused mock to keep tests clean (delete the
vi.mock(...) line) or, if you intended to suppress or assert logging, replace it
with an explicit mock usage/assertion against the shared logger export (e.g.,
spy on the logger methods used by TokenCounter) so the mock is actually
referenced; update TokenCounter.test.ts accordingly to remove dead setup or
exercise the mock.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/metrics/tokenCounterFactory.ts`:
- Around line 11-16: The getTokenCounter function can create duplicate
TokenCounter instances when concurrent callers race because tokenCounters is
only set after awaiting tokenCounter.init(); fix by storing an in-flight
initialization promise keyed by encoding before awaiting so subsequent calls
reuse it: e.g., when getTokenCounter sees no entry in tokenCounters, immediately
create a Promise that constructs new TokenCounter(encoding), calls init(), and
resolves the instance, store that promise in the map (or a separate
tokenCounterInits map) under the encoding, await the promise, then replace the
stored promise with the resolved TokenCounter instance; update references to
TokenCounter, getTokenCounter, tokenCounters (or add tokenCounterInits)
accordingly.

In `@tests/core/metrics/TokenCounter.test.ts`:
- Around line 79-83: Update the test title to accurately describe the behavior
being asserted: change the description string for the test that constructs new
TokenCounter('o200k_base') without calling init() and expects
countTokens('test') to throw, e.g., "should throw when counting tokens if not
initialized" so it matches the actual assertion that TokenCounter.countTokens
throws when not initialized; locate the test referencing TokenCounter and
countTokens and replace the misleading title accordingly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f1847047-ff89-46a9-8073-50fa5ec8c41b

📥 Commits

Reviewing files that changed from the base of the PR and between 3e1fc1a and bb67792.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (11)
  • package.json
  • src/config/configSchema.ts
  • src/core/metrics/TokenCounter.ts
  • src/core/metrics/calculateOutputMetrics.ts
  • src/core/metrics/calculateSelectiveFileMetrics.ts
  • src/core/metrics/tokenCounterFactory.ts
  • src/core/metrics/workers/calculateMetricsWorker.ts
  • tests/core/metrics/TokenCounter.test.ts
  • tests/core/metrics/calculateMetrics.test.ts
  • tests/core/metrics/diffTokenCount.test.ts
  • tests/core/packager.test.ts


yamadashy force-pushed the perf/swap-tiktoken-to-gpt-tokenizer branch from b392c00 to bb67792 on March 29, 2026 at 02:52

Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token
counting, eliminating ~200ms WASM initialization overhead while keeping
the existing worker pool infrastructure for parallel processing.

Changes:
- Swap tiktoken dependency for gpt-tokenizer in package.json
- Rewrite TokenCounter to use gpt-tokenizer's async dynamic import
  with lazy-loaded encoding modules cached at module level
- Add TOKEN_ENCODINGS constant with Zod enum validation in config
  schema, replacing unsafe type assertion
- Use { disallowedSpecial: new Set() } to match tiktoken's
  encode(content, [], []) behavior (treat all text as plain text)
- Add p50k_edit encoding for backward compatibility
- Update worker to handle async getTokenCounter initialization
- Rewrite tests to use real gpt-tokenizer with exact token counts

The worker pool, parallel chunk processing, and pre-warming
infrastructure are preserved — only the underlying tokenizer changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yamadashy force-pushed the perf/swap-tiktoken-to-gpt-tokenizer branch from bb67792 to 62b7861 on March 29, 2026 at 12:28
Rename misleading test from 'should return 0 for errors when not
initialized' to 'should throw when countTokens is called before init'
to match the actual assertion (toThrow, not toBe(0)).

Add test for cl100k_base encoding to verify the dynamic import path
works correctly for non-default encodings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude bot commented Mar 29, 2026

PR Review: perf(core): Replace tiktoken WASM with gpt-tokenizer

Overall this is a clean, focused PR. The migration is well executed: the worker pool is preserved, tests are rewritten against the real tokenizer, and type safety is improved with z.enum() validation. A few items worth considering:

Recommended

1. Config to Core dependency direction (src/config/configSchema.ts:2)

The project guidelines say to "maintain feature-based directory structure and avoid dependencies between features." Importing TOKEN_ENCODINGS from core/metrics/TokenCounter.js into config/configSchema.ts creates a config-to-core/metrics dependency, which inverts the typical direction (core depends on config, not the reverse). Previously, the config imported a type from the external tiktoken package, a neutral dependency.

Suggestion: Extract TOKEN_ENCODINGS and TokenEncoding to a shared location (e.g., src/shared/tokenEncodings.ts) and have both config and core/metrics import from there.

2. Base schema still accepts arbitrary strings (src/config/configSchema.ts:72)

The default schema (line 125) correctly uses z.enum(TOKEN_ENCODINGS), but the base schema (repomixConfigBaseSchema) on line 72 still uses z.string().optional() for tokenCount.encoding. This means user config files can pass any string (e.g., "banana"), which would only fail at runtime during the dynamic import.

Suggestion: Update the base schema to z.enum(TOKEN_ENCODINGS).optional() for early validation at config parse time.
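For illustration, the same early-validation idea can be sketched without zod (hypothetical helper, and the exact encoding list below is an assumption; the actual schema uses z.enum(TOKEN_ENCODINGS)):

```typescript
// Sketch of validating tokenCount.encoding at config-parse time rather than
// failing later inside a dynamic import. Illustrative only: the PR does this
// with z.enum(TOKEN_ENCODINGS) in the zod schema, and this encoding list is
// an assumption.
const TOKEN_ENCODINGS = ['o200k_base', 'cl100k_base', 'p50k_base', 'p50k_edit', 'r50k_base'] as const;
type TokenEncoding = (typeof TOKEN_ENCODINGS)[number];

function parseEncoding(value: string): TokenEncoding {
  if (!(TOKEN_ENCODINGS as readonly string[]).includes(value)) {
    // Fail fast with the list of valid encodings instead of a confusing
    // import error at token-counting time.
    throw new Error(
      `Invalid tokenCount.encoding "${value}"; expected one of: ${TOKEN_ENCODINGS.join(', ')}`,
    );
  }
  return value as TokenEncoding;
}
```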

3. Unsafe type assertion on dynamic import (src/core/metrics/TokenCounter.ts:31)

No runtime check that mod.countTokens exists or is a function after the dynamic import. If gpt-tokenizer changes its export shape in a future version, this silently produces undefined and later throws a confusing error.

Suggestion: Add a runtime guard before the type assertion.
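A minimal sketch of such a guard (assumed shape; `mod` stands in for the dynamically imported encoding module):

```typescript
// Guard the dynamically imported module's shape before trusting it.
// Illustrative sketch: `mod` stands in for the result of
// `await import('gpt-tokenizer/encoding/...')`.
type CountTokensFn = (text: string) => number;

function assertCountTokens(mod: Record<string, unknown>, encoding: string): CountTokensFn {
  const fn = mod.countTokens;
  if (typeof fn !== 'function') {
    // Fail with a clear message instead of letting `undefined` propagate
    // and throw somewhere far from the import site.
    throw new Error(`gpt-tokenizer module for "${encoding}" does not export a countTokens function`);
  }
  return fn as CountTokensFn;
}
```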

Not critical but worth noting

4. Race condition in loadEncoding (TokenCounter.ts:21-38)

loadEncoding is async but does not guard against concurrent calls for the same encoding. If multiple callers invoke it simultaneously before the first completes, the import executes multiple times. This is mostly harmless (pure JS, no resource leak) but could be avoided by caching the Promise instead of the resolved value. The same applies to getTokenCounter in tokenCounterFactory.ts.

5. Map key type (TokenCounter.ts:19)

encodingModules uses Map<string, CountTokensFn> but should be Map<TokenEncoding, CountTokensFn> for type consistency.

6. Tests with hardcoded token counts (TokenCounter.test.ts)

The exact count assertions will break if gpt-tokenizer adjusts tokenization in a patch release. These encodings are standardized so it is unlikely, but worth being aware of.

Looks Good

  • The z.enum(TOKEN_ENCODINGS) validation is a clear improvement over the previous unsafe val as TiktokenEncoding cast
  • Keeping the worker pool infrastructure while only swapping the tokenizer is the right approach
  • The PLAIN_TEXT_OPTIONS constant correctly matches the old tiktoken behavior
  • Test rewrite from mocked to real tokenizer provides stronger confidence
  • Benchmark shows consistent improvement on Ubuntu/Windows, validating the WASM overhead reduction claim
  • Commit messages follow conventions, file sizes are well within limits

🤖 Generated with Claude Code
