
perf(core): Replace tiktoken WASM with gpt-tokenizer #1350

Merged
yamadashy merged 2 commits into main from perf/swap-tiktoken-to-gpt-tokenizer on Mar 29, 2026

Conversation


@yamadashy yamadashy commented Mar 28, 2026

Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token counting, eliminating ~200ms WASM initialization overhead while keeping the existing worker pool infrastructure for parallel processing.

This is a minimal, focused replacement — only the tokenizer library changes. The worker pool, parallel chunk processing, and pre-warming infrastructure are all preserved.

Changes

  • Swap tiktoken dependency for gpt-tokenizer in package.json
  • Rewrite TokenCounter to use gpt-tokenizer's async dynamic import with lazy-loaded encoding modules
  • Add TOKEN_ENCODINGS constant with z.enum() validation in config schema (replaces unsafe val as TiktokenEncoding cast)
  • Use { disallowedSpecial: new Set() } to match tiktoken's encode(content, [], []) behavior
  • Add p50k_edit encoding for backward compatibility
  • Update worker to handle async getTokenCounter() initialization
  • Rewrite tests with exact token count assertions against real gpt-tokenizer
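The lazy-initialization flow described above can be sketched as follows. This is an illustrative sketch, not the actual TokenCounter source; the stand-in whitespace tokenizer replaces the real `await import('gpt-tokenizer/encoding/...')` call so the example stays self-contained:

```typescript
// Illustrative sketch of lazy, per-encoding tokenizer loading (assumption:
// not the real repomix TokenCounter). In the real code, loadEncoding would do
//   await import(`gpt-tokenizer/encoding/${encoding}`)
// and return the module's countTokens export; here a whitespace tokenizer
// stands in so the sketch runs without the dependency.
type CountTokensFn = (text: string) => number;

// Module-level cache: each encoding module is loaded at most once.
const encodingModules = new Map<string, CountTokensFn>();

async function loadEncoding(encoding: string): Promise<CountTokensFn> {
  const cached = encodingModules.get(encoding);
  if (cached) return cached;
  const countTokens: CountTokensFn = (text) =>
    text.split(/\s+/).filter(Boolean).length; // stand-in tokenizer
  encodingModules.set(encoding, countTokens);
  return countTokens;
}

class TokenCounter {
  private countTokensFn: CountTokensFn | undefined;

  constructor(private readonly encoding: string) {}

  // Async init replaces tiktoken's synchronous WASM setup.
  async init(): Promise<void> {
    this.countTokensFn = await loadEncoding(this.encoding);
  }

  countTokens(content: string): number {
    if (!this.countTokensFn) {
      throw new Error(`TokenCounter(${this.encoding}) not initialized; call init() first`);
    }
    return this.countTokensFn(content);
  }
}
```

Calling countTokens before init() throws, which is the behavior the rewritten tests assert against.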

Checklist

  • Run npm run test
  • Run npm run lint



github-actions bot commented Mar 28, 2026

⚡ Performance Benchmark

Latest commit: d62bf84 test(core): Fix test name and add cl100k_base encoding test
Status: ✅ Benchmark complete!
Ubuntu: 2.11s (±0.07s) → 2.04s (±0.03s) · -0.07s (-3.2%)
macOS: 1.04s (±0.05s) → 1.05s (±0.08s) · +0.00s (+0.3%)
Windows: 2.29s (±0.10s) → 2.23s (±0.11s) · -0.05s (-2.4%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run
History

62b7861 perf(core): Replace tiktoken WASM with gpt-tokenizer (pure JS)

Ubuntu: 1.97s (±0.01s) → 1.87s (±0.02s) · -0.10s (-5.0%)
macOS: 1.21s (±0.16s) → 1.22s (±0.25s) · +0.02s (+1.4%)
Windows: 2.31s (±0.21s) → 2.27s (±0.08s) · -0.04s (-1.9%)

b392c00 Revert "perf(core): Match output chunk count to CPU cores instead of fixed 1000"

Ubuntu: 2.01s (±0.01s) → 1.92s (±0.01s) · -0.09s (-4.6%)
macOS: 1.57s (±0.35s) → 1.68s (±0.34s) · +0.10s (+6.5%)
Windows: 2.25s (±0.02s) → 2.21s (±0.02s) · -0.04s (-2.0%)

02cc1c9 perf(core): Match output chunk count to CPU cores instead of fixed 1000

Ubuntu: 1.97s (±0.02s) → 1.80s (±0.03s) · -0.17s (-8.7%)
macOS: 1.06s (±0.08s) → 1.14s (±0.09s) · +0.08s (+7.9%)
Windows: 2.26s (±0.02s) → 2.22s (±0.02s) · -0.04s (-1.9%)

bb67792 perf(core): Replace tiktoken WASM with gpt-tokenizer (pure JS)

Ubuntu: 1.93s (±0.02s) → 1.80s (±0.03s) · -0.12s (-6.4%)
macOS: 1.08s (±0.08s) → 1.04s (±0.08s) · -0.04s (-3.3%)
Windows: 2.29s (±0.22s) → 2.23s (±0.04s) · -0.06s (-2.4%)

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the tiktoken library with gpt-tokenizer for token counting, transitioning to a pure JavaScript implementation. Key changes include the introduction of an asynchronous init() method in the TokenCounter class for lazy-loading encoding modules, updates to the configuration schema, and refactoring of metrics calculation logic to support the new async initialization. Feedback was provided regarding a misleading test name in the TokenCounter test suite, where the description did not match the test's behavior.


codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.00%. Comparing base (dbc7aee) to head (d62bf84).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1350      +/-   ##
==========================================
- Coverage   87.13%   87.00%   -0.14%     
==========================================
  Files         116      116              
  Lines        4393     4408      +15     
  Branches     1020     1022       +2     
==========================================
+ Hits         3828     3835       +7     
- Misses        565      573       +8     

☔ View full report in Codecov by Sentry.


cloudflare-workers-and-pages bot commented Mar 28, 2026

Deploying repomix with Cloudflare Pages

Latest commit: d62bf84
Status: ✅  Deploy successful!
Preview URL: https://846075d1.repomix.pages.dev
Branch Preview URL: https://perf-swap-tiktoken-to-gpt-to.repomix.pages.dev

View logs


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/core/metrics/TokenCounter.test.ts (1)

4-4: Consider removing unused logger mock.

The logger mock doesn't appear to be used for any assertions in the current test suite. If it's only present to suppress console output during tests, consider whether it's still needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/metrics/TokenCounter.test.ts` at line 4, The
vi.mock('../../../src/shared/logger') call in TokenCounter.test.ts appears
unused; either remove this unused mock to keep tests clean (delete the
vi.mock(...) line) or, if you intended to suppress or assert logging, replace it
with an explicit mock usage/assertion against the shared logger export (e.g.,
spy on the logger methods used by TokenCounter) so the mock is actually
referenced; update TokenCounter.test.ts accordingly to remove dead setup or
exercise the mock.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/core/metrics/tokenCounterFactory.ts`:
- Around line 11-16: The getTokenCounter function can create duplicate
TokenCounter instances when concurrent callers race because tokenCounters is
only set after awaiting tokenCounter.init(); fix by storing an in-flight
initialization promise keyed by encoding before awaiting so subsequent calls
reuse it: e.g., when getTokenCounter sees no entry in tokenCounters, immediately
create a Promise that constructs new TokenCounter(encoding), calls init(), and
resolves the instance, store that promise in the map (or a separate
tokenCounterInits map) under the encoding, await the promise, then replace the
stored promise with the resolved TokenCounter instance; update references to
TokenCounter, getTokenCounter, tokenCounters (or add tokenCounterInits)
accordingly.

In `@tests/core/metrics/TokenCounter.test.ts`:
- Around line 79-83: Update the test title to accurately describe the behavior
being asserted: change the description string for the test that constructs new
TokenCounter('o200k_base') without calling init() and expects
countTokens('test') to throw, e.g., "should throw when counting tokens if not
initialized" so it matches the actual assertion that TokenCounter.countTokens
throws when not initialized; locate the test referencing TokenCounter and
countTokens and replace the misleading title accordingly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f1847047-ff89-46a9-8073-50fa5ec8c41b

📥 Commits

Reviewing files that changed from the base of the PR and between 3e1fc1a and bb67792.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (11)
  • package.json
  • src/config/configSchema.ts
  • src/core/metrics/TokenCounter.ts
  • src/core/metrics/calculateOutputMetrics.ts
  • src/core/metrics/calculateSelectiveFileMetrics.ts
  • src/core/metrics/tokenCounterFactory.ts
  • src/core/metrics/workers/calculateMetricsWorker.ts
  • tests/core/metrics/TokenCounter.test.ts
  • tests/core/metrics/calculateMetrics.test.ts
  • tests/core/metrics/diffTokenCount.test.ts
  • tests/core/packager.test.ts


yamadashy force-pushed the perf/swap-tiktoken-to-gpt-tokenizer branch from b392c00 to bb67792 on March 29, 2026 at 02:52

Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token
counting, eliminating ~200ms WASM initialization overhead while keeping
the existing worker pool infrastructure for parallel processing.

Changes:
- Swap tiktoken dependency for gpt-tokenizer in package.json
- Rewrite TokenCounter to use gpt-tokenizer's async dynamic import
  with lazy-loaded encoding modules cached at module level
- Add TOKEN_ENCODINGS constant with Zod enum validation in config
  schema, replacing unsafe type assertion
- Use { disallowedSpecial: new Set() } to match tiktoken's
  encode(content, [], []) behavior (treat all text as plain text)
- Add p50k_edit encoding for backward compatibility
- Update worker to handle async getTokenCounter initialization
- Rewrite tests to use real gpt-tokenizer with exact token counts

The worker pool, parallel chunk processing, and pre-warming
infrastructure are preserved — only the underlying tokenizer changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yamadashy force-pushed the perf/swap-tiktoken-to-gpt-tokenizer branch from bb67792 to 62b7861 on March 29, 2026 at 12:28
Rename misleading test from 'should return 0 for errors when not
initialized' to 'should throw when countTokens is called before init'
to match the actual assertion (toThrow, not toBe(0)).

Add test for cl100k_base encoding to verify the dynamic import path
works correctly for non-default encodings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude bot commented Mar 29, 2026

PR Review: perf(core): Replace tiktoken WASM with gpt-tokenizer

Overall this is a clean, focused PR. The migration is well executed: the worker pool is preserved, tests are rewritten against the real tokenizer, and type safety is improved with z.enum() validation. A few items worth considering:

Recommended

1. Config to Core dependency direction (src/config/configSchema.ts:2)

The project guidelines say to "maintain feature-based directory structure and avoid dependencies between features." Importing TOKEN_ENCODINGS from core/metrics/TokenCounter.js into config/configSchema.ts creates a config-to-core/metrics dependency, which inverts the typical direction (core depends on config, not the reverse). Previously, the config imported a type from the external tiktoken package, a neutral dependency.

Suggestion: Extract TOKEN_ENCODINGS and TokenEncoding to a shared location (e.g., src/shared/tokenEncodings.ts) and have both config and core/metrics import from there.

2. Base schema still accepts arbitrary strings (src/config/configSchema.ts:72)

The default schema (line 125) correctly uses z.enum(TOKEN_ENCODINGS), but the base schema (repomixConfigBaseSchema) on line 72 still uses z.string().optional() for tokenCount.encoding. This means user config files can pass any string (e.g., "banana"), which would only fail at runtime during the dynamic import.

Suggestion: Update the base schema to z.enum(TOKEN_ENCODINGS).optional() for early validation at config parse time.
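For illustration, the same early-validation idea can be sketched without zod (hypothetical helper, and the exact encoding list below is an assumption; the actual schema uses z.enum(TOKEN_ENCODINGS)):

```typescript
// Sketch of validating tokenCount.encoding at config-parse time rather than
// failing later inside a dynamic import. Illustrative only: the PR does this
// with z.enum(TOKEN_ENCODINGS) in the zod schema, and this encoding list is
// an assumption.
const TOKEN_ENCODINGS = ['o200k_base', 'cl100k_base', 'p50k_base', 'p50k_edit', 'r50k_base'] as const;
type TokenEncoding = (typeof TOKEN_ENCODINGS)[number];

function parseEncoding(value: string): TokenEncoding {
  if (!(TOKEN_ENCODINGS as readonly string[]).includes(value)) {
    // Fail fast with the list of valid encodings instead of a confusing
    // import error at token-counting time.
    throw new Error(
      `Invalid tokenCount.encoding "${value}"; expected one of: ${TOKEN_ENCODINGS.join(', ')}`,
    );
  }
  return value as TokenEncoding;
}
```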

3. Unsafe type assertion on dynamic import (src/core/metrics/TokenCounter.ts:31)

No runtime check that mod.countTokens exists or is a function after the dynamic import. If gpt-tokenizer changes its export shape in a future version, this silently produces undefined and later throws a confusing error.

Suggestion: Add a runtime guard before the type assertion.
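A minimal sketch of such a guard (assumed shape; `mod` stands in for the dynamically imported encoding module):

```typescript
// Guard the dynamically imported module's shape before trusting it.
// Illustrative sketch: `mod` stands in for the result of
// `await import('gpt-tokenizer/encoding/...')`.
type CountTokensFn = (text: string) => number;

function assertCountTokens(mod: Record<string, unknown>, encoding: string): CountTokensFn {
  const fn = mod.countTokens;
  if (typeof fn !== 'function') {
    // Fail with a clear message instead of letting `undefined` propagate
    // and throw somewhere far from the import site.
    throw new Error(`gpt-tokenizer module for "${encoding}" does not export a countTokens function`);
  }
  return fn as CountTokensFn;
}
```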

Not critical but worth noting

4. Race condition in loadEncoding (TokenCounter.ts:21-38)

loadEncoding is async but does not guard against concurrent calls for the same encoding. If multiple callers invoke it simultaneously before the first completes, the import executes multiple times. This is mostly harmless (pure JS, no resource leak) but could be avoided by caching the Promise instead of the resolved value. The same applies to getTokenCounter in tokenCounterFactory.ts.

5. Map key type (TokenCounter.ts:19)

encodingModules uses Map<string, CountTokensFn> but should be Map<TokenEncoding, CountTokensFn> for type consistency.

6. Tests with hardcoded token counts (TokenCounter.test.ts)

The exact count assertions will break if gpt-tokenizer adjusts tokenization in a patch release. These encodings are standardized so it is unlikely, but worth being aware of.

Looks Good

  • The z.enum(TOKEN_ENCODINGS) validation is a clear improvement over the previous unsafe val as TiktokenEncoding cast
  • Keeping the worker pool infrastructure while only swapping the tokenizer is the right approach
  • The PLAIN_TEXT_OPTIONS constant correctly matches the old tiktoken behavior
  • Test rewrite from mocked to real tokenizer provides stronger confidence
  • Benchmark shows consistent improvement on Ubuntu/Windows, validating the WASM overhead reduction claim
  • Commit messages follow conventions, file sizes are well within limits

🤖 Generated with Claude Code
