
perf(core): Replace tiktoken WASM with gpt-tokenizer#1343

Closed
yamadashy wants to merge 11 commits into main from perf/replace-tiktoken-with-gpt-tokenizer

Conversation

@yamadashy
Owner

@yamadashy yamadashy commented Mar 28, 2026

tiktoken uses WASM which adds ~200ms initialization overhead. gpt-tokenizer is a pure JS implementation that eliminates this cost and removes the need for a dedicated worker pool for metrics calculation. This simplifies the metrics pipeline by removing TaskRunner/worker pool infrastructure.

Benchmark improvements: Ubuntu -0.73s, Windows -0.74s, macOS -0.52s.

Changes

  • Swap tiktoken dependency for gpt-tokenizer in package.json
  • Rewrite TokenCounter to use gpt-tokenizer's encode API
  • Remove TaskRunner/worker pool infrastructure from calculateMetrics
  • Remove metricsTaskRunner pre-warming from packager pipeline
  • Update TokenEncoding type in config schema
  • Simplify all metrics calculation modules to use direct token counting

Checklist

  • Run npm run test
  • Run npm run lint


@github-actions
Contributor

github-actions bot commented Mar 28, 2026

⚡ Performance Benchmark

Latest commit: 00e2c9f perf(core): Use fast-path-with-retry for output and git token counting
Status: ✅ Benchmark complete!
Ubuntu: 1.95s (±0.01s) → 1.94s (±0.02s) · -0.01s (-0.3%)
macOS: 1.23s (±0.14s) → 1.64s (±0.11s) · +0.41s (+33.5%)
Windows: 2.51s (±0.17s) → 2.75s (±0.26s) · +0.24s (+9.7%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run
History

ee29cb6 fix(core): Use countTokensPlainText for output and worker token counting

Ubuntu: 1.94s (±0.01s) → 1.95s (±0.03s) · +0.00s (+0.1%)
macOS: 1.72s (±0.49s) → 2.21s (±0.74s) · +0.49s (+28.4%)
Windows: 2.41s (±0.14s) → 2.56s (±0.06s) · +0.15s (+6.3%)

925eddb fix(core): Use countTokensPlainText for git metrics and strengthen test assertions

Ubuntu: 1.93s (±0.02s) → 1.31s (±0.01s) · -0.62s (-32.2%)
macOS: 1.03s (±0.03s) → 0.80s (±0.04s) · -0.23s (-22.4%)
Windows: 2.24s (±0.06s) → 1.58s (±0.03s) · -0.66s (-29.3%)

fffc7c8 fix(core): Add p50k_edit encoding and special token fallback for file metrics

Ubuntu: 1.93s (±0.01s) → 1.30s (±0.02s) · -0.63s (-32.5%)
macOS: 1.08s (±0.13s) → 0.82s (±0.13s) · -0.26s (-23.8%)
Windows: 2.30s (±0.14s) → 1.67s (±0.12s) · -0.64s (-27.6%)

8e30dab fix(core): Add p50k_edit encoding and special token fallback for file metrics

Ubuntu: 1.96s (±0.01s) → 1.32s (±0.02s) · -0.63s (-32.3%)
macOS: 1.04s (±0.06s) → 0.84s (±0.06s) · -0.20s (-19.6%)
Windows: 2.42s (±0.21s) → 1.77s (±0.09s) · -0.65s (-26.9%)

187cd03 fix(core): Add p50k_edit encoding and special token fallback for file metrics

Ubuntu: 1.94s (±0.02s) → 1.31s (±0.01s) · -0.64s (-32.7%)
macOS: 1.59s (±0.21s) → 1.22s (±0.14s) · -0.37s (-23.1%)
Windows: 3.00s (±0.23s) → 1.84s (±0.27s) · -1.16s (-38.7%)

7d25a9d perf(core): Avoid V8 deoptimization from complex catch blocks in countTokens

Ubuntu: 1.96s (±0.01s) → 1.33s (±0.02s) · -0.63s (-32.2%)
macOS: 1.02s (±0.06s) → 0.80s (±0.04s) · -0.23s (-22.0%)
Windows: 2.42s (±0.04s) → 1.85s (±0.07s) · -0.57s (-23.5%)

c83fab1 perf(core): Use fast path without options for token counting

Ubuntu: 1.95s (±0.03s) → 1.97s (±0.03s) · +0.02s (+1.0%)
macOS: 1.07s (±0.08s) → 1.50s (±0.06s) · +0.43s (+40.2%)
Windows: 2.22s (±0.01s) → 2.38s (±0.01s) · +0.17s (+7.4%)

7dc97ed perf(core): Use disallowedSpecial instead of two-phase try/catch for token counting

Ubuntu: 2.04s (±0.02s) → 2.09s (±0.05s) · +0.04s (+2.2%)
macOS: 1.03s (±0.10s) → 1.42s (±0.08s) · +0.39s (+37.8%)
Windows: 2.98s (±0.09s) → 2.85s (±0.10s) · -0.12s (-4.1%)

b138e40 fix(core): Add special token fallback and fix tokenCounterFactory race condition

Ubuntu: 2.02s (±0.01s) → 2.02s (±0.03s) · -0.01s (-0.3%)
macOS: 1.42s (±0.25s) → 2.14s (±0.45s) · +0.72s (+51.2%)
Windows: 2.43s (±0.04s) → 2.62s (±0.01s) · +0.19s (+7.8%)

5c04252 fix(core): Validate token encoding with Zod enum and remove ineffective fallback

Ubuntu: 2.04s (±0.01s) → 1.42s (±0.02s) · -0.62s (-30.4%)
macOS: 1.73s (±0.18s) → 1.35s (±0.17s) · -0.38s (-22.2%)
Windows: 2.43s (±0.02s) → 1.82s (±0.13s) · -0.60s (-24.8%)

f151855 perf(core): Replace tiktoken WASM with gpt-tokenizer and simplify metrics pipeline

Ubuntu: 2.13s (±0.01s) → 1.48s (±0.00s) · -0.66s (-30.8%)
macOS: 1.11s (±0.04s) → 0.86s (±0.05s) · -0.25s (-22.3%)
Windows: 2.39s (±0.03s) → 1.74s (±0.03s) · -0.65s (-27.3%)

@codecov

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 85.49618% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.75%. Comparing base (bd9f343) to head (00e2c9f).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
src/core/metrics/TokenCounter.ts 68.57% 11 Missing ⚠️
src/core/metrics/calculateGitDiffMetrics.ts 85.71% 2 Missing ⚠️
src/core/metrics/calculateGitLogMetrics.ts 91.66% 1 Missing ⚠️
src/core/metrics/calculateMetrics.ts 95.65% 1 Missing ⚠️
src/core/metrics/calculateOutputMetrics.ts 85.71% 1 Missing ⚠️
src/core/metrics/calculateSelectiveFileMetrics.ts 92.85% 1 Missing ⚠️
src/core/metrics/tokenCounterFactory.ts 66.66% 1 Missing ⚠️
src/core/packager.ts 95.65% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1343      +/-   ##
==========================================
- Coverage   87.13%   86.75%   -0.39%     
==========================================
  Files         116      115       -1     
  Lines        4393     4386       -7     
  Branches     1020     1023       +3     
==========================================
- Hits         3828     3805      -23     
- Misses        565      581      +16     


@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 28, 2026

Deploying repomix with Cloudflare Pages

Latest commit: 00e2c9f
Status: ✅  Deploy successful!
Preview URL: https://7c163fe5.repomix.pages.dev
Branch Preview URL: https://perf-replace-tiktoken-with-g.repomix.pages.dev


gemini-code-assist[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@coderabbitai
Contributor

coderabbitai bot commented Mar 28, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fef7cb05-a66e-467f-9a1a-689982a31f4d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Replaced tiktoken with gpt-tokenizer library and removed worker-based task runner for token counting. Token counting is now async, runs on the main thread, and uses lazy-loaded encoding modules. All metrics calculation functions updated to use the new async token counter factory instead of task runners.
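The lazy-loaded encoding cache described above can be sketched as a module-level map keyed by encoding name. The loader below is a stub standing in for the real dynamic import (the actual code loads modules like gpt-tokenizer/encoding/o200k_base); names and shapes here are illustrative, not the repomix implementation.

```typescript
// Module-level cache of lazily loaded encoders, keyed by encoding name.
// loadEncodingModule is a stub standing in for the real dynamic import
// of a gpt-tokenizer encoding module.
type Encoder = { countTokens: (text: string) => number };

let loads = 0;
async function loadEncodingModule(name: string): Promise<Encoder> {
  loads++; // track how many times the "module" is actually loaded
  return { countTokens: (text) => text.split(/\s+/).filter(Boolean).length };
}

const encodingModules = new Map<string, Promise<Encoder>>();

function getEncoder(name: string): Promise<Encoder> {
  let cached = encodingModules.get(name);
  if (!cached) {
    cached = loadEncodingModule(name);
    encodingModules.set(name, cached);
  }
  return cached;
}

const demo2 = (async () => {
  const a = await getEncoder("o200k_base");
  const b = await getEncoder("o200k_base"); // cache hit: no second load
  return { count: a.countTokens("hello world"), sameModule: a === b, loads };
})();
```

Because the Promise itself is cached, repeat callers pay the load cost only once per encoding, which is the property the walkthrough attributes to the new TokenCounter lifecycle.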

Changes

Cohort / File(s) Summary
Dependency & Config
package.json, src/config/configSchema.ts
Replaced tiktoken (^1.0.22) with gpt-tokenizer (^3.4.0) in dependencies; updated schema imports and type casts from TiktokenEncoding to TokenEncoding.
Token Counter Core
src/core/metrics/TokenCounter.ts, src/core/metrics/tokenCounterFactory.ts
Replaced sync tiktoken encoding with async lazy-loading of gpt-tokenizer/encoding/${encodingName} modules; added async init() lifecycle; exported new TokenEncoding type; getTokenCounter now returns Promise<TokenCounter>; free() changed to no-op.
Metrics Calculation
src/core/metrics/calculateGitDiffMetrics.ts, src/core/metrics/calculateGitLogMetrics.ts, src/core/metrics/calculateMetrics.ts, src/core/metrics/calculateOutputMetrics.ts, src/core/metrics/calculateSelectiveFileMetrics.ts
Removed TaskRunner/TokenCountTask dependency injection; replaced with optional getTokenCounter provider; token counting now executes sequentially on main thread instead of via worker pool; updated dependency injection patterns and logging messages.
Worker Updates
src/core/metrics/workers/calculateMetricsWorker.ts
Updated TokenCountTask.encoding type from TiktokenEncoding to TokenEncoding; made token counter acquisition async via await getTokenCounter().
Packager Integration
src/core/packager.ts
Removed createMetricsTaskRunner dependency, worker warm-up, and cleanup logic; flattened control flow to run metrics on main thread without task runner initialization.
Unit Tests
tests/core/metrics/TokenCounter.test.ts, tests/core/metrics/calculateGitDiffMetrics.test.ts, tests/core/metrics/calculateGitLogMetrics.test.ts, tests/core/metrics/calculateMetrics.test.ts, tests/core/metrics/calculateOutputMetrics.test.ts, tests/core/metrics/calculateSelectiveFileMetrics.test.ts, tests/core/metrics/diffTokenCount.test.ts, tests/core/packager.test.ts, tests/core/packager/diffsFunctionality.test.ts, tests/core/packager/splitOutput.test.ts, tests/integration-tests/packager.test.ts
Removed mocks of createMetricsTaskRunner and worker-based token counting; replaced with real or mocked TokenCounter instances; updated assertions to verify main-thread execution and looser token count checks instead of exact values and worker call assertions.

Sequence Diagram

sequenceDiagram
    participant Packager as Packager
    participant CalcMetrics as calculateMetrics
    participant CalcFiles as calculateSelectiveFileMetrics
    participant CalcGit as calculateGitDiffMetrics<br/>(and calculateGitLogMetrics)
    participant CalcOutput as calculateOutputMetrics
    participant Factory as getTokenCounter Factory
    participant Counter as TokenCounter

    Packager->>CalcMetrics: calculateMetrics()
    CalcMetrics->>CalcFiles: calculateSelectiveFileMetrics()
    CalcFiles->>Factory: await getTokenCounter(encoding)
    Factory->>Counter: new TokenCounter(encoding)
    Counter->>Counter: await init() [lazy-load gpt-tokenizer module]
    Factory-->>CalcFiles: initialized TokenCounter
    CalcFiles->>Counter: countTokens(content)
    Counter-->>CalcFiles: token count
    CalcFiles-->>CalcMetrics: fileTokenCounts
    
    par Parallel Execution
        CalcMetrics->>CalcOutput: calculateOutputMetrics()
        CalcOutput->>Factory: await getTokenCounter(encoding)
        Factory-->>CalcOutput: TokenCounter
        CalcOutput->>Counter: countTokens(content)
        Counter-->>CalcOutput: token count
        
        CalcMetrics->>CalcGit: calculateGitDiffMetrics()
        CalcGit->>Factory: await getTokenCounter(encoding)
        Factory-->>CalcGit: TokenCounter
        CalcGit->>Counter: countTokens(content)
        Counter-->>CalcGit: token count
    end
    
    CalcMetrics-->>Packager: metrics result (main thread, no workers)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #1302 — This PR directly undoes and replaces the worker-pool and tiktoken-based token counting approach introduced in that PR with an async gpt-tokenizer-based main-thread solution.
  • PR #260 — Modifies token-count encoding configuration, schema types, and TokenCounter encoding type usage across the same metrics files.
  • PR #851 — Modifies the core metrics token-counting pipeline and worker/task-runner approach, affecting the same files like calculateGitDiffMetrics, calculateSelectiveFileMetrics, and worker modules.
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: replacing tiktoken with gpt-tokenizer for performance improvement, which is the primary objective of the entire changeset.
Description check ✅ Passed The PR description follows the template, includes a clear summary of changes, lists specific modifications, provides benchmark data, and has both checklist items marked as completed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@claude

This comment has been minimized.

coderabbitai[bot]

This comment was marked as resolved.

@claude

This comment has been minimized.

devin-ai-integration[bot]

This comment was marked as resolved.

@claude

This comment has been minimized.

@claude

This comment has been minimized.

@yamadashy yamadashy force-pushed the perf/replace-tiktoken-with-gpt-tokenizer branch from b138e40 to a0610c9 on March 28, 2026 at 11:56
@claude

This comment has been minimized.

devin-ai-integration[bot]

This comment was marked as resolved.

@claude

This comment has been minimized.

devin-ai-integration[bot]

This comment was marked as resolved.

@claude
Contributor

claude bot commented Mar 28, 2026

🔍 Code Review — perf(core): Replace tiktoken WASM with gpt-tokenizer

Great performance win! The architecture simplification (removing worker pool, moving to pure JS tokenizer) is well-motivated and the benchmark numbers are compelling (~0.5–0.7s across platforms). A few findings below, ordered by severity.


1. countTokensPlainText() is dead code — special token files silently return 0 tokens

countTokens() calls this.countFn(content) with no options. If gpt-tokenizer's default mode throws on special token sequences like <|endoftext|>, the catch block returns 0. The countTokensPlainText() method exists to handle this exact case (using FALLBACK_OPTIONS), but it is never called anywhere in the codebase.

I see from the commit history that this was deliberate — a V8 deoptimization concern with complex catch blocks. However, the current state means files containing special token patterns (~0.1% of repos per the comment) will silently report 0 tokens rather than an approximate count. This is a behavioral regression from tiktoken, which always used encode(content, [], []).

Suggestion

Consider having callers (e.g., calculateSelectiveFileMetrics) call countTokensPlainText() as a retry when countTokens() returns 0 for non-empty content. This keeps the hot path clean while still handling the edge case:

let tokenCount = counter.countTokens(file.content, file.path);
if (tokenCount === 0 && file.content.length > 0) {
  tokenCount = counter.countTokensPlainText(file.content, file.path);
}

2. Race condition in getTokenCounter() — deliberately kept but worth documenting

The check-then-await-then-set pattern in tokenCounterFactory.ts can create duplicate TokenCounter instances under concurrent calls. The commit history shows this was fixed and then reverted (commit 7d25a9d) because "getTokenCounter is effectively serialized by the pipeline."

This is a reasonable tradeoff, but calculateMetrics.ts does run output/git metrics in parallel via Promise.all, all calling getTokenCounter. Since the counter is already warm from the earlier calculateSelectiveFileMetrics call, the race window is practically closed. A brief comment in tokenCounterFactory.ts documenting this design decision would prevent future contributors from re-discovering and re-fixing this.
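The reverted promise-memoization fix can be sketched in isolation. TokenCounter below is a minimal stand-in with an async init(), not the repomix class; the point is only the caching pattern:

```typescript
// Sketch of promise memoization: cache the in-flight Promise rather than
// the resolved value so concurrent callers share one initialization.
// TokenCounter here is a stand-in, not the repomix implementation.
let initCount = 0;

class TokenCounter {
  async init(): Promise<void> {
    initCount++;
    // Simulate the async lazy-load of an encoding module.
    await new Promise((resolve) => setTimeout(resolve, 10));
  }
}

const tokenCounterPromises = new Map<string, Promise<TokenCounter>>();

const getTokenCounter = (encoding: string): Promise<TokenCounter> => {
  let pending = tokenCounterPromises.get(encoding);
  if (!pending) {
    pending = (async () => {
      const counter = new TokenCounter();
      await counter.init();
      return counter;
    })();
    // The map is populated synchronously, before any await completes,
    // which closes the check-then-set race window.
    tokenCounterPromises.set(encoding, pending);
  }
  return pending;
};

// Two concurrent requests for the same encoding share one instance.
const demo = Promise.all([
  getTokenCounter("o200k_base"),
  getTokenCounter("o200k_base"),
]).then(([a, b]) => ({ same: a === b, initCount }));
```

With the check-then-await-then-set version, both concurrent calls would observe an empty map and run init() twice; memoizing the Promise makes the second caller await the first caller's initialization instead.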


3. calculateMetricsWorker.ts is now dead code

The worker file was updated to use the new async getTokenCounter API, but no caller dispatches to it anymore — the TaskRunner/worker pool was removed. The worker type 'calculateMetrics' likely still exists in processConcurrency.ts and unifiedWorker.ts. Consider either removing the dead worker infrastructure or adding a comment explaining it's retained intentionally.


4. TOKEN_ENCODINGS creates a cross-feature dependency

src/config/configSchema.ts imports TOKEN_ENCODINGS from src/core/metrics/TokenCounter.ts. Per CLAUDE.md, the project should "avoid dependencies between features." The config layer now depends on a core/metrics implementation detail. Moving TOKEN_ENCODINGS to src/shared/ or defining it directly in configSchema.ts would fix this.


5. Test assertions are weak — toBeGreaterThan(0) everywhere

Nearly all non-empty token count assertions changed from exact values to > 0. A broken implementation returning 1 for all inputs would pass the full test suite. Consider pinning a few stable exact counts (e.g., 'Hello, world!' → expected N tokens with o200k_base) as regression anchors. This was also flagged by CodeRabbit.


6. Minor: Error context dropped in catch blocks

} catch {   // error variable not bound
  logger.warn(`Failed to count tokens. path: ${filePath}`);
  return 0;
}

The error object is silently discarded. This is intentional for V8 optimization, but logging the error at trace level would preserve debuggability without affecting the hot path optimization.


Additional minor observations
  • freeTokenCounters() doesn't clear encodingModules Map in TokenCounter.ts. For CLI usage this is fine, but for MCP server (long-running process), encoding modules persist forever. Consider clearing both maps in freeTokenCounters().
  • { ...defaultDeps, ...deps } spread happens on every metrics function call even when deps is empty (the common production case). Minor allocation overhead.
  • No test for non-default encodings (cl100k_base, p50k_base, r50k_base). All tests use o200k_base.

Overall this is a solid performance improvement with clean architecture simplification. The main concern is the silent 0-token regression for files with special token sequences (#1 above). The rest are minor improvements that could be addressed in follow-up PRs.

🤖 Generated with Claude Code

@claude
Contributor

claude bot commented Mar 28, 2026

Code Review — perf(core): Replace tiktoken WASM with gpt-tokenizer

Good iteration since the last review — the special token fallback in calculateSelectiveFileMetrics (commit 187cd03) addresses the biggest concern. A few new/remaining findings below.


1. countTokensPlainText fallback missing in output/git metrics paths

The fallback (countTokens → countTokensPlainText when result is 0 for non-empty content) was added to calculateSelectiveFileMetrics but not to calculateOutputMetrics, calculateGitDiffMetrics, or calculateGitLogMetrics. The packed output embeds all file contents, so if any file contains special token sequences like <|endoftext|>, the entire output token count could silently return 0 — while per-file counts correctly fall back. This would produce confusing results where totalTokens is 0 but individual file token counts are non-zero.

Consider using countTokensPlainText directly in these callers (since the comment notes it's actually faster — it skips the special token regex scan), or applying the same fallback pattern.


2. JSDoc references non-existent method countTokensSafe()

In src/core/metrics/TokenCounter.ts, the JSDoc on countTokens() says:

Use countTokensSafe() to handle all files correctly.

The actual method is countTokensPlainText(). Stale reference that will confuse callers.


3. No test coverage for countTokensPlainText() or the fallback path

Details
  • countTokensPlainText() is a new public method with no unit test in TokenCounter.test.ts
  • The fallback path in calculateSelectiveFileMetrics (lines 37–39: retry with countTokensPlainText when countTokens returns 0 for non-empty content) has no test either
  • A test with content containing <|endoftext|> would exercise both paths and serve as a regression anchor for the core behavioral change in this PR

4. Asymmetric error handling: git diff rethrows, git log swallows

calculateGitDiffMetrics rethrows errors (its catch block logs and then rethrows), while calculateGitLogMetrics catches and returns { gitLogTokenCount: 0 }. Both are called from Promise.all in calculateMetrics. If the git diff path throws, the entire metrics calculation fails — but the same error in git logs is silently swallowed. This asymmetry looks unintentional.


Previously flagged (still open)

Items 3–6 from the previous review are still applicable:

  • Dead worker code (calculateMetricsWorker.ts + references in processConcurrency.ts/unifiedWorker.ts)
  • Cross-feature dependency (config → core/metrics for TOKEN_ENCODINGS)
  • Weak test assertions (toBeGreaterThan(0) everywhere)
  • Error context dropped in catch blocks

Overall this is well-architected and the benchmark improvements are solid. Items 1 and 2 above are the most actionable for this PR; the rest could be follow-ups.

🤖 Generated with Claude Code

devin-ai-integration[bot]

This comment was marked as resolved.

yamadashy and others added 6 commits March 29, 2026 00:01
…rics pipeline

Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token
counting. This eliminates ~200ms WASM initialization overhead and removes
the need for a dedicated worker pool for metrics calculation.

- Swap tiktoken dependency for gpt-tokenizer in package.json
- Rewrite TokenCounter to use gpt-tokenizer's encode API
- Remove TaskRunner/worker pool infrastructure from calculateMetrics
- Remove metricsTaskRunner pre-warming from packager pipeline
- Update TokenEncoding type in config schema
- Simplify all metrics calculation modules to use direct token counting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ve fallback

Replace unsafe type assertion (`val as TokenEncoding`) in config schema
with Zod `.enum()` validation using a shared `TOKEN_ENCODINGS` constant.
This prevents arbitrary strings from reaching the dynamic import in
`TokenCounter.loadEncoding()`.

Also remove the ineffective fallback in `TokenCounter.countTokens()` that
re-called the same cached function after it already threw, making the
retry logic dead code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e condition

Restore special token handling that was lost in the tiktoken-to-gpt-tokenizer
migration. Files containing sequences like <|endoftext|> now correctly fall
back to countTokens with allowedSpecial: 'all' instead of returning 0.

Also memoize in-flight initialization promises in getTokenCounter() to
prevent duplicate TokenCounter instances when called concurrently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…token counting

Replace the two-phase approach (fast path without options → fallback with
allowedSpecial: 'all') with a single call using { disallowedSpecial: new Set() }.

The previous approach caused benchmark regressions (macOS +51%, Windows +7.8%)
because gpt-tokenizer's default disallowedSpecial: 'all' throws on any text
matching special token patterns, triggering the costly fallback on many files.

Additionally, allowedSpecial: 'all' had incorrect semantics — it counted
<|endoftext|> as 1 control token instead of 7 text tokens, diverging from
the old tiktoken behavior. Using { disallowedSpecial: new Set() } treats
all content as plain text, matching tiktoken's encode(content, [], []).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid passing options to gpt-tokenizer's countTokens() on every call.
When options are provided, gpt-tokenizer calls processSpecialTokens()
each time instead of using its pre-cached defaultSpecialTokenConfig,
adding significant per-call overhead.

Use the no-options fast path by default, and fall back to
{ disallowedSpecial: new Set() } only for the rare files (~0.1%)
that contain special token sequences like <|endoftext|>.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tTokens

V8 deoptimizes the entire try block when the catch block contains
non-trivial logic (method calls, nested try/catch). Local benchmarks
showed +25% regression (~1050ms → ~1330ms) from adding a fallback
retry in the catch path.

Keep countTokens() catch block minimal (log + return 0) to match the
structure V8 optimizes well. Provide special token handling as a
separate countTokensPlainText() method that callers can use when they
know content contains special token sequences.

Also revert tokenCounterFactory race condition fix — the IIFE-based
Promise memoization added complexity without measurable benefit since
getTokenCounter is effectively serialized by the pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yamadashy yamadashy force-pushed the perf/replace-tiktoken-with-gpt-tokenizer branch from 187cd03 to 8e30dab on March 28, 2026 at 15:02
… metrics

Add p50k_edit to TOKEN_ENCODINGS for backward compatibility with users
who had this tiktoken encoding in their config.

Add countTokensPlainText() method that uses { disallowedSpecial: new Set() }
to match tiktoken's encode(content, [], []) behavior, treating special
tokens like <|endoftext|> as ordinary text.

In the file metrics hot loop, retry with countTokensPlainText() when
countTokens() returns 0 for non-empty files, handling the rare (~0.1%)
files containing special token sequences without affecting hot path
performance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

@yamadashy yamadashy force-pushed the perf/replace-tiktoken-with-gpt-tokenizer branch from 8e30dab to fffc7c8 on March 28, 2026 at 15:09
…st assertions

Use countTokensPlainText() for git diff and git log token counting to
correctly handle special token sequences in diffs. These are cold paths
(1-2 calls each) so the processSpecialTokens overhead is negligible.

Strengthen test assertions from toBeGreaterThan(0) to exact token counts
verified against gpt-tokenizer's o200k_base encoding. This catches
encoding regressions and tokenizer implementation drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The output content often contains special token sequences like
<|endoftext|> from packed source files (e.g., TokenCounter.ts comments,
release notes). Using countTokens() on this content causes gpt-tokenizer
to throw, silently returning 0 for the entire output — making Total
Tokens show as 0 in the summary.

Switch calculateOutputMetrics and calculateMetricsWorker to use
countTokensPlainText() which matches tiktoken's encode(content, [], [])
behavior, treating all content as ordinary text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude
Contributor

claude bot commented Mar 28, 2026

Code Review — perf(core): Replace tiktoken WASM with gpt-tokenizer

Nice performance win! The removal of WASM initialization overhead and worker pool complexity is a solid simplification. A few observations worth considering:

Findings

1. Potentially dead code: calculateMetricsWorker.ts and related infrastructure
Severity: Medium

The PR removes all TaskRunner usage for metrics from the main code path, but calculateMetricsWorker.ts was updated (not removed). It's still registered in src/shared/unifiedWorker.ts (line 52-55) as workerType: 'calculateMetrics' and referenced in inferWorkerTypeFromTask. Since no code creates a TaskRunner with this worker type anymore, this is dead code. Consider removing or adding a TODO.

2. Inconsistent use of countTokens vs countTokensPlainText
Severity: Low

In calculateSelectiveFileMetrics.ts, there's a two-phase approach: try countTokens() first, fall back to countTokensPlainText() if result is 0. But calculateOutputMetrics, calculateGitDiffMetrics, calculateGitLogMetrics, and the worker all use only countTokensPlainText().

The comment in TokenCounter.ts says countTokens() is the "fast path" that may fail on special tokens. If countTokensPlainText() is also described as "faster than the default" (line 16-17), it's unclear what performance benefit the two-phase approach provides for file metrics. Consider either using countTokensPlainText() consistently everywhere, or documenting why the two-phase approach is needed only for file metrics.
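The two-phase approach questioned here can be sketched with a stub tokenizer. The stub mimics gpt-tokenizer's default behavior of throwing on special token sequences; neither the real library nor its options are used, so the token counts are only a crude word-count proxy:

```typescript
// Sketch of the fast-path-with-retry pattern using a stub tokenizer.
// The stub throws on special token sequences, as gpt-tokenizer does by
// default; the "plain text" variant counts them as ordinary text.
const SPECIAL = /<\|[a-z_]+\|>/;

function stubCountTokens(content: string): number {
  if (SPECIAL.test(content)) {
    throw new Error("disallowed special token");
  }
  return content.split(/\s+/).filter(Boolean).length; // crude token proxy
}

function stubCountTokensPlainText(content: string): number {
  // Treats special sequences as ordinary text instead of throwing.
  return content.split(/\s+/).filter(Boolean).length;
}

function countTokensFast(content: string): number {
  try {
    return stubCountTokens(content);
  } catch {
    return 0; // minimal catch body, mirroring the log-and-zero hot path
  }
}

function countWithFallback(content: string): number {
  let count = countTokensFast(content);
  if (count === 0 && content.length > 0) {
    count = stubCountTokensPlainText(content); // rare path: special tokens present
  }
  return count;
}
```

With this stub, the fast path alone reports 0 for content containing <|endoftext|>, while the retry recovers a non-zero count — the behavioral difference the two-phase approach exists to cover.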

3. Two-level caching may be redundant
Severity: Low

There are two caches: encodingModules (module-level Map in TokenCounter.ts) and tokenCounters (module-level Map in tokenCounterFactory.ts). Since TokenCounter.init() calls loadEncoding() which checks encodingModules, and the factory caches TokenCounter instances, the encoding module cache is effectively redundant when going through the factory. Not a bug, but worth noting for clarity.

4. Sequential file metrics processing
Severity: Info

calculateSelectiveFileMetrics switched from parallel Promise.all with worker threads to a sequential for loop on the main thread. The benchmarks show this is faster for typical repos (WASM overhead > parallelism benefit), but for very large repos with thousands of files, this blocks the main thread. The benchmarks validate the current approach, so this is fine as-is — just worth monitoring if users report slowness on very large codebases.


Overall this is a clean, well-structured PR that achieves meaningful perf improvements through simplification. The main actionable item is cleaning up the now-unused calculateMetricsWorker.ts and its registration in unifiedWorker.ts.

🤖 Generated with Claude Code

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 12 additional findings in Devin Review.


Comment on lines +12 to 19
export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => {
  let tokenCounter = tokenCounters.get(encoding);
  if (!tokenCounter) {
    tokenCounter = new TokenCounter(encoding);
    await tokenCounter.init();
    tokenCounters.set(encoding, tokenCounter);
  }
  return tokenCounter;
Contributor


🟡 Race condition in async getTokenCounter allows duplicate TokenCounter creation

The getTokenCounter function has a classic TOCTOU (time-of-check-time-of-use) race condition. It checks the map, then awaits init(), then sets the map. If two concurrent callers request the same encoding before either completes, both see the map as empty, both create and initialize a TokenCounter, and the second one overwrites the first in the map. The first caller gets a TokenCounter instance that is not stored in the cache.

The comment in calculateMetrics.ts:58-59 acknowledges sequential ordering is needed ("File metrics must run first... then output/git metrics can run in parallel since they share the cached TokenCounter"), but if calculateSelectiveFileMetrics returns early at calculateSelectiveFileMetrics.ts:24-26 (when filesToProcess.length === 0), the counter is never cached and the subsequent Promise.all at calculateMetrics.ts:67 triggers concurrent calls to getTokenCounter from calculateOutputMetrics, calculateGitDiffMetrics, and calculateGitLogMetrics.

The fix is to cache the Promise<TokenCounter> instead of the resolved value, ensuring all concurrent callers share the same initialization promise.

Suggested change

Before:

export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => {
  let tokenCounter = tokenCounters.get(encoding);
  if (!tokenCounter) {
    tokenCounter = new TokenCounter(encoding);
    await tokenCounter.init();
    tokenCounters.set(encoding, tokenCounter);
  }
  return tokenCounter;

After:

export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => {
  let pending = tokenCounterPromises.get(encoding);
  if (!pending) {
    pending = (async () => {
      const tokenCounter = new TokenCounter(encoding);
      await tokenCounter.init();
      tokenCounters.set(encoding, tokenCounter);
      return tokenCounter;
    })();
    tokenCounterPromises.set(encoding, pending);
  }
  return pending;
};


yamadashy and others added 2 commits March 29, 2026 00:28
The calculateMetricsWorker and its 'calculateMetrics' worker type are
no longer used — all token counting now runs on the main thread via
gpt-tokenizer. Remove the worker file, its case branches in
processConcurrency and unifiedWorker, and the associated tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the same retry pattern as file metrics: try countTokens() first
(fast path, uses cached defaultSpecialTokenConfig), and only fall back
to countTokensPlainText() when the result is 0 for non-empty content.

This avoids processSpecialTokens() overhead for repos without special
token sequences while still correctly counting tokens when they appear.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yamadashy yamadashy marked this pull request as draft March 28, 2026 15:49
@yamadashy
Owner Author

yamadashy commented Mar 28, 2026

Background and Change of Approach

This PR replaced tiktoken (WASM) with gpt-tokenizer (pure JS) and removed the worker pool to run token counting directly on the main thread.

Issues Discovered

  1. Output token count silently becomes 0: Repomix's own source code (TokenCounter.ts comments) contains <|endoftext|>, which causes gpt-tokenizer's default behavior (disallowedSpecial: 'all') to throw — making Total Tokens report as 0
  2. countTokensPlainText() overhead: Passing { disallowedSpecial: new Set() } forces gpt-tokenizer to call processSpecialTokens() on every invocation. For the ~3.7MB output content, this adds significant latency
  3. V8 inlining issue: Adding retry logic (calling countTokensPlainText()) inside the countTokens() catch block inflates bytecode size beyond V8 TurboFan's inlining budget, causing +25% performance regression. Modern V8 doesn't "deoptimize try blocks" (that was Crankshaft-era), but catch block complexity still affects inlining decisions and generated code size
  4. Loss of worker parallelism: Sequential main-thread processing performs comparably on Ubuntu, but regresses on macOS (+33%) and Windows (+10%) where worker pool parallelism provided more benefit

New Approach

Based on these findings, the optimal approach is to keep the worker pool and only swap the tokenizer library. Created #1350 with this approach:

  • Use { disallowedSpecial: new Set() } in workers (no V8 inlining impact in worker context)
  • Preserve parallel chunk processing and pre-warming infrastructure
  • Only eliminate WASM initialization overhead (~200ms)

Learnings from this PR (Zod enum validation, TOKEN_ENCODINGS constant, p50k_edit support, exact token count test assertions) are carried forward into #1350.

@yamadashy yamadashy closed this Mar 29, 2026
