perf(core): Replace tiktoken WASM with gpt-tokenizer by yamadashy · Pull Request #1343 · yamadashy/repomix

yamadashy · 2026-03-28T09:54:10Z

tiktoken uses WASM which adds ~200ms initialization overhead. gpt-tokenizer is a pure JS implementation that eliminates this cost and removes the need for a dedicated worker pool for metrics calculation. This simplifies the metrics pipeline by removing TaskRunner/worker pool infrastructure.

Benchmark improvements: Ubuntu -0.73s, Windows -0.74s, macOS -0.52s.

Changes

Swap tiktoken dependency for gpt-tokenizer in package.json
Rewrite TokenCounter to use gpt-tokenizer's encode API
Remove TaskRunner/worker pool infrastructure from calculateMetrics
Remove metricsTaskRunner pre-warming from packager pipeline
Update TokenEncoding type in config schema
Simplify all metrics calculation modules to use direct token counting

Checklist

Run npm run test
Run npm run lint

github-actions · 2026-03-28T09:54:32Z

⚡ Performance Benchmark

Latest commit:	`00e2c9f` perf(core): Use fast-path-with-retry for output and git token counting
Status:	✅ Benchmark complete!
Ubuntu:	1.95s (±0.01s) → 1.94s (±0.02s) · -0.01s (-0.3%)
macOS:	1.23s (±0.14s) → 1.64s (±0.11s) · +0.41s (+33.5%)
Windows:	2.51s (±0.17s) → 2.75s (±0.26s) · +0.24s (+9.7%)

Details

Packing the repomix repository with node bin/repomix.cjs
Warmup: 2 runs (discarded), interleaved execution
Measurement: 20 runs / 30 on macOS (median ± IQR)
Workflow run

History

ee29cb6 fix(core): Use countTokensPlainText for output and worker token counting

Ubuntu:	1.94s (±0.01s) → 1.95s (±0.03s) · +0.00s (+0.1%)
macOS:	1.72s (±0.49s) → 2.21s (±0.74s) · +0.49s (+28.4%)
Windows:	2.41s (±0.14s) → 2.56s (±0.06s) · +0.15s (+6.3%)

925eddb fix(core): Use countTokensPlainText for git metrics and strengthen test assertions

Ubuntu:	1.93s (±0.02s) → 1.31s (±0.01s) · -0.62s (-32.2%)
macOS:	1.03s (±0.03s) → 0.80s (±0.04s) · -0.23s (-22.4%)
Windows:	2.24s (±0.06s) → 1.58s (±0.03s) · -0.66s (-29.3%)

fffc7c8 fix(core): Add p50k_edit encoding and special token fallback for file metrics

Ubuntu:	1.93s (±0.01s) → 1.30s (±0.02s) · -0.63s (-32.5%)
macOS:	1.08s (±0.13s) → 0.82s (±0.13s) · -0.26s (-23.8%)
Windows:	2.30s (±0.14s) → 1.67s (±0.12s) · -0.64s (-27.6%)

8e30dab fix(core): Add p50k_edit encoding and special token fallback for file metrics

Ubuntu:	1.96s (±0.01s) → 1.32s (±0.02s) · -0.63s (-32.3%)
macOS:	1.04s (±0.06s) → 0.84s (±0.06s) · -0.20s (-19.6%)
Windows:	2.42s (±0.21s) → 1.77s (±0.09s) · -0.65s (-26.9%)

187cd03 fix(core): Add p50k_edit encoding and special token fallback for file metrics

Ubuntu:	1.94s (±0.02s) → 1.31s (±0.01s) · -0.64s (-32.7%)
macOS:	1.59s (±0.21s) → 1.22s (±0.14s) · -0.37s (-23.1%)
Windows:	3.00s (±0.23s) → 1.84s (±0.27s) · -1.16s (-38.7%)

7d25a9d perf(core): Avoid V8 deoptimization from complex catch blocks in countTokens

Ubuntu:	1.96s (±0.01s) → 1.33s (±0.02s) · -0.63s (-32.2%)
macOS:	1.02s (±0.06s) → 0.80s (±0.04s) · -0.23s (-22.0%)
Windows:	2.42s (±0.04s) → 1.85s (±0.07s) · -0.57s (-23.5%)

c83fab1 perf(core): Use fast path without options for token counting

Ubuntu:	1.95s (±0.03s) → 1.97s (±0.03s) · +0.02s (+1.0%)
macOS:	1.07s (±0.08s) → 1.50s (±0.06s) · +0.43s (+40.2%)
Windows:	2.22s (±0.01s) → 2.38s (±0.01s) · +0.17s (+7.4%)

7dc97ed perf(core): Use disallowedSpecial instead of two-phase try/catch for token counting

Ubuntu:	2.04s (±0.02s) → 2.09s (±0.05s) · +0.04s (+2.2%)
macOS:	1.03s (±0.10s) → 1.42s (±0.08s) · +0.39s (+37.8%)
Windows:	2.98s (±0.09s) → 2.85s (±0.10s) · -0.12s (-4.1%)

b138e40 fix(core): Add special token fallback and fix tokenCounterFactory race condition

Ubuntu:	2.02s (±0.01s) → 2.02s (±0.03s) · -0.01s (-0.3%)
macOS:	1.42s (±0.25s) → 2.14s (±0.45s) · +0.72s (+51.2%)
Windows:	2.43s (±0.04s) → 2.62s (±0.01s) · +0.19s (+7.8%)

5c04252 fix(core): Validate token encoding with Zod enum and remove ineffective fallback

Ubuntu:	2.04s (±0.01s) → 1.42s (±0.02s) · -0.62s (-30.4%)
macOS:	1.73s (±0.18s) → 1.35s (±0.17s) · -0.38s (-22.2%)
Windows:	2.43s (±0.02s) → 1.82s (±0.13s) · -0.60s (-24.8%)

f151855 perf(core): Replace tiktoken WASM with gpt-tokenizer and simplify metrics pipeline

Ubuntu:	2.13s (±0.01s) → 1.48s (±0.00s) · -0.66s (-30.8%)
macOS:	1.11s (±0.04s) → 0.86s (±0.05s) · -0.25s (-22.3%)
Windows:	2.39s (±0.03s) → 1.74s (±0.03s) · -0.65s (-27.3%)

codecov · 2026-03-28T09:55:07Z

Codecov Report

❌ Patch coverage is 85.49618% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.75%. Comparing base (bd9f343) to head (00e2c9f).
⚠️ Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
src/core/metrics/TokenCounter.ts	68.57%	11 Missing ⚠️
src/core/metrics/calculateGitDiffMetrics.ts	85.71%	2 Missing ⚠️
src/core/metrics/calculateGitLogMetrics.ts	91.66%	1 Missing ⚠️
src/core/metrics/calculateMetrics.ts	95.65%	1 Missing ⚠️
src/core/metrics/calculateOutputMetrics.ts	85.71%	1 Missing ⚠️
src/core/metrics/calculateSelectiveFileMetrics.ts	92.85%	1 Missing ⚠️
src/core/metrics/tokenCounterFactory.ts	66.66%	1 Missing ⚠️
src/core/packager.ts	95.65%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1343      +/-   ##
==========================================
- Coverage   87.13%   86.75%   -0.39%     
==========================================
  Files         116      115       -1     
  Lines        4393     4386       -7     
  Branches     1020     1023       +3     
==========================================
- Hits         3828     3805      -23     
- Misses        565      581      +16

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

cloudflare-workers-and-pages · 2026-03-28T09:55:49Z

Deploying repomix with Cloudflare Pages

Latest commit:	`00e2c9f`
Status:	✅ Deploy successful!
Preview URL:	https://7c163fe5.repomix.pages.dev
Branch Preview URL:	https://perf-replace-tiktoken-with-g.repomix.pages.dev

View logs

coderabbitai · 2026-03-28T10:00:27Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fef7cb05-a66e-467f-9a1a-689982a31f4d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Replaced tiktoken with gpt-tokenizer library and removed worker-based task runner for token counting. Token counting is now async, runs on the main thread, and uses lazy-loaded encoding modules. All metrics calculation functions updated to use the new async token counter factory instead of task runners.

Changes

Cohort / File(s)	Summary
Dependency & Config `package.json`, `src/config/configSchema.ts`	Replaced `tiktoken` (`^1.0.22`) with `gpt-tokenizer` (`^3.4.0`) in dependencies; updated schema imports and type casts from `TiktokenEncoding` to `TokenEncoding`.
Token Counter Core `src/core/metrics/TokenCounter.ts`, `src/core/metrics/tokenCounterFactory.ts`	Replaced sync `tiktoken` encoding with async lazy-loading of `gpt-tokenizer/encoding/${encodingName}` modules; added `async init()` lifecycle; exported new `TokenEncoding` type; `getTokenCounter` now returns `Promise<TokenCounter>`; `free()` changed to no-op.
Metrics Calculation `src/core/metrics/calculateGitDiffMetrics.ts`, `src/core/metrics/calculateGitLogMetrics.ts`, `src/core/metrics/calculateMetrics.ts`, `src/core/metrics/calculateOutputMetrics.ts`, `src/core/metrics/calculateSelectiveFileMetrics.ts`	Removed `TaskRunner`/`TokenCountTask` dependency injection; replaced with optional `getTokenCounter` provider; token counting now executes sequentially on main thread instead of via worker pool; updated dependency injection patterns and logging messages.
Worker Updates `src/core/metrics/workers/calculateMetricsWorker.ts`	Updated `TokenCountTask.encoding` type from `TiktokenEncoding` to `TokenEncoding`; made token counter acquisition async via `await getTokenCounter()`.
Packager Integration `src/core/packager.ts`	Removed `createMetricsTaskRunner` dependency, worker warm-up, and cleanup logic; flattened control flow to run metrics on main thread without task runner initialization.
Unit Tests `tests/core/metrics/TokenCounter.test.ts`, `tests/core/metrics/calculateGitDiffMetrics.test.ts`, `tests/core/metrics/calculateGitLogMetrics.test.ts`, `tests/core/metrics/calculateMetrics.test.ts`, `tests/core/metrics/calculateOutputMetrics.test.ts`, `tests/core/metrics/calculateSelectiveFileMetrics.test.ts`, `tests/core/metrics/diffTokenCount.test.ts`, `tests/core/packager.test.ts`, `tests/core/packager/diffsFunctionality.test.ts`, `tests/core/packager/splitOutput.test.ts`, `tests/integration-tests/packager.test.ts`	Removed mocks of `createMetricsTaskRunner` and worker-based token counting; replaced with real or mocked `TokenCounter` instances; updated assertions to verify main-thread execution and looser token count checks instead of exact values and worker call assertions.

Sequence Diagram

sequenceDiagram
    participant Packager as Packager
    participant CalcMetrics as calculateMetrics
    participant CalcFiles as calculateSelectiveFileMetrics
    participant CalcGit as calculateGitDiffMetrics<br/>(and calculateGitLogMetrics)
    participant CalcOutput as calculateOutputMetrics
    participant Factory as getTokenCounter Factory
    participant Counter as TokenCounter

    Packager->>CalcMetrics: calculateMetrics()
    CalcMetrics->>CalcFiles: calculateSelectiveFileMetrics()
    CalcFiles->>Factory: await getTokenCounter(encoding)
    Factory->>Counter: new TokenCounter(encoding)
    Counter->>Counter: await init() [lazy-load gpt-tokenizer module]
    Factory-->>CalcFiles: initialized TokenCounter
    CalcFiles->>Counter: countTokens(content)
    Counter-->>CalcFiles: token count
    CalcFiles-->>CalcMetrics: fileTokenCounts
    
    par Parallel Execution
        CalcMetrics->>CalcOutput: calculateOutputMetrics()
        CalcOutput->>Factory: await getTokenCounter(encoding)
        Factory-->>CalcOutput: TokenCounter
        CalcOutput->>Counter: countTokens(content)
        Counter-->>CalcOutput: token count
        
        CalcMetrics->>CalcGit: calculateGitDiffMetrics()
        CalcGit->>Factory: await getTokenCounter(encoding)
        Factory-->>CalcGit: TokenCounter
        CalcGit->>Counter: countTokens(content)
        Counter-->>CalcGit: token count
    end
    
    CalcMetrics-->>Packager: metrics result (main thread, no workers)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

PR #1302 — This PR directly undoes and replaces the worker-pool and tiktoken-based token counting approach introduced in that PR with an async gpt-tokenizer-based main-thread solution.
PR #260 — Modifies token-count encoding configuration, schema types, and TokenCounter encoding type usage across the same metrics files.
PR #851 — Modifies the core metrics token-counting pipeline and worker/task-runner approach, affecting the same files like calculateGitDiffMetrics, calculateSelectiveFileMetrics, and worker modules.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: replacing tiktoken with gpt-tokenizer for performance improvement, which is the primary objective of the entire changeset.
Description check	✅ Passed	The PR description follows the template, includes a clear summary of changes, lists specific modifications, provides benchmark data, and has both checklist items marked as completed.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/replace-tiktoken-with-gpt-tokenizer

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-03-28T13:39:38Z

🔍 Code Review — `perf(core): Replace tiktoken WASM with gpt-tokenizer`

Great performance win! The architecture simplification (removing worker pool, moving to pure JS tokenizer) is well-motivated and the benchmark numbers are compelling (~0.5–0.7s across platforms). A few findings below, ordered by severity.

1. `countTokensPlainText()` is dead code — special token files silently return 0 tokens

countTokens() calls this.countFn(content) with no options. If gpt-tokenizer's default mode throws on special token sequences like <|endoftext|>, the catch block returns 0. The countTokensPlainText() method exists to handle this exact case (using FALLBACK_OPTIONS), but it is never called anywhere in the codebase.

I see from the commit history that this was deliberate — a V8 deoptimization concern with complex catch blocks. However, the current state means files containing special token patterns (~0.1% of repos per the comment) will silently report 0 tokens rather than an approximate count. This is a behavioral regression from tiktoken, which always used encode(content, [], []).

Suggestion

Consider having callers (e.g., calculateSelectiveFileMetrics) call countTokensPlainText() as a retry when countTokens() returns 0 for non-empty content. This keeps the hot path clean while still handling the edge case:

let tokenCount = counter.countTokens(file.content, file.path);
if (tokenCount === 0 && file.content.length > 0) {
  tokenCount = counter.countTokensPlainText(file.content, file.path);
}

2. Race condition in `getTokenCounter()` — deliberately kept but worth documenting

The check-then-await-then-set pattern in tokenCounterFactory.ts can create duplicate TokenCounter instances under concurrent calls. The commit history shows this was fixed and then reverted (commit 7d25a9d) because "getTokenCounter is effectively serialized by the pipeline."

This is a reasonable tradeoff, but calculateMetrics.ts does run output/git metrics in parallel via Promise.all, all calling getTokenCounter. Since the counter is already warm from the earlier calculateSelectiveFileMetrics call, the race window is practically closed. A brief comment in tokenCounterFactory.ts documenting this design decision would prevent future contributors from re-discovering and re-fixing this.

3. `calculateMetricsWorker.ts` is now dead code

The worker file was updated to use the new async getTokenCounter API, but no caller dispatches to it anymore — the TaskRunner/worker pool was removed. The worker type 'calculateMetrics' likely still exists in processConcurrency.ts and unifiedWorker.ts. Consider either removing the dead worker infrastructure or adding a comment explaining it's retained intentionally.

4. `TOKEN_ENCODINGS` creates a cross-feature dependency

src/config/configSchema.ts imports TOKEN_ENCODINGS from src/core/metrics/TokenCounter.ts. Per CLAUDE.md, the project should "avoid dependencies between features." The config layer now depends on a core/metrics implementation detail. Moving TOKEN_ENCODINGS to src/shared/ or defining it directly in configSchema.ts would fix this.

5. Test assertions are weak — `toBeGreaterThan(0)` everywhere

Nearly all non-empty token count assertions changed from exact values to > 0. A broken implementation returning 1 for all inputs would pass the full test suite. Consider pinning a few stable exact counts (e.g., 'Hello, world!' → expected N tokens with o200k_base) as regression anchors. This was also flagged by CodeRabbit.

6. Minor: Error context dropped in catch blocks

} catch {   // error variable not bound
  logger.warn(`Failed to count tokens. path: ${filePath}`);
  return 0;
}

The error object is silently discarded. This is intentional for V8 optimization, but logging the error at trace level would preserve debuggability without affecting the hot path optimization.

Additional minor observations

freeTokenCounters() doesn't clear encodingModules Map in TokenCounter.ts. For CLI usage this is fine, but for MCP server (long-running process), encoding modules persist forever. Consider clearing both maps in freeTokenCounters().
{ ...defaultDeps, ...deps } spread happens on every metrics function call even when deps is empty (the common production case). Minor allocation overhead.
No test for non-default encodings (cl100k_base, p50k_base, r50k_base). All tests use o200k_base.

Overall this is a solid performance improvement with clean architecture simplification. The main concern is the silent 0-token regression for files with special token sequences (#1 above). The rest are minor improvements that could be addressed in follow-up PRs.

🤖 Generated with Claude Code

claude · 2026-03-28T14:21:51Z

Code Review — `perf(core): Replace tiktoken WASM with gpt-tokenizer`

Good iteration since the last review — the special token fallback in calculateSelectiveFileMetrics (commit 187cd03) addresses the biggest concern. A few new/remaining findings below.

1. `countTokensPlainText` fallback missing in output/git metrics paths

The fallback (countTokens → countTokensPlainText when result is 0 for non-empty content) was added to calculateSelectiveFileMetrics but not to calculateOutputMetrics, calculateGitDiffMetrics, or calculateGitLogMetrics. The packed output embeds all file contents, so if any file contains special token sequences like <|endoftext|>, the entire output token count could silently return 0 — while per-file counts correctly fall back. This would produce confusing results where totalTokens is 0 but individual file token counts are non-zero.

Consider using countTokensPlainText directly in these callers (since the comment notes it's actually faster — it skips the special token regex scan), or applying the same fallback pattern.

2. JSDoc references non-existent method `countTokensSafe()`

In src/core/metrics/TokenCounter.ts, the JSDoc on countTokens() says:

Use countTokensSafe() to handle all files correctly.

The actual method is countTokensPlainText(). Stale reference that will confuse callers.

3. No test coverage for `countTokensPlainText()` or the fallback path

Details

countTokensPlainText() is a new public method with no unit test in TokenCounter.test.ts
The fallback path in calculateSelectiveFileMetrics (lines 37–39: retry with countTokensPlainText when countTokens returns 0 for non-empty content) has no test either
A test with content containing <|endoftext|> would exercise both paths and serve as a regression anchor for the core behavioral change in this PR

4. Asymmetric error handling: git diff rethrows, git log swallows

calculateGitDiffMetrics rethrows errors (the catch block logs then throw error), while calculateGitLogMetrics catches and returns { gitLogTokenCount: 0 }. Both are called from Promise.all in calculateMetrics. If the git diff path throws, the entire metrics calculation fails — but the same error in git logs is silently swallowed. This asymmetry looks unintentional.

Previously flagged (still open)

Items 3–6 from the previous review are still applicable:

Dead worker code (calculateMetricsWorker.ts + references in processConcurrency.ts/unifiedWorker.ts)
Cross-feature dependency (config → core/metrics for TOKEN_ENCODINGS)
Weak test assertions (toBeGreaterThan(0) everywhere)
Error context dropped in catch blocks

Overall this is well-architected and the benchmark improvements are solid. Items 1 and 2 above are the most actionable for this PR; the rest could be follow-ups.

🤖 Generated with Claude Code

…rics pipeline Replace tiktoken (WASM-based) with gpt-tokenizer (pure JS) for token counting. This eliminates ~200ms WASM initialization overhead and removes the need for a dedicated worker pool for metrics calculation. - Swap tiktoken dependency for gpt-tokenizer in package.json - Rewrite TokenCounter to use gpt-tokenizer's encode API - Remove TaskRunner/worker pool infrastructure from calculateMetrics - Remove metricsTaskRunner pre-warming from packager pipeline - Update TokenEncoding type in config schema - Simplify all metrics calculation modules to use direct token counting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ve fallback Replace unsafe type assertion (`val as TokenEncoding`) in config schema with Zod `.enum()` validation using a shared `TOKEN_ENCODINGS` constant. This prevents arbitrary strings from reaching the dynamic import in `TokenCounter.loadEncoding()`. Also remove the ineffective fallback in `TokenCounter.countTokens()` that re-called the same cached function after it already threw, making the retry logic dead code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…e condition Restore special token handling that was lost in the tiktoken-to-gpt-tokenizer migration. Files containing sequences like <|endoftext|> now correctly fall back to countTokens with allowedSpecial: 'all' instead of returning 0. Also memoize in-flight initialization promises in getTokenCounter() to prevent duplicate TokenCounter instances when called concurrently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…token counting Replace the two-phase approach (fast path without options → fallback with allowedSpecial: 'all') with a single call using { disallowedSpecial: new Set() }. The previous approach caused benchmark regressions (macOS +51%, Windows +7.8%) because gpt-tokenizer's default disallowedSpecial: 'all' throws on any text matching special token patterns, triggering the costly fallback on many files. Additionally, allowedSpecial: 'all' had incorrect semantics — it counted <|endoftext|> as 1 control token instead of 7 text tokens, diverging from the old tiktoken behavior. Using { disallowedSpecial: new Set() } treats all content as plain text, matching tiktoken's encode(content, [], []). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Avoid passing options to gpt-tokenizer's countTokens() on every call. When options are provided, gpt-tokenizer calls processSpecialTokens() each time instead of using its pre-cached defaultSpecialTokenConfig, adding significant per-call overhead. Use the no-options fast path by default, and fall back to { disallowedSpecial: new Set() } only for the rare files (~0.1%) that contain special token sequences like <|endoftext|>. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tTokens V8 deoptimizes the entire try block when the catch block contains non-trivial logic (method calls, nested try/catch). Local benchmarks showed +25% regression (~1050ms → ~1330ms) from adding a fallback retry in the catch path. Keep countTokens() catch block minimal (log + return 0) to match the structure V8 optimizes well. Provide special token handling as a separate countTokensPlainText() method that callers can use when they know content contains special token sequences. Also revert tokenCounterFactory race condition fix — the IIFE-based Promise memoization added complexity without measurable benefit since getTokenCounter is effectively serialized by the pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… metrics Add p50k_edit to TOKEN_ENCODINGS for backward compatibility with users who had this tiktoken encoding in their config. Add countTokensPlainText() method that uses { disallowedSpecial: new Set() } to match tiktoken's encode(content, [], []) behavior, treating special tokens like <|endoftext|> as ordinary text. In the file metrics hot loop, retry with countTokensPlainText() when countTokens() returns 0 for non-empty files, handling the rare (~0.1%) files containing special token sequences without affecting hot path performance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…st assertions Use countTokensPlainText() for git diff and git log token counting to correctly handle special token sequences in diffs. These are cold paths (1-2 calls each) so the processSpecialTokens overhead is negligible. Strengthen test assertions from toBeGreaterThan(0) to exact token counts verified against gpt-tokenizer's o200k_base encoding. This catches encoding regressions and tokenizer implementation drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The output content often contains special token sequences like <|endoftext|> from packed source files (e.g., TokenCounter.ts comments, release notes). Using countTokens() on this content causes gpt-tokenizer to throw, silently returning 0 for the entire output — making Total Tokens show as 0 in the summary. Switch calculateOutputMetrics and calculateMetricsWorker to use countTokensPlainText() which matches tiktoken's encode(content, [], []) behavior, treating all content as ordinary text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude · 2026-03-28T15:25:41Z

Code Review — `perf(core): Replace tiktoken WASM with gpt-tokenizer`

Nice performance win! The removal of WASM initialization overhead and worker pool complexity is a solid simplification. A few observations worth considering:

Findings

1. Potentially dead code: calculateMetricsWorker.ts and related infrastructure
Severity: Medium

The PR removes all TaskRunner usage for metrics from the main code path, but calculateMetricsWorker.ts was updated (not removed). It's still registered in src/shared/unifiedWorker.ts (line 52-55) as workerType: 'calculateMetrics' and referenced in inferWorkerTypeFromTask. Since no code creates a TaskRunner with this worker type anymore, this is dead code. Consider removing or adding a TODO.

2. Inconsistent use of countTokens vs countTokensPlainText
Severity: Low

In calculateSelectiveFileMetrics.ts, there's a two-phase approach: try countTokens() first, fall back to countTokensPlainText() if result is 0. But calculateOutputMetrics, calculateGitDiffMetrics, calculateGitLogMetrics, and the worker all use only countTokensPlainText().

The comment in TokenCounter.ts says countTokens() is the "fast path" that may fail on special tokens. If countTokensPlainText() is also described as "faster than the default" (line 16-17), it's unclear what performance benefit the two-phase approach provides for file metrics. Consider either using countTokensPlainText() consistently everywhere, or documenting why the two-phase approach is needed only for file metrics.

3. Two-level caching may be redundant
Severity: Low

There are two caches: encodingModules (module-level Map in TokenCounter.ts) and tokenCounters (module-level Map in tokenCounterFactory.ts). Since TokenCounter.init() calls loadEncoding() which checks encodingModules, and the factory caches TokenCounter instances, the encoding module cache is effectively redundant when going through the factory. Not a bug, but worth noting for clarity.

4. Sequential file metrics processing
Severity: Info

calculateSelectiveFileMetrics switched from parallel Promise.all with worker threads to a sequential for loop on the main thread. The benchmarks show this is faster for typical repos (WASM overhead > parallelism benefit), but for very large repos with thousands of files, this blocks the main thread. The benchmarks validate the current approach, so this is fine as-is — just worth monitoring if users report slowness on very large codebases.

Overall this is a clean, well-structured PR that achieves meaningful perf improvements through simplification. The main actionable item is cleaning up the now-unused calculateMetricsWorker.ts and its registration in unifiedWorker.ts.

🤖 Generated with Claude Code

devin-ai-integration

Devin Review found 1 new potential issue.

View 12 additional findings in Devin Review.

devin-ai-integration · 2026-03-28T15:27:06Z

+export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => {
  let tokenCounter = tokenCounters.get(encoding);
  if (!tokenCounter) {
    tokenCounter = new TokenCounter(encoding);
+    await tokenCounter.init();
    tokenCounters.set(encoding, tokenCounter);
  }
  return tokenCounter;


🟡 Race condition in async getTokenCounter allows duplicate TokenCounter creation

The getTokenCounter function has a classic TOCTOU (time-of-check-time-of-use) race condition. It checks the map, then awaits init(), then sets the map. If two concurrent callers request the same encoding before either completes, both see the map as empty, both create and initialize a TokenCounter, and the second one overwrites the first in the map. The first caller gets a TokenCounter instance that is not stored in the cache.

The comment in calculateMetrics.ts:58-59 acknowledges sequential ordering is needed ("File metrics must run first... then output/git metrics can run in parallel since they share the cached TokenCounter"), but if calculateSelectiveFileMetrics returns early at calculateSelectiveFileMetrics.ts:24-26 (when filesToProcess.length === 0), the counter is never cached and the subsequent Promise.all at calculateMetrics.ts:67 triggers concurrent calls to getTokenCounter from calculateOutputMetrics, calculateGitDiffMetrics, and calculateGitLogMetrics.

The fix is to cache the Promise<TokenCounter> instead of the resolved value, ensuring all concurrent callers share the same initialization promise.

Suggested change

export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => {

let tokenCounter = tokenCounters.get(encoding);

if (!tokenCounter) {

tokenCounter = new TokenCounter(encoding);

await tokenCounter.init();

tokenCounters.set(encoding, tokenCounter);

}

return tokenCounter;

export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => {

let pending = tokenCounterPromises.get(encoding);

if (!pending) {

pending = (async () => {

const tokenCounter = new TokenCounter(encoding);

await tokenCounter.init();

tokenCounters.set(encoding, tokenCounter);

return tokenCounter;

})();

tokenCounterPromises.set(encoding, pending);

}

return pending;

};

Was this helpful? React with 👍 or 👎 to provide feedback.

The calculateMetricsWorker and its 'calculateMetrics' worker type are no longer used — all token counting now runs on the main thread via gpt-tokenizer. Remove the worker file, its case branches in processConcurrency and unifiedWorker, and the associated tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use the same retry pattern as file metrics: try countTokens() first (fast path, uses cached defaultSpecialTokenConfig), and only fall back to countTokensPlainText() when the result is 0 for non-empty content. This avoids processSpecialTokens() overhead for repos without special token sequences while still correctly counting tokens when they appear. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yamadashy · 2026-03-28T15:54:36Z

Background and Change of Approach

This PR replaced tiktoken (WASM) with gpt-tokenizer (pure JS) and removed the worker pool to run token counting directly on the main thread.

Issues Discovered

Output token count silently becomes 0: Repomix's own source code (TokenCounter.ts comments) contains <|endoftext|>, which causes gpt-tokenizer's default behavior (disallowedSpecial: 'all') to throw — making Total Tokens report as 0
countTokensPlainText() overhead: Passing { disallowedSpecial: new Set() } forces gpt-tokenizer to call processSpecialTokens() on every invocation. For the ~3.7MB output content, this adds significant latency
V8 inlining issue: Adding retry logic (calling countTokensPlainText()) inside the countTokens() catch block inflates bytecode size beyond V8 TurboFan's inlining budget, causing +25% performance regression. Modern V8 doesn't "deoptimize try blocks" (that was Crankshaft-era), but catch block complexity still affects inlining decisions and generated code size
Loss of worker parallelism: Sequential main-thread processing performs comparably on Ubuntu, but regresses on macOS (+33%) and Windows (+10%) where worker pool parallelism provided more benefit

New Approach

Based on these findings, the optimal approach is to keep the worker pool and only swap the tokenizer library. Created #1350 with this approach:

Use { disallowedSpecial: new Set() } in workers (no V8 inlining impact in worker context)
Preserve parallel chunk processing and pre-warming infrastructure
Only eliminate WASM initialization overhead (~200ms)

Learnings from this PR (Zod enum validation, TOKEN_ENCODINGS constant, p50k_edit support, exact token count test assertions) are carried forward into #1350.