Skip to content

perf(core): Replace tiktoken WASM with gpt-tokenizer and move token counting to main thread#1334

Closed
yamadashy wants to merge 1 commit intomainfrom
perf/replace-tiktoken-wasm
Closed

perf(core): Replace tiktoken WASM with gpt-tokenizer and move token counting to main thread#1334
yamadashy wants to merge 1 commit intomainfrom
perf/replace-tiktoken-wasm

Conversation

@yamadashy
Copy link
Copy Markdown
Owner

@yamadashy yamadashy commented Mar 28, 2026

Summary

Replace tiktoken (WASM-based) with gpt-tokenizer (pure JavaScript) for token counting, and move all token counting from worker threads to the main thread.

  • Replace tiktoken with gpt-tokenizer: Pure JS implementation eliminates WASM initialization overhead and native dependency complexity. Supports all encodings (o200k_base, cl100k_base, p50k_base, r50k_base) with identical token counts.
  • Move token counting to main thread: With a pure JS tokenizer, worker thread overhead (structured clone serialization of file content, IPC, pool management) exceeds computation cost. Eliminates metrics worker pool entirely.
  • Simplify metrics API: All calculate*Metrics functions use direct TokenCounter via getTokenCounter() instead of taskRunner dependency.
  • Cache secretlint config at worker module level: Config is stateless and identical for every task; lazy-initialized once per worker instead of recreated per task.

Benchmark results (15 runs each, packing repomix repo)

Platform Improvement
Ubuntu -0.73s
Windows -0.74s
macOS -0.52s

Additional benefits

  • Removes native WASM dependency (better portability, simpler CI)
  • Reduces per-worker memory (no WASM allocation per thread)
  • Simpler architecture (no metrics worker pool lifecycle)
  • Net -705 lines of code

Checklist

  • Run npm run test
  • Run npm run lint

Open with Devin

…ounting to main thread

Replace tiktoken (WASM-based) with gpt-tokenizer (pure JavaScript) for token
counting, and move all token counting from worker threads to the main thread.

Key changes:
- Replace tiktoken with gpt-tokenizer: Pure JS implementation eliminates WASM
  initialization overhead and native dependency complexity. Supports all
  encodings (o200k_base, cl100k_base, p50k_base, r50k_base) with identical
  token counts.
- Move token counting to main thread: With pure JS tokenizer, worker thread
  overhead (structured clone serialization of file content, IPC, pool management)
  exceeds computation cost. Eliminates metrics worker pool entirely.
- Remove metrics worker pool pre-warming: No longer needed without WASM init.
- Cache secretlint config at worker module level: Config is stateless and
  identical for every task; lazy-initialized once per worker instead of
  recreated per task.
- Simplify metrics API: All calculate*Metrics functions use direct
  TokenCounter via getTokenCounter() instead of taskRunner dependency.

Benchmark (15 runs each, packing repomix repo):

| Branch               | Median  | P25     | P75     |
|----------------------|---------|---------|---------|
| Before (tiktoken)    | 2848ms  | 2800ms  | 2907ms  |
| After (gpt-tokenizer)| 2653ms  | 2621ms  | 2724ms  |
| Improvement          | -195ms (-6.8%) |   |         |

Additional benefits:
- Removes native WASM dependency (better portability, simpler CI)
- Reduces per-worker memory (no WASM allocation per thread)
- Simpler architecture (no metrics worker pool lifecycle)
- Net -702 lines of code

https://claude.ai/code/session_01XSFjENwp9zyfRrTNtEMa2w
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 28, 2026

⚡ Performance Benchmark

Latest commit:ff21257 perf(core): Replace tiktoken WASM with gpt-tokenizer and move token counting to main thread
Status:✅ Benchmark complete!
Ubuntu:2.20s (±0.01s) → 1.57s (±0.01s) · -0.63s (-28.5%)
macOS:1.16s (±0.07s) → 0.91s (±0.02s) · -0.25s (-21.8%)
Windows:2.68s (±0.01s) → 2.04s (±0.03s) · -0.64s (-23.7%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded)
  • Measurement: 10 runs / 20 on macOS (median ± IQR)
  • Workflow run

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough

Walkthrough

This PR replaces the tiktoken library with gpt-tokenizer and refactors token counting from a worker-pool-based architecture to synchronous main-thread execution with lazy async initialization. The TokenCounter now dynamically imports encoding modules on demand and caches the token-counting function, while all metrics calculations run directly on the main thread rather than distributing work across task runners.

Changes

Cohort / File(s) Summary
Dependency Management
package.json
Replaced tiktoken with gpt-tokenizer as a runtime dependency.
Token Counter Core
src/core/metrics/TokenCounter.ts, src/core/metrics/tokenCounterFactory.ts
Replaced synchronous tiktoken encoding with async lazy initialization that dynamically imports gpt-tokenizer/encoding/${encodingName}. Introduced TokenEncoding type, async init() method, and caching of token-counting functions. Changed getTokenCounter from synchronous to async and updated to initialize instances before returning.
Configuration & Types
src/config/configSchema.ts, src/core/metrics/workers/calculateMetricsWorker.ts
Updated imports and type casts to use TokenEncoding instead of TiktokenEncoding.
Metrics Calculation Functions
src/core/metrics/calculateGitDiffMetrics.ts, src/core/metrics/calculateGitLogMetrics.ts, src/core/metrics/calculateOutputMetrics.ts, src/core/metrics/calculateSelectiveFileMetrics.ts
Replaced worker-based task runner dependencies with main-thread getTokenCounter dependency. Removed TaskRunner parameters and replaced async parallelization with synchronous token counting via TokenCounter.countTokens().
Metrics Pipeline
src/core/metrics/calculateMetrics.ts
Removed createMetricsTaskRunner export and all task runner wiring. Updated defaultDeps to use getTokenCounter instead of taskRunner. Changed control flow to compute selective file metrics first, then run output and git metrics concurrently.
Packager & Worker Integration
src/core/packager.ts, src/core/security/workers/securityCheckWorker.ts
Removed metrics task runner creation, warm-up, and cleanup logic from packager. Updated progress reporting stages. Introduced config caching in security worker to avoid per-task instantiation.
Test Suite Updates
tests/core/metrics/TokenCounter.test.ts, tests/core/metrics/calculateGitDiffMetrics.test.ts, tests/core/metrics/calculateGitLogMetrics.test.ts, tests/core/metrics/calculateOutputMetrics.test.ts, tests/core/metrics/calculateSelectiveFileMetrics.test.ts, tests/core/metrics/calculateMetrics.test.ts, tests/core/metrics/diffTokenCount.test.ts, tests/core/packager*.test.ts, tests/integration-tests/packager.test.ts
Replaced mocked taskRunner and worker-based token counting with TokenCounter and getTokenCounter mocks. Updated assertions to verify main-thread token counting. Removed deterministic token-count assertions in favor of > 0 checks. Updated error scenarios to simulate getTokenCounter failures.

Sequence Diagram

sequenceDiagram
    actor Client
    participant PackagerModule as Packager
    participant MetricsCalc as calculateMetrics
    participant TokenFactory as getTokenCounter
    participant TokenCounter as TokenCounter
    participant Encoding as gpt-tokenizer/encoding

    Client->>PackagerModule: pack(config, deps)
    PackagerModule->>MetricsCalc: calculateMetrics(config, files, gitData)
    
    rect rgba(100, 150, 200, 0.5)
    Note over MetricsCalc: Process selective files
    MetricsCalc->>TokenFactory: getTokenCounter('o200k_base')
    TokenFactory->>TokenCounter: new TokenCounter('o200k_base')
    TokenFactory->>TokenCounter: init()
    TokenCounter->>Encoding: import encoding module
    Encoding-->>TokenCounter: countTokens function
    TokenCounter-->>TokenFactory: initialized instance
    TokenFactory-->>MetricsCalc: cached TokenCounter
    end
    
    rect rgba(100, 150, 200, 0.5)
    Note over MetricsCalc: Process output metrics
    MetricsCalc->>TokenFactory: getTokenCounter('o200k_base')
    TokenFactory-->>MetricsCalc: cached TokenCounter (no reinit)
    MetricsCalc->>TokenCounter: countTokens(content)
    TokenCounter-->>MetricsCalc: token count
    end
    
    rect rgba(100, 150, 200, 0.5)
    Note over MetricsCalc: Process git metrics
    par Git Diffs
        MetricsCalc->>TokenCounter: countTokens(diffContent)
    and Git Logs
        MetricsCalc->>TokenCounter: countTokens(logContent)
    end
    end
    
    MetricsCalc-->>PackagerModule: metrics result
    PackagerModule-->>Client: packaged output
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main changes: replacing tiktoken with gpt-tokenizer and moving token counting to the main thread.
Description check ✅ Passed The description provides a comprehensive summary with detailed explanations, benchmark results, and benefits. Both required checklist items are marked as completed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/replace-tiktoken-wasm

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 91.80328% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.63%. Comparing base (d762d38) to head (ff21257).
⚠️ Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
src/core/metrics/TokenCounter.ts 78.57% 6 Missing ⚠️
src/core/metrics/calculateMetrics.ts 95.65% 1 Missing ⚠️
src/core/metrics/tokenCounterFactory.ts 66.66% 1 Missing ⚠️
src/core/metrics/workers/calculateMetricsWorker.ts 0.00% 1 Missing ⚠️
src/core/packager.ts 96.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
- Coverage   87.13%   86.63%   -0.50%     
==========================================
  Files         115      115              
  Lines        4367     4369       +2     
  Branches     1015     1015              
==========================================
- Hits         3805     3785      -20     
- Misses        562      584      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@cloudflare-workers-and-pages
Copy link
Copy Markdown

Deploying repomix with  Cloudflare Pages  Cloudflare Pages

Latest commit: ff21257
Status: ✅  Deploy successful!
Preview URL: https://57fd9a7f.repomix.pages.dev
Branch Preview URL: https://perf-replace-tiktoken-wasm.repomix.pages.dev

View logs

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the tiktoken library with gpt-tokenizer and migrates token counting from a worker pool to the main thread. It introduces lazy loading for encoding modules and optimizes the packaging process by parallelizing git operations with file collection. Feedback indicates that the fallback logic for special tokens in TokenCounter is incorrectly implemented and suggests removing the now-obsolete metrics worker directory.

Comment on lines +51 to 69
} catch {
// Fallback: try with allowedSpecial for files containing special tokens
try {
const mod = encodingModules.get(this.encodingName);
if (mod) {
return mod(content);
}
} catch {
// ignore
}

if (filePath) {
logger.warn(`Failed to count tokens. path: ${filePath}, error: ${message}`);
logger.warn(`Failed to count tokens. path: ${filePath}`);
} else {
logger.warn(`Failed to count tokens. error: ${message}`);
logger.warn('Failed to count tokens.');
}

return 0;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The fallback logic for handling special tokens is not implemented as the comment suggests. The current implementation re-attempts the same failing countTokens function, which will throw an error again. This error is then silently ignored by the inner catch block, causing the function to incorrectly return 0 for files that contain special tokens.

To correctly implement the fallback, you should use the encode function from gpt-tokenizer with the { allowedSpecial: 'all' } option. This requires a small refactor to make the encode function available synchronously within countTokens.

I recommend updating loadEncoding to cache and return an object containing both countTokens and encode functions. This object can then be stored in the TokenCounter instance during init() and used in the fallback path.

} catch (initialError) {
  // Fallback: try with allowedSpecial for files containing special tokens
  try {
    // This assumes `this.tokenizer.encode` is available on the class instance,
    // loaded and cached during `init()`.
    return this.tokenizer.encode(content, { allowedSpecial: 'all' }).length;
  } catch (fallbackError) {
    // If the fallback also fails, log both errors for debugging.
    const pathInfo = filePath ? `path: ${filePath}, ` : '';
    logger.warn(
      `Failed to count tokens. ${pathInfo}Initial error: ${initialError}, Fallback error: ${fallbackError}`,
    );
  }

  return 0;
}

Comment on lines 1 to 11
import { logger, setLogLevelByWorkerData } from '../../../shared/logger.js';
import type { TokenEncoding } from '../TokenCounter.js';
import { freeTokenCounters, getTokenCounter } from '../tokenCounterFactory.js';

/**
* Simple token counting worker for metrics calculation.
*
* This worker provides a focused interface for counting tokens from text content,
* using the Tiktoken encoding. All complex metric calculation logic is handled
* using gpt-tokenizer. All complex metric calculation logic is handled
* by the calling side to maintain separation of concerns.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Given that this pull request's summary states it 'Eliminates metrics worker pool entirely', this worker file and its containing directory (src/core/metrics/workers) appear to be obsolete.

To fully align with the new single-threaded architecture for metrics calculation and simplify the codebase, please consider removing this file and the workers directory.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment on lines +51 to +57
} catch {
// Fallback: try with allowedSpecial for files containing special tokens
try {
const mod = encodingModules.get(this.encodingName);
if (mod) {
return mod(content);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Fallback for special tokens retries with identical call, never succeeding

In TokenCounter.countTokens, the outer catch block (line 51) is meant to handle files containing special token sequences like <|endoftext|> by retrying with allowedSpecial. However, the fallback on lines 53-56 retrieves the exact same function from encodingModules (stored at src/core/metrics/TokenCounter.ts:20 during loadEncoding) and calls it with the exact same content argument and no options. Since mod === this.countFn, the retry will always throw the same error.

The old tiktoken code handled this correctly: this.encoding.encode(content, [], []) — the empty arrays disabled special token validation. The gpt-tokenizer equivalent would be passing { allowedSpecial: 'all' } as the second argument to countTokens. Without this, any file containing special token sequences (e.g., tokenizer config files with <|endoftext|>) will silently return 0 tokens instead of being counted correctly.

Prompt for agents
In src/core/metrics/TokenCounter.ts, the catch block at lines 51-60 needs to be fixed. The fallback retry currently calls the same countTokens function with the same arguments, which will always fail identically.

The fix: import the ALL_SPECIAL_TOKENS constant from gpt-tokenizer (exported as a named export from gpt-tokenizer/encoding/<name>) and use it in the fallback call. The inner try block (lines 53-56) should pass allowedSpecial option to the countFn call, like:

  return mod(content, { allowedSpecial: new Set(['all']) });

or, if gpt-tokenizer exports an ALL_SPECIAL_TOKENS sentinel:

  import { AllSpecialTokens } from 'gpt-tokenizer';
  return mod(content, { allowedSpecial: AllSpecialTokens });

You may need to also update the type of countFn in the Map and loadEncoding to accept the second encodeOptions parameter, e.g.:
  type CountFn = (text: string, options?: { allowedSpecial?: Set<string> | 'all' }) => number;

This ensures files containing special token sequences like <|endoftext|> get their tokens counted correctly instead of silently returning 0.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 28, 2026

🔍 Code Review — PR #1334

Great PR! Replacing tiktoken WASM with gpt-tokenizer is a clean simplification that removes native dependency complexity and improves performance. The net -705 lines and benchmark improvements are compelling. A few items worth addressing:


1. 🐛 Dead fallback in TokenCounter.countTokens

src/core/metrics/TokenCounter.ts:51-60

The catch block's "fallback" retrieves encodingModules.get(this.encodingName) — but this.countFn was assigned from that same cache during init(). The fallback calls the exact same function with the exact same arguments, so it will always throw identically. The comment says "try with allowedSpecial" but no such option is passed.

This was meaningful with tiktoken (which supported re-encoding with allowedSpecial: 'all'), but got lost in translation. Either:

  • Investigate if gpt-tokenizer has an equivalent option for special token handling and implement a real fallback, or
  • Remove the dead catch block entirely
Details
// Current (dead code):
} catch {
  try {
    const mod = encodingModules.get(this.encodingName);
    if (mod) return mod(content); // same as this.countFn — will throw identically
  } catch { /* ignore */ }
  // ...
  return 0;
}

// Suggested — just remove the inner try/catch:
} catch {
  if (filePath) logger.warn(`Failed to count tokens. path: ${filePath}`);
  else logger.warn('Failed to count tokens.');
  return 0;
}

2. 🔒 Config schema doesn't validate encoding at runtime

src/config/configSchema.ts:125-128

The encoding field uses z.string().transform((val) => val as TokenEncoding) — the as cast provides zero runtime validation. Any string from user config passes through to the dynamic import() in TokenCounter.ts. While Node's module resolution constrains exploitability, using z.enum() is a straightforward fix:

encoding: z.enum(['o200k_base', 'cl100k_base', 'p50k_base', 'r50k_base']).default('o200k_base'),

Note: This was already the case before this PR (z.string()as TiktokenEncoding), but since the import is now from a config type into a dynamic import() path, tightening the validation is worth doing.


3. 🧹 calculateMetricsWorker.ts is now dead code

src/core/metrics/workers/calculateMetricsWorker.ts

No production code calls initTaskRunner with workerType: 'calculateMetrics' anymore, yet the worker file was updated in this PR and the worker type is still registered in unifiedWorker.ts and processConcurrency.ts. Consider removing it (and the corresponding type/routing) to avoid confusing future contributors.


4. 📐 Cross-feature import in configSchema.ts

src/config/configSchema.ts:2

import type { TokenEncoding } from '../core/metrics/TokenCounter.js';

Per project conventions, config should not depend on core/metrics. The TokenEncoding type is a simple union — consider defining it in shared/ or co-locating it in config/ to keep the dependency direction clean.


5. 🧪 Test precision weakened

Several tests changed from exact assertions (expect(result).toBe(8)) to toBeGreaterThan(0). Since the real tokenizer produces deterministic counts, consider pinning a few representative values (e.g., 'test content'2 tokens with o200k_base, as already done in calculateOutputMetrics.test.ts) to catch tokenizer regressions. At minimum, the "both workTree and staged diffs" test should verify that the combined result exceeds either individual part.


Minor / Informational

Details
  • Race in loadEncoding/getTokenCounter: Two concurrent callers could both see a cache miss and trigger duplicate dynamic imports. Currently safe due to the sequential-then-parallel flow in calculateMetrics, but fragile. Caching the Promise instead of the result would make it robust.

  • freeTokenCounters() is unreachable: No live code path calls it since the metrics worker pool was removed. Consider removing it or wiring it up for MCP server lifecycle cleanup.

  • Main-thread blocking for large repos: The sequential for loop in calculateSelectiveFileMetrics blocks the event loop during tokenization. For typical repos the benchmarks show this is faster, but for very large repos the progress callbacks won't visually update until the loop completes. A periodic setImmediate yield could help if this becomes noticeable.


Overall this is a well-motivated change with solid benchmarks. The main actionable items are the dead fallback code (#1) and the schema validation (#2). Nice work on the simplification! 👍

🤖 Generated with Claude Code

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (4)
src/core/metrics/workers/calculateMetricsWorker.ts (1)

17-28: This keeps the old token-count worker path alive.

src/shared/unifiedWorker.ts:52-56 and src/shared/unifiedWorker.ts:98-101 still route { content, encoding } tasks here, so this module preserves a second execution path even though the rest of the refactor moved metrics to the main thread. Please verify that no producer still emits TokenCountTask, or remove this registration so the main-thread path is the only supported one.

Also applies to: 43-50

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/metrics/workers/calculateMetricsWorker.ts` around lines 17 - 28, The
TokenCountTask/countTokens worker still remains registered and is being routed
from unifiedWorker, preserving an old execution path; either remove the worker
registration and any exports like TokenCountTask/countTokens so only the
main-thread metric path is used, or find and update all producers that emit
TokenCountTask to instead use the main-thread API (the unifiedWorker routing
locations around the code that sends { content, encoding }), ensuring no code
still produces TokenCountTask; update or delete the registration in this module
accordingly and run a project-wide search for TokenCountTask/countTokens to
confirm no producers remain.
src/core/metrics/tokenCounterFactory.ts (1)

12-17: Cache the in-flight initialization, not only the finished counter.

Because Line 16 awaits init() before Line 17 stores anything in tokenCounters, two concurrent calls for the same encoding can each allocate and initialize their own TokenCounter. That breaks the “one counter per encoding” guarantee and gives back some of the perf/memory win on parallel paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/metrics/tokenCounterFactory.ts` around lines 12 - 17,
getTokenCounter currently waits for tokenCounter.init() before storing the
instance in tokenCounters, allowing concurrent callers to each allocate/init
their own TokenCounter; change to cache the in-flight initialization so only one
init runs per encoding: when no entry exists in tokenCounters, create a
placeholder (e.g., store a Promise or a wrapper) before awaiting init, set that
placeholder into tokenCounters immediately, await the init on that stored value,
and ensure subsequent calls to getTokenCounter return the same stored
Promise/instance; refer to getTokenCounter, tokenCounters, TokenCounter, and
TokenCounter.init when making the change.
tests/core/metrics/TokenCounter.test.ts (1)

19-71: The new tests are too permissive for a tokenizer swap.

Every non-empty case only asserts > 0, so loading the wrong encoding module or returning a constant positive number would still pass. Please pin a small golden corpus with explicit counts—ideally including a special-token example—so the parity claim in this PR is actually regression-tested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/metrics/TokenCounter.test.ts` around lines 19 - 71, The tests are
too permissive—every non-empty input just asserts > 0—so update the
TokenCounter.test.ts cases to use a small golden corpus with deterministic
expected token counts: replace the loose assertions in tests that call
tokenCounter.countTokens (e.g., the "should correctly count tokens for simple
text", "multi-line text", "special characters", "unicode characters", "code
snippets", "markdown text", and "very long text" cases) with explicit
expect(...).toBe(<expectedCount>) assertions using known counts for each sample
string (include at least one sample that contains a special token/example from
your tokenizer to assert its exact tokenization), so the tests will fail if the
encoding module or tokenization logic changes unexpectedly.
tests/core/metrics/calculateGitDiffMetrics.test.ts (1)

108-117: These assertions no longer verify the aggregation logic.

toBeGreaterThan(0) still passes if calculateGitDiffMetrics() accidentally drops either workTreeDiffContent or stagedDiffContent. Computing the expected total with the same initialized counter, or injecting a trivial stub counter, keeps the test decoupled from tokenizer internals while still checking the sum.

✅ Example of a stronger assertion for the two-diff case
-      expect(result).toBeGreaterThan(0);
+      const counter = await mockGetTokenCounter();
+      expect(result).toBe(
+        counter.countTokens('work tree changes') + counter.countTokens('staged changes'),
+      );

Also applies to: 120-129, 132-141, 144-155

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/metrics/calculateGitDiffMetrics.test.ts` around lines 108 - 117,
The test currently only asserts result > 0 which doesn't verify aggregation;
instead instantiate or stub the token counters used by mockGetTokenCounter to
produce deterministic counts for each diff (e.g., a counter returning N tokens
for workTreeDiffContent and M tokens for stagedDiffContent), compute
expectedTotal = N + M and assert calculateGitDiffMetrics(..., { getTokenCounter:
mockGetTokenCounter }) === expectedTotal; apply the same stronger assertion
pattern to the other test cases that cover single/dual diffs (the tests around
the blocks referenced by this comment) so they verify explicit sums rather than
just > 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/config/configSchema.ts`:
- Around line 125-128: The schema currently uses encoding:
z.string().default('o200k_base').transform((val) => val as TokenEncoding) which
blindly casts any string to TokenEncoding; change the schema to explicitly
validate allowed encodings (e.g., use z.enum(...) or z.union of z.literal(...))
listing the supported TokenEncoding values, remove the unsafe transform/cast,
and ensure the default is one of those literals so invalid encoding strings are
rejected during config parsing (this prevents later failures in
getTokenCounter).

In `@src/core/metrics/calculateGitLogMetrics.ts`:
- Around line 40-42: The catch block in calculateGitLogMetrics currently logs
the error and returns { gitLogTokenCount: 0 }, which masks tokenizer failures
and undercounts packed output; change this to rethrow the error (or throw a new
Error with context) instead of returning zero so callers can handle failures
consistently (same behavior as calculateGitDiffMetrics). Keep the logger.error
call for diagnostics but remove the return and throw the original error (or a
wrapped error) from the catch in calculateGitLogMetrics.

In `@src/core/metrics/TokenCounter.ts`:
- Around line 46-68: The fallback currently re-invokes the same tokenizer and
returns 0, undercounting; update TokenCounter so that when this.countFn(content)
throws you retrieve the encoder module from
encodingModules.get(this.encodingName) and call the module's token-counting API
that enables special-token handling (e.g., the module's countTokens or
equivalent with allowedSpecial/processSpecialTokens options) instead of calling
the same failing function, capture and log the caught error details via
logger.warn including filePath, and only return 0 if the retry also fails.

In `@src/core/packager.ts`:
- Around line 93-111: The background gitPromise (created from deps.getGitDiffs
and deps.getGitLogs) can reject while the file collection (withMemoryLogging +
Promise.all over deps.collectFiles) fails first, producing an unhandled
rejection; fix by ensuring gitPromise is always awaited/consumed regardless of
the collectResults outcome: wrap the collection await in a try/finally (or
attach a .catch) and in the finally await gitPromise (or await
gitPromise.catch(() => {/* swallow/log */})) so errors from
deps.getGitDiffs/deps.getGitLogs are observed; reference the gitPromise variable
and the deps.getGitDiffs/deps.getGitLogs calls and ensure you still use the
resolved [gitDiffResult, gitLogResult] after successful collection.

In `@tests/core/metrics/calculateSelectiveFileMetrics.test.ts`:
- Around line 8-12: The current mockGetTokenCounter ignores the requested
encoding and only asserts counts are > 0, which doesn't verify per-file token
math; replace mockGetTokenCounter with a tiny deterministic fake token counter
that respects the encoding argument (or accepts encoding in its factory) and
returns exact token counts for the test strings (e.g., simple whitespace or
character-based counts) so the test can assert exact expected values instead of
> 0; update the assertions around mockGetTokenCounter and the tests referenced
(including the checks at lines with assertions 29-30 and 33-39) to compare
against those exact expected counts for each file rather than a generic positive
check.

---

Nitpick comments:
In `@src/core/metrics/tokenCounterFactory.ts`:
- Around line 12-17: getTokenCounter currently waits for tokenCounter.init()
before storing the instance in tokenCounters, allowing concurrent callers to
each allocate/init their own TokenCounter; change to cache the in-flight
initialization so only one init runs per encoding: when no entry exists in
tokenCounters, create a placeholder (e.g., store a Promise or a wrapper) before
awaiting init, set that placeholder into tokenCounters immediately, await the
init on that stored value, and ensure subsequent calls to getTokenCounter return
the same stored Promise/instance; refer to getTokenCounter, tokenCounters,
TokenCounter, and TokenCounter.init when making the change.

In `@src/core/metrics/workers/calculateMetricsWorker.ts`:
- Around line 17-28: The TokenCountTask/countTokens worker still remains
registered and is being routed from unifiedWorker, preserving an old execution
path; either remove the worker registration and any exports like
TokenCountTask/countTokens so only the main-thread metric path is used, or find
and update all producers that emit TokenCountTask to instead use the main-thread
API (the unifiedWorker routing locations around the code that sends { content,
encoding }), ensuring no code still produces TokenCountTask; update or delete
the registration in this module accordingly and run a project-wide search for
TokenCountTask/countTokens to confirm no producers remain.

In `@tests/core/metrics/calculateGitDiffMetrics.test.ts`:
- Around line 108-117: The test currently only asserts result > 0 which doesn't
verify aggregation; instead instantiate or stub the token counters used by
mockGetTokenCounter to produce deterministic counts for each diff (e.g., a
counter returning N tokens for workTreeDiffContent and M tokens for
stagedDiffContent), compute expectedTotal = N + M and assert
calculateGitDiffMetrics(..., { getTokenCounter: mockGetTokenCounter }) ===
expectedTotal; apply the same stronger assertion pattern to the other test cases
that cover single/dual diffs (the tests around the blocks referenced by this
comment) so they verify explicit sums rather than just > 0.

In `@tests/core/metrics/TokenCounter.test.ts`:
- Around line 19-71: The tests are too permissive—every non-empty input just
asserts > 0—so update the TokenCounter.test.ts cases to use a small golden
corpus with deterministic expected token counts: replace the loose assertions in
tests that call tokenCounter.countTokens (e.g., the "should correctly count
tokens for simple text", "multi-line text", "special characters", "unicode
characters", "code snippets", "markdown text", and "very long text" cases) with
explicit expect(...).toBe(<expectedCount>) assertions using known counts for
each sample string (include at least one sample that contains a special
token/example from your tokenizer to assert its exact tokenization), so the
tests will fail if the encoding module or tokenization logic changes
unexpectedly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 63ed3617-8c29-4a0e-bbe0-14b3ffd966a7

📥 Commits

Reviewing files that changed from the base of the PR and between d762d38 and ff21257.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (23)
  • package.json
  • src/config/configSchema.ts
  • src/core/metrics/TokenCounter.ts
  • src/core/metrics/calculateGitDiffMetrics.ts
  • src/core/metrics/calculateGitLogMetrics.ts
  • src/core/metrics/calculateMetrics.ts
  • src/core/metrics/calculateOutputMetrics.ts
  • src/core/metrics/calculateSelectiveFileMetrics.ts
  • src/core/metrics/tokenCounterFactory.ts
  • src/core/metrics/workers/calculateMetricsWorker.ts
  • src/core/packager.ts
  • src/core/security/workers/securityCheckWorker.ts
  • tests/core/metrics/TokenCounter.test.ts
  • tests/core/metrics/calculateGitDiffMetrics.test.ts
  • tests/core/metrics/calculateGitLogMetrics.test.ts
  • tests/core/metrics/calculateMetrics.test.ts
  • tests/core/metrics/calculateOutputMetrics.test.ts
  • tests/core/metrics/calculateSelectiveFileMetrics.test.ts
  • tests/core/metrics/diffTokenCount.test.ts
  • tests/core/packager.test.ts
  • tests/core/packager/diffsFunctionality.test.ts
  • tests/core/packager/splitOutput.test.ts
  • tests/integration-tests/packager.test.ts
💤 Files with no reviewable changes (6)
  • tests/core/packager.test.ts
  • tests/core/metrics/calculateMetrics.test.ts
  • tests/integration-tests/packager.test.ts
  • tests/core/packager/diffsFunctionality.test.ts
  • tests/core/packager/splitOutput.test.ts
  • tests/core/metrics/diffTokenCount.test.ts

Comment on lines 125 to +128
encoding: z
.string()
.default('o200k_base')
.transform((val) => val as TiktokenEncoding),
.transform((val) => val as TokenEncoding),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate supported encodings here instead of casting.

z.string().default(...).transform((val) => val as TokenEncoding) still accepts any arbitrary string, so a typo survives config parsing and only fails later when getTokenCounter initializes. Since this PR only supports a fixed encoding set, reject invalid values in the schema.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/configSchema.ts` around lines 125 - 128, The schema currently uses
encoding: z.string().default('o200k_base').transform((val) => val as
TokenEncoding) which blindly casts any string to TokenEncoding; change the
schema to explicitly validate allowed encodings (e.g., use z.enum(...) or
z.union of z.literal(...)) listing the supported TokenEncoding values, remove
the unsafe transform/cast, and ensure the default is one of those literals so
invalid encoding strings are rejected during config parsing (this prevents later
failures in getTokenCounter).

Comment on lines 40 to +42
} catch (error) {
logger.error('Failed to calculate git log metrics:', error);
return {
gitLogTokenCount: 0,
};
return { gitLogTokenCount: 0 };
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't turn tokenizer failures into a zero-token success.

When includeLogs is enabled, returning { gitLogTokenCount: 0 } here undercounts the actual packed output and can break split/limit decisions. This is also inconsistent with src/core/metrics/calculateGitDiffMetrics.ts, which rethrows the same class of failures.

🛠️ Suggested change
   } catch (error) {
     logger.error('Failed to calculate git log metrics:', error);
-    return { gitLogTokenCount: 0 };
+    throw error;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} catch (error) {
logger.error('Failed to calculate git log metrics:', error);
return {
gitLogTokenCount: 0,
};
return { gitLogTokenCount: 0 };
} catch (error) {
logger.error('Failed to calculate git log metrics:', error);
throw error;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/metrics/calculateGitLogMetrics.ts` around lines 40 - 42, The catch
block in calculateGitLogMetrics currently logs the error and returns {
gitLogTokenCount: 0 }, which masks tokenizer failures and undercounts packed
output; change this to rethrow the error (or throw a new Error with context)
instead of returning zero so callers can handle failures consistently (same
behavior as calculateGitDiffMetrics). Keep the logger.error call for diagnostics
but remove the return and throw the original error (or a wrapped error) from the
catch in calculateGitLogMetrics.

Comment on lines 46 to 68
try {
// Disable special token validation to handle files that may contain
// special token sequences (e.g., tokenizer configs with <|endoftext|>).
// This treats special tokens as ordinary text rather than control tokens,
// which is appropriate for general code/text analysis where we're not
// actually sending the content to an LLM API.
return this.encoding.encode(content, [], []).length;
} catch (error) {
let message = '';
if (error instanceof Error) {
message = error.message;
} else {
message = String(error);
// Call countTokens without options to avoid processSpecialTokens overhead.
// Files with special token sequences (<|endoftext|> etc.) are rare (~0.1%)
// and handled via try-catch fallback.
return this.countFn(content);
} catch {
// Fallback: try with allowedSpecial for files containing special tokens
try {
const mod = encodingModules.get(this.encodingName);
if (mod) {
return mod(content);
}
} catch {
// ignore
}

if (filePath) {
logger.warn(`Failed to count tokens. path: ${filePath}, error: ${message}`);
logger.warn(`Failed to count tokens. path: ${filePath}`);
} else {
logger.warn(`Failed to count tokens. error: ${message}`);
logger.warn('Failed to count tokens.');
}

return 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

The “fallback” path cannot recover from the original tokenizer error.

Lines 54-56 pull the same function back out of encodingModules and call it with the same input that already failed on Line 50. For any file that hits this path, we silently return 0, which undercounts totals and top-file metrics instead of handling the edge case described in the comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/metrics/TokenCounter.ts` around lines 46 - 68, The fallback
currently re-invokes the same tokenizer and returns 0, undercounting; update
TokenCounter so that when this.countFn(content) throws you retrieve the encoder
module from encodingModules.get(this.encodingName) and call the module's
token-counting API that enables special-token handling (e.g., the module's
countTokens or equivalent with allowedSpecial/processSpecialTokens options)
instead of calling the same failing function, capture and log the caught error
details via logger.warn including filePath, and only return 0 if the retry also
fails.

Comment on lines +93 to +111
// Start git diffs/logs in parallel with file collection - they only need rootDirs and config,
// not file contents, so there's no dependency between them
progressCallback('Collecting files and git info...');
const gitPromise = Promise.all([deps.getGitDiffs(rootDirs, config), deps.getGitLogs(rootDirs, config)]);

const collectResults = await withMemoryLogging(
'Collect Files',
async () =>
await Promise.all(
sortedFilePathsByDir.map(({ rootDir, filePaths }) =>
deps.collectFiles(filePaths, rootDir, config, progressCallback),
),
);
),
);

const rawFiles = collectResults.flatMap((curr) => curr.rawFiles);
const allSkippedFiles = collectResults.flatMap((curr) => curr.skippedFiles);
const rawFiles = collectResults.flatMap((curr) => curr.rawFiles);
const allSkippedFiles = collectResults.flatMap((curr) => curr.skippedFiles);

// Get git diffs if enabled - run this before security check
progressCallback('Getting git diffs...');
const gitDiffResult = await deps.getGitDiffs(rootDirs, config);
const [gitDiffResult, gitLogResult] = await gitPromise;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Drain the background git work when collection fails.

Line 96 starts a rejecting gitPromise, but Line 111 is the first place it is awaited. If Line 101's collection Promise.all fails first, an error from getGitDiffs() or getGitLogs() is never observed and can become an unhandled rejection.

💡 One way to keep the overlap without orphaning rejections
-  const gitPromise = Promise.all([deps.getGitDiffs(rootDirs, config), deps.getGitLogs(rootDirs, config)]);
+  const gitPromise = Promise.allSettled([deps.getGitDiffs(rootDirs, config), deps.getGitLogs(rootDirs, config)]);
@@
-  const [gitDiffResult, gitLogResult] = await gitPromise;
+  const [gitDiffResultResult, gitLogResultResult] = await gitPromise;
+  if (gitDiffResultResult.status === 'rejected') throw gitDiffResultResult.reason;
+  if (gitLogResultResult.status === 'rejected') throw gitLogResultResult.reason;
+  const gitDiffResult = gitDiffResultResult.value;
+  const gitLogResult = gitLogResultResult.value;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/packager.ts` around lines 93 - 111, The background gitPromise
(created from deps.getGitDiffs and deps.getGitLogs) can reject while the file
collection (withMemoryLogging + Promise.all over deps.collectFiles) fails first,
producing an unhandled rejection; fix by ensuring gitPromise is always
awaited/consumed regardless of the collectResults outcome: wrap the collection
await in a try/finally (or attach a .catch) and in the finally await gitPromise
(or await gitPromise.catch(() => {/* swallow/log */})) so errors from
deps.getGitDiffs/deps.getGitLogs are observed; reference the gitPromise variable
and the deps.getGitDiffs/deps.getGitLogs calls and ensure you still use the
resolved [gitDiffResult, gitLogResult] after successful collection.

Comment on lines +8 to 12
const mockGetTokenCounter = async () => {
const counter = new TokenCounter('o200k_base');
await counter.init();
return counter;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

These assertions no longer prove the per-file token math.

mockGetTokenCounter ignores the requested encoding, and checking only > 0 means a regression that hard-codes the encoding or returns the same positive count for every file would still pass. Use a tiny fake counter or exact expected counts so this remains a real metrics test.

Also applies to: 29-30, 33-39

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/metrics/calculateSelectiveFileMetrics.test.ts` around lines 8 -
12, The current mockGetTokenCounter ignores the requested encoding and only
asserts counts are > 0, which doesn't verify per-file token math; replace
mockGetTokenCounter with a tiny deterministic fake token counter that respects
the encoding argument (or accepts encoding in its factory) and returns exact
token counts for the test strings (e.g., simple whitespace or character-based
counts) so the test can assert exact expected values instead of > 0; update the
assertions around mockGetTokenCounter and the tests referenced (including the
checks at lines with assertions 29-30 and 33-39) to compare against those exact
expected counts for each file rather than a generic positive check.

@yamadashy yamadashy closed this Mar 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants