perf(core): Replace tiktoken WASM with gpt-tokenizer and move token counting to main thread#1334
perf(core): Replace tiktoken WASM with gpt-tokenizer and move token counting to main thread#1334
Conversation
…ounting to main thread Replace tiktoken (WASM-based) with gpt-tokenizer (pure JavaScript) for token counting, and move all token counting from worker threads to the main thread. Key changes: - Replace tiktoken with gpt-tokenizer: Pure JS implementation eliminates WASM initialization overhead and native dependency complexity. Supports all encodings (o200k_base, cl100k_base, p50k_base, r50k_base) with identical token counts. - Move token counting to main thread: With pure JS tokenizer, worker thread overhead (structured clone serialization of file content, IPC, pool management) exceeds computation cost. Eliminates metrics worker pool entirely. - Remove metrics worker pool pre-warming: No longer needed without WASM init. - Cache secretlint config at worker module level: Config is stateless and identical for every task; lazy-initialized once per worker instead of recreated per task. - Simplify metrics API: All calculate*Metrics functions use direct TokenCounter via getTokenCounter() instead of taskRunner dependency. Benchmark (15 runs each, packing repomix repo): | Branch | Median | P25 | P75 | |----------------------|---------|---------|---------| | Before (tiktoken) | 2848ms | 2800ms | 2907ms | | After (gpt-tokenizer)| 2653ms | 2621ms | 2724ms | | Improvement | -195ms (-6.8%) | | | Additional benefits: - Removes native WASM dependency (better portability, simpler CI) - Reduces per-worker memory (no WASM allocation per thread) - Simpler architecture (no metrics worker pool lifecycle) - Net -702 lines of code https://claude.ai/code/session_01XSFjENwp9zyfRrTNtEMa2w
⚡ Performance Benchmark
Details
|
📝 WalkthroughWalkthroughThis PR replaces the Changes
Sequence DiagramsequenceDiagram
actor Client
participant PackagerModule as Packager
participant MetricsCalc as calculateMetrics
participant TokenFactory as getTokenCounter
participant TokenCounter as TokenCounter
participant Encoding as gpt-tokenizer/encoding
Client->>PackagerModule: pack(config, deps)
PackagerModule->>MetricsCalc: calculateMetrics(config, files, gitData)
rect rgba(100, 150, 200, 0.5)
Note over MetricsCalc: Process selective files
MetricsCalc->>TokenFactory: getTokenCounter('o200k_base')
TokenFactory->>TokenCounter: new TokenCounter('o200k_base')
TokenFactory->>TokenCounter: init()
TokenCounter->>Encoding: import encoding module
Encoding-->>TokenCounter: countTokens function
TokenCounter-->>TokenFactory: initialized instance
TokenFactory-->>MetricsCalc: cached TokenCounter
end
rect rgba(100, 150, 200, 0.5)
Note over MetricsCalc: Process output metrics
MetricsCalc->>TokenFactory: getTokenCounter('o200k_base')
TokenFactory-->>MetricsCalc: cached TokenCounter (no reinit)
MetricsCalc->>TokenCounter: countTokens(content)
TokenCounter-->>MetricsCalc: token count
end
rect rgba(100, 150, 200, 0.5)
Note over MetricsCalc: Process git metrics
par Git Diffs
MetricsCalc->>TokenCounter: countTokens(diffContent)
and Git Logs
MetricsCalc->>TokenCounter: countTokens(logContent)
end
end
MetricsCalc-->>PackagerModule: metrics result
PackagerModule-->>Client: packaged output
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1334 +/- ##
==========================================
- Coverage 87.13% 86.63% -0.50%
==========================================
Files 115 115
Lines 4367 4369 +2
Branches 1015 1015
==========================================
- Hits 3805 3785 -20
- Misses 562 584 +22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Deploying repomix with
|
| Latest commit: |
ff21257
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://57fd9a7f.repomix.pages.dev |
| Branch Preview URL: | https://perf-replace-tiktoken-wasm.repomix.pages.dev |
There was a problem hiding this comment.
Code Review
This pull request replaces the tiktoken library with gpt-tokenizer and migrates token counting from a worker pool to the main thread. It introduces lazy loading for encoding modules and optimizes the packaging process by parallelizing git operations with file collection. Feedback indicates that the fallback logic for special tokens in TokenCounter is incorrectly implemented and suggests removing the now-obsolete metrics worker directory.
| } catch { | ||
| // Fallback: try with allowedSpecial for files containing special tokens | ||
| try { | ||
| const mod = encodingModules.get(this.encodingName); | ||
| if (mod) { | ||
| return mod(content); | ||
| } | ||
| } catch { | ||
| // ignore | ||
| } | ||
|
|
||
| if (filePath) { | ||
| logger.warn(`Failed to count tokens. path: ${filePath}, error: ${message}`); | ||
| logger.warn(`Failed to count tokens. path: ${filePath}`); | ||
| } else { | ||
| logger.warn(`Failed to count tokens. error: ${message}`); | ||
| logger.warn('Failed to count tokens.'); | ||
| } | ||
|
|
||
| return 0; | ||
| } |
There was a problem hiding this comment.
The fallback logic for handling special tokens is not implemented as the comment suggests. The current implementation re-attempts the same failing countTokens function, which will throw an error again. This error is then silently ignored by the inner catch block, causing the function to incorrectly return 0 for files that contain special tokens.
To correctly implement the fallback, you should use the encode function from gpt-tokenizer with the { allowedSpecial: 'all' } option. This requires a small refactor to make the encode function available synchronously within countTokens.
I recommend updating loadEncoding to cache and return an object containing both countTokens and encode functions. This object can then be stored in the TokenCounter instance during init() and used in the fallback path.
} catch (initialError) {
// Fallback: try with allowedSpecial for files containing special tokens
try {
// This assumes `this.tokenizer.encode` is available on the class instance,
// loaded and cached during `init()`.
return this.tokenizer.encode(content, { allowedSpecial: 'all' }).length;
} catch (fallbackError) {
// If the fallback also fails, log both errors for debugging.
const pathInfo = filePath ? `path: ${filePath}, ` : '';
logger.warn(
`Failed to count tokens. ${pathInfo}Initial error: ${initialError}, Fallback error: ${fallbackError}`,
);
}
return 0;
}| import { logger, setLogLevelByWorkerData } from '../../../shared/logger.js'; | ||
| import type { TokenEncoding } from '../TokenCounter.js'; | ||
| import { freeTokenCounters, getTokenCounter } from '../tokenCounterFactory.js'; | ||
|
|
||
| /** | ||
| * Simple token counting worker for metrics calculation. | ||
| * | ||
| * This worker provides a focused interface for counting tokens from text content, | ||
| * using the Tiktoken encoding. All complex metric calculation logic is handled | ||
| * using gpt-tokenizer. All complex metric calculation logic is handled | ||
| * by the calling side to maintain separation of concerns. | ||
| */ |
There was a problem hiding this comment.
Given that this pull request's summary states it 'Eliminates metrics worker pool entirely', this worker file and its containing directory (src/core/metrics/workers) appear to be obsolete.
To fully align with the new single-threaded architecture for metrics calculation and simplify the codebase, please consider removing this file and the workers directory.
| } catch { | ||
| // Fallback: try with allowedSpecial for files containing special tokens | ||
| try { | ||
| const mod = encodingModules.get(this.encodingName); | ||
| if (mod) { | ||
| return mod(content); | ||
| } |
There was a problem hiding this comment.
🟡 Fallback for special tokens retries with identical call, never succeeding
In TokenCounter.countTokens, the outer catch block (line 51) is meant to handle files containing special token sequences like <|endoftext|> by retrying with allowedSpecial. However, the fallback on lines 53-56 retrieves the exact same function from encodingModules (stored at src/core/metrics/TokenCounter.ts:20 during loadEncoding) and calls it with the exact same content argument and no options. Since mod === this.countFn, the retry will always throw the same error.
The old tiktoken code handled this correctly: this.encoding.encode(content, [], []) — the empty arrays disabled special token validation. The gpt-tokenizer equivalent would be passing { allowedSpecial: 'all' } as the second argument to countTokens. Without this, any file containing special token sequences (e.g., tokenizer config files with <|endoftext|>) will silently return 0 tokens instead of being counted correctly.
Prompt for agents
In src/core/metrics/TokenCounter.ts, the catch block at lines 51-60 needs to be fixed. The fallback retry currently calls the same countTokens function with the same arguments, which will always fail identically.
The fix: import the ALL_SPECIAL_TOKENS constant from gpt-tokenizer (exported as a named export from gpt-tokenizer/encoding/<name>) and use it in the fallback call. The inner try block (lines 53-56) should pass allowedSpecial option to the countFn call, like:
return mod(content, { allowedSpecial: new Set(['all']) });
or, if gpt-tokenizer exports an ALL_SPECIAL_TOKENS sentinel:
import { AllSpecialTokens } from 'gpt-tokenizer';
return mod(content, { allowedSpecial: AllSpecialTokens });
You may need to also update the type of countFn in the Map and loadEncoding to accept the second encodeOptions parameter, e.g.:
type CountFn = (text: string, options?: { allowedSpecial?: Set<string> | 'all' }) => number;
This ensures files containing special token sequences like <|endoftext|> get their tokens counted correctly instead of silently returning 0.
Was this helpful? React with 👍 or 👎 to provide feedback.
🔍 Code Review — PR #1334Great PR! Replacing tiktoken WASM with gpt-tokenizer is a clean simplification that removes native dependency complexity and improves performance. The net -705 lines and benchmark improvements are compelling. A few items worth addressing: 1. 🐛 Dead fallback in
|
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (4)
src/core/metrics/workers/calculateMetricsWorker.ts (1)
17-28: This keeps the old token-count worker path alive.
src/shared/unifiedWorker.ts:52-56andsrc/shared/unifiedWorker.ts:98-101still route{ content, encoding }tasks here, so this module preserves a second execution path even though the rest of the refactor moved metrics to the main thread. Please verify that no producer still emitsTokenCountTask, or remove this registration so the main-thread path is the only supported one.Also applies to: 43-50
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/metrics/workers/calculateMetricsWorker.ts` around lines 17 - 28, The TokenCountTask/countTokens worker still remains registered and is being routed from unifiedWorker, preserving an old execution path; either remove the worker registration and any exports like TokenCountTask/countTokens so only the main-thread metric path is used, or find and update all producers that emit TokenCountTask to instead use the main-thread API (the unifiedWorker routing locations around the code that sends { content, encoding }), ensuring no code still produces TokenCountTask; update or delete the registration in this module accordingly and run a project-wide search for TokenCountTask/countTokens to confirm no producers remain.src/core/metrics/tokenCounterFactory.ts (1)
12-17: Cache the in-flight initialization, not only the finished counter.Because Line 16 awaits
init()before Line 17 stores anything intokenCounters, two concurrent calls for the same encoding can each allocate and initialize their ownTokenCounter. That breaks the “one counter per encoding” guarantee and gives back some of the perf/memory win on parallel paths.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/metrics/tokenCounterFactory.ts` around lines 12 - 17, getTokenCounter currently waits for tokenCounter.init() before storing the instance in tokenCounters, allowing concurrent callers to each allocate/init their own TokenCounter; change to cache the in-flight initialization so only one init runs per encoding: when no entry exists in tokenCounters, create a placeholder (e.g., store a Promise or a wrapper) before awaiting init, set that placeholder into tokenCounters immediately, await the init on that stored value, and ensure subsequent calls to getTokenCounter return the same stored Promise/instance; refer to getTokenCounter, tokenCounters, TokenCounter, and TokenCounter.init when making the change.tests/core/metrics/TokenCounter.test.ts (1)
19-71: The new tests are too permissive for a tokenizer swap.Every non-empty case only asserts
> 0, so loading the wrong encoding module or returning a constant positive number would still pass. Please pin a small golden corpus with explicit counts—ideally including a special-token example—so the parity claim in this PR is actually regression-tested.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/metrics/TokenCounter.test.ts` around lines 19 - 71, The tests are too permissive—every non-empty input just asserts > 0—so update the TokenCounter.test.ts cases to use a small golden corpus with deterministic expected token counts: replace the loose assertions in tests that call tokenCounter.countTokens (e.g., the "should correctly count tokens for simple text", "multi-line text", "special characters", "unicode characters", "code snippets", "markdown text", and "very long text" cases) with explicit expect(...).toBe(<expectedCount>) assertions using known counts for each sample string (include at least one sample that contains a special token/example from your tokenizer to assert its exact tokenization), so the tests will fail if the encoding module or tokenization logic changes unexpectedly.tests/core/metrics/calculateGitDiffMetrics.test.ts (1)
108-117: These assertions no longer verify the aggregation logic.
toBeGreaterThan(0)still passes ifcalculateGitDiffMetrics()accidentally drops eitherworkTreeDiffContentorstagedDiffContent. Computing the expected total with the same initialized counter, or injecting a trivial stub counter, keeps the test decoupled from tokenizer internals while still checking the sum.✅ Example of a stronger assertion for the two-diff case
- expect(result).toBeGreaterThan(0); + const counter = await mockGetTokenCounter(); + expect(result).toBe( + counter.countTokens('work tree changes') + counter.countTokens('staged changes'), + );Also applies to: 120-129, 132-141, 144-155
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/metrics/calculateGitDiffMetrics.test.ts` around lines 108 - 117, The test currently only asserts result > 0 which doesn't verify aggregation; instead instantiate or stub the token counters used by mockGetTokenCounter to produce deterministic counts for each diff (e.g., a counter returning N tokens for workTreeDiffContent and M tokens for stagedDiffContent), compute expectedTotal = N + M and assert calculateGitDiffMetrics(..., { getTokenCounter: mockGetTokenCounter }) === expectedTotal; apply the same stronger assertion pattern to the other test cases that cover single/dual diffs (the tests around the blocks referenced by this comment) so they verify explicit sums rather than just > 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/config/configSchema.ts`:
- Around line 125-128: The schema currently uses encoding:
z.string().default('o200k_base').transform((val) => val as TokenEncoding) which
blindly casts any string to TokenEncoding; change the schema to explicitly
validate allowed encodings (e.g., use z.enum(...) or z.union of z.literal(...))
listing the supported TokenEncoding values, remove the unsafe transform/cast,
and ensure the default is one of those literals so invalid encoding strings are
rejected during config parsing (this prevents later failures in
getTokenCounter).
In `@src/core/metrics/calculateGitLogMetrics.ts`:
- Around line 40-42: The catch block in calculateGitLogMetrics currently logs
the error and returns { gitLogTokenCount: 0 }, which masks tokenizer failures
and undercounts packed output; change this to rethrow the error (or throw a new
Error with context) instead of returning zero so callers can handle failures
consistently (same behavior as calculateGitDiffMetrics). Keep the logger.error
call for diagnostics but remove the return and throw the original error (or a
wrapped error) from the catch in calculateGitLogMetrics.
In `@src/core/metrics/TokenCounter.ts`:
- Around line 46-68: The fallback currently re-invokes the same tokenizer and
returns 0, undercounting; update TokenCounter so that when this.countFn(content)
throws you retrieve the encoder module from
encodingModules.get(this.encodingName) and call the module's token-counting API
that enables special-token handling (e.g., the module's countTokens or
equivalent with allowedSpecial/processSpecialTokens options) instead of calling
the same failing function, capture and log the caught error details via
logger.warn including filePath, and only return 0 if the retry also fails.
In `@src/core/packager.ts`:
- Around line 93-111: The background gitPromise (created from deps.getGitDiffs
and deps.getGitLogs) can reject while the file collection (withMemoryLogging +
Promise.all over deps.collectFiles) fails first, producing an unhandled
rejection; fix by ensuring gitPromise is always awaited/consumed regardless of
the collectResults outcome: wrap the collection await in a try/finally (or
attach a .catch) and in the finally await gitPromise (or await
gitPromise.catch(() => {/* swallow/log */})) so errors from
deps.getGitDiffs/deps.getGitLogs are observed; reference the gitPromise variable
and the deps.getGitDiffs/deps.getGitLogs calls and ensure you still use the
resolved [gitDiffResult, gitLogResult] after successful collection.
In `@tests/core/metrics/calculateSelectiveFileMetrics.test.ts`:
- Around line 8-12: The current mockGetTokenCounter ignores the requested
encoding and only asserts counts are > 0, which doesn't verify per-file token
math; replace mockGetTokenCounter with a tiny deterministic fake token counter
that respects the encoding argument (or accepts encoding in its factory) and
returns exact token counts for the test strings (e.g., simple whitespace or
character-based counts) so the test can assert exact expected values instead of
> 0; update the assertions around mockGetTokenCounter and the tests referenced
(including the checks at lines with assertions 29-30 and 33-39) to compare
against those exact expected counts for each file rather than a generic positive
check.
---
Nitpick comments:
In `@src/core/metrics/tokenCounterFactory.ts`:
- Around line 12-17: getTokenCounter currently waits for tokenCounter.init()
before storing the instance in tokenCounters, allowing concurrent callers to
each allocate/init their own TokenCounter; change to cache the in-flight
initialization so only one init runs per encoding: when no entry exists in
tokenCounters, create a placeholder (e.g., store a Promise or a wrapper) before
awaiting init, set that placeholder into tokenCounters immediately, await the
init on that stored value, and ensure subsequent calls to getTokenCounter return
the same stored Promise/instance; refer to getTokenCounter, tokenCounters,
TokenCounter, and TokenCounter.init when making the change.
In `@src/core/metrics/workers/calculateMetricsWorker.ts`:
- Around line 17-28: The TokenCountTask/countTokens worker still remains
registered and is being routed from unifiedWorker, preserving an old execution
path; either remove the worker registration and any exports like
TokenCountTask/countTokens so only the main-thread metric path is used, or find
and update all producers that emit TokenCountTask to instead use the main-thread
API (the unifiedWorker routing locations around the code that sends { content,
encoding }), ensuring no code still produces TokenCountTask; update or delete
the registration in this module accordingly and run a project-wide search for
TokenCountTask/countTokens to confirm no producers remain.
In `@tests/core/metrics/calculateGitDiffMetrics.test.ts`:
- Around line 108-117: The test currently only asserts result > 0 which doesn't
verify aggregation; instead instantiate or stub the token counters used by
mockGetTokenCounter to produce deterministic counts for each diff (e.g., a
counter returning N tokens for workTreeDiffContent and M tokens for
stagedDiffContent), compute expectedTotal = N + M and assert
calculateGitDiffMetrics(..., { getTokenCounter: mockGetTokenCounter }) ===
expectedTotal; apply the same stronger assertion pattern to the other test cases
that cover single/dual diffs (the tests around the blocks referenced by this
comment) so they verify explicit sums rather than just > 0.
In `@tests/core/metrics/TokenCounter.test.ts`:
- Around line 19-71: The tests are too permissive—every non-empty input just
asserts > 0—so update the TokenCounter.test.ts cases to use a small golden
corpus with deterministic expected token counts: replace the loose assertions in
tests that call tokenCounter.countTokens (e.g., the "should correctly count
tokens for simple text", "multi-line text", "special characters", "unicode
characters", "code snippets", "markdown text", and "very long text" cases) with
explicit expect(...).toBe(<expectedCount>) assertions using known counts for
each sample string (include at least one sample that contains a special
token/example from your tokenizer to assert its exact tokenization), so the
tests will fail if the encoding module or tokenization logic changes
unexpectedly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 63ed3617-8c29-4a0e-bbe0-14b3ffd966a7
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (23)
package.jsonsrc/config/configSchema.tssrc/core/metrics/TokenCounter.tssrc/core/metrics/calculateGitDiffMetrics.tssrc/core/metrics/calculateGitLogMetrics.tssrc/core/metrics/calculateMetrics.tssrc/core/metrics/calculateOutputMetrics.tssrc/core/metrics/calculateSelectiveFileMetrics.tssrc/core/metrics/tokenCounterFactory.tssrc/core/metrics/workers/calculateMetricsWorker.tssrc/core/packager.tssrc/core/security/workers/securityCheckWorker.tstests/core/metrics/TokenCounter.test.tstests/core/metrics/calculateGitDiffMetrics.test.tstests/core/metrics/calculateGitLogMetrics.test.tstests/core/metrics/calculateMetrics.test.tstests/core/metrics/calculateOutputMetrics.test.tstests/core/metrics/calculateSelectiveFileMetrics.test.tstests/core/metrics/diffTokenCount.test.tstests/core/packager.test.tstests/core/packager/diffsFunctionality.test.tstests/core/packager/splitOutput.test.tstests/integration-tests/packager.test.ts
💤 Files with no reviewable changes (6)
- tests/core/packager.test.ts
- tests/core/metrics/calculateMetrics.test.ts
- tests/integration-tests/packager.test.ts
- tests/core/packager/diffsFunctionality.test.ts
- tests/core/packager/splitOutput.test.ts
- tests/core/metrics/diffTokenCount.test.ts
| encoding: z | ||
| .string() | ||
| .default('o200k_base') | ||
| .transform((val) => val as TiktokenEncoding), | ||
| .transform((val) => val as TokenEncoding), |
There was a problem hiding this comment.
Validate supported encodings here instead of casting.
z.string().default(...).transform((val) => val as TokenEncoding) still accepts any arbitrary string, so a typo survives config parsing and only fails later when getTokenCounter initializes. Since this PR only supports a fixed encoding set, reject invalid values in the schema.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/config/configSchema.ts` around lines 125 - 128, The schema currently uses
encoding: z.string().default('o200k_base').transform((val) => val as
TokenEncoding) which blindly casts any string to TokenEncoding; change the
schema to explicitly validate allowed encodings (e.g., use z.enum(...) or
z.union of z.literal(...)) listing the supported TokenEncoding values, remove
the unsafe transform/cast, and ensure the default is one of those literals so
invalid encoding strings are rejected during config parsing (this prevents later
failures in getTokenCounter).
| } catch (error) { | ||
| logger.error('Failed to calculate git log metrics:', error); | ||
| return { | ||
| gitLogTokenCount: 0, | ||
| }; | ||
| return { gitLogTokenCount: 0 }; |
There was a problem hiding this comment.
Don't turn tokenizer failures into a zero-token success.
When includeLogs is enabled, returning { gitLogTokenCount: 0 } here undercounts the actual packed output and can break split/limit decisions. This is also inconsistent with src/core/metrics/calculateGitDiffMetrics.ts, which rethrows the same class of failures.
🛠️ Suggested change
} catch (error) {
logger.error('Failed to calculate git log metrics:', error);
- return { gitLogTokenCount: 0 };
+ throw error;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } catch (error) { | |
| logger.error('Failed to calculate git log metrics:', error); | |
| return { | |
| gitLogTokenCount: 0, | |
| }; | |
| return { gitLogTokenCount: 0 }; | |
| } catch (error) { | |
| logger.error('Failed to calculate git log metrics:', error); | |
| throw error; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/metrics/calculateGitLogMetrics.ts` around lines 40 - 42, The catch
block in calculateGitLogMetrics currently logs the error and returns {
gitLogTokenCount: 0 }, which masks tokenizer failures and undercounts packed
output; change this to rethrow the error (or throw a new Error with context)
instead of returning zero so callers can handle failures consistently (same
behavior as calculateGitDiffMetrics). Keep the logger.error call for diagnostics
but remove the return and throw the original error (or a wrapped error) from the
catch in calculateGitLogMetrics.
| try { | ||
| // Disable special token validation to handle files that may contain | ||
| // special token sequences (e.g., tokenizer configs with <|endoftext|>). | ||
| // This treats special tokens as ordinary text rather than control tokens, | ||
| // which is appropriate for general code/text analysis where we're not | ||
| // actually sending the content to an LLM API. | ||
| return this.encoding.encode(content, [], []).length; | ||
| } catch (error) { | ||
| let message = ''; | ||
| if (error instanceof Error) { | ||
| message = error.message; | ||
| } else { | ||
| message = String(error); | ||
| // Call countTokens without options to avoid processSpecialTokens overhead. | ||
| // Files with special token sequences (<|endoftext|> etc.) are rare (~0.1%) | ||
| // and handled via try-catch fallback. | ||
| return this.countFn(content); | ||
| } catch { | ||
| // Fallback: try with allowedSpecial for files containing special tokens | ||
| try { | ||
| const mod = encodingModules.get(this.encodingName); | ||
| if (mod) { | ||
| return mod(content); | ||
| } | ||
| } catch { | ||
| // ignore | ||
| } | ||
|
|
||
| if (filePath) { | ||
| logger.warn(`Failed to count tokens. path: ${filePath}, error: ${message}`); | ||
| logger.warn(`Failed to count tokens. path: ${filePath}`); | ||
| } else { | ||
| logger.warn(`Failed to count tokens. error: ${message}`); | ||
| logger.warn('Failed to count tokens.'); | ||
| } | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
The “fallback” path cannot recover from the original tokenizer error.
Lines 54-56 pull the same function back out of encodingModules and call it with the same input that already failed on Line 50. For any file that hits this path, we silently return 0, which undercounts totals and top-file metrics instead of handling the edge case described in the comment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/metrics/TokenCounter.ts` around lines 46 - 68, The fallback
currently re-invokes the same tokenizer and returns 0, undercounting; update
TokenCounter so that when this.countFn(content) throws you retrieve the encoder
module from encodingModules.get(this.encodingName) and call the module's
token-counting API that enables special-token handling (e.g., the module's
countTokens or equivalent with allowedSpecial/processSpecialTokens options)
instead of calling the same failing function, capture and log the caught error
details via logger.warn including filePath, and only return 0 if the retry also
fails.
| // Start git diffs/logs in parallel with file collection - they only need rootDirs and config, | ||
| // not file contents, so there's no dependency between them | ||
| progressCallback('Collecting files and git info...'); | ||
| const gitPromise = Promise.all([deps.getGitDiffs(rootDirs, config), deps.getGitLogs(rootDirs, config)]); | ||
|
|
||
| const collectResults = await withMemoryLogging( | ||
| 'Collect Files', | ||
| async () => | ||
| await Promise.all( | ||
| sortedFilePathsByDir.map(({ rootDir, filePaths }) => | ||
| deps.collectFiles(filePaths, rootDir, config, progressCallback), | ||
| ), | ||
| ); | ||
| ), | ||
| ); | ||
|
|
||
| const rawFiles = collectResults.flatMap((curr) => curr.rawFiles); | ||
| const allSkippedFiles = collectResults.flatMap((curr) => curr.skippedFiles); | ||
| const rawFiles = collectResults.flatMap((curr) => curr.rawFiles); | ||
| const allSkippedFiles = collectResults.flatMap((curr) => curr.skippedFiles); | ||
|
|
||
| // Get git diffs if enabled - run this before security check | ||
| progressCallback('Getting git diffs...'); | ||
| const gitDiffResult = await deps.getGitDiffs(rootDirs, config); | ||
| const [gitDiffResult, gitLogResult] = await gitPromise; |
There was a problem hiding this comment.
Drain the background git work when collection fails.
Line 96 starts a rejecting gitPromise, but Line 111 is the first place it is awaited. If Line 101's collection Promise.all fails first, an error from getGitDiffs() or getGitLogs() is never observed and can become an unhandled rejection.
💡 One way to keep the overlap without orphaning rejections
- const gitPromise = Promise.all([deps.getGitDiffs(rootDirs, config), deps.getGitLogs(rootDirs, config)]);
+ const gitPromise = Promise.allSettled([deps.getGitDiffs(rootDirs, config), deps.getGitLogs(rootDirs, config)]);
@@
- const [gitDiffResult, gitLogResult] = await gitPromise;
+ const [gitDiffResultResult, gitLogResultResult] = await gitPromise;
+ if (gitDiffResultResult.status === 'rejected') throw gitDiffResultResult.reason;
+ if (gitLogResultResult.status === 'rejected') throw gitLogResultResult.reason;
+ const gitDiffResult = gitDiffResultResult.value;
+ const gitLogResult = gitLogResultResult.value;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/packager.ts` around lines 93 - 111, The background gitPromise
(created from deps.getGitDiffs and deps.getGitLogs) can reject while the file
collection (withMemoryLogging + Promise.all over deps.collectFiles) fails first,
producing an unhandled rejection; fix by ensuring gitPromise is always
awaited/consumed regardless of the collectResults outcome: wrap the collection
await in a try/finally (or attach a .catch) and in the finally await gitPromise
(or await gitPromise.catch(() => {/* swallow/log */})) so errors from
deps.getGitDiffs/deps.getGitLogs are observed; reference the gitPromise variable
and the deps.getGitDiffs/deps.getGitLogs calls and ensure you still use the
resolved [gitDiffResult, gitLogResult] after successful collection.
| const mockGetTokenCounter = async () => { | ||
| const counter = new TokenCounter('o200k_base'); | ||
| await counter.init(); | ||
| return counter; | ||
| }; |
There was a problem hiding this comment.
These assertions no longer prove the per-file token math.
mockGetTokenCounter ignores the requested encoding, and checking only > 0 means a regression that hard-codes the encoding or returns the same positive count for every file would still pass. Use a tiny fake counter or exact expected counts so this remains a real metrics test.
Also applies to: 29-30, 33-39
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/core/metrics/calculateSelectiveFileMetrics.test.ts` around lines 8 -
12, The current mockGetTokenCounter ignores the requested encoding and only
asserts counts are > 0, which doesn't verify per-file token math; replace
mockGetTokenCounter with a tiny deterministic fake token counter that respects
the encoding argument (or accepts encoding in its factory) and returns exact
token counts for the test strings (e.g., simple whitespace or character-based
counts) so the test can assert exact expected values instead of > 0; update the
assertions around mockGetTokenCounter and the tests referenced (including the
checks at lines with assertions 29-30 and 33-39) to compare against those exact
expected counts for each file rather than a generic positive check.
Summary
Replace tiktoken (WASM-based) with gpt-tokenizer (pure JavaScript) for token counting, and move all token counting from worker threads to the main thread.
Benchmark results (15 runs each, packing repomix repo)
Additional benefits
Checklist
npm run testnpm run lint