14 changes: 7 additions & 7 deletions package-lock.json

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion package.json
@@ -86,6 +86,7 @@
"fast-xml-parser": "^5.4.1",
"git-url-parse": "^16.1.0",
"globby": "^16.1.1",
"gpt-tokenizer": "^3.4.0",
🔴 README.md not updated after tiktoken → gpt-tokenizer migration (CONTRIBUTING.md violation)

CONTRIBUTING.md requires: "You have updated relevant documentation (especially README.md) if you've added or changed functionality." This PR replaces tiktoken with gpt-tokenizer but does not update the README.md, which still contains two now-incorrect references to tiktoken:

  • README.md:1360 describes tokenCount.encoding as using "OpenAI's tiktoken tokenizer" and links to tiktoken's GitHub/model.py.
  • README.md:1791 lists tiktoken as an external bundling dependency that "Loads WASM files dynamically at runtime" — but gpt-tokenizer is pure JavaScript and does not use WASM.

Both references are factually incorrect after this change and will mislead users.

Prompt for agents
Update README.md in two places to reflect the migration from tiktoken to gpt-tokenizer:

1. README.md line 1360: Change the tokenCount.encoding description from referencing tiktoken to referencing gpt-tokenizer. Replace the tiktoken links with appropriate gpt-tokenizer references. For example: "Token count encoding (e.g., o200k_base for GPT-4o, cl100k_base for GPT-4/3.5)."

2. README.md line 1791: Change the external bundling dependency from "tiktoken - Loads WASM files dynamically at runtime" to "gpt-tokenizer - Loads encoding data files at runtime" (since gpt-tokenizer is pure JS, not WASM-based).

"handlebars": "^4.7.8",
"iconv-lite": "^0.7.0",
"is-binary-path": "^3.0.0",
@@ -97,7 +98,6 @@
"minimatch": "^10.2.4",
"picocolors": "^1.1.1",
"tar": "^7.5.9",
"tiktoken": "^1.0.22",
"tinypool": "^2.1.0",
"web-tree-sitter": "^0.26.6",
"zod": "^4.3.6"
7 changes: 2 additions & 5 deletions src/config/configSchema.ts
@@ -1,5 +1,5 @@
import type { TiktokenEncoding } from 'tiktoken';
import { z } from 'zod';
import { tokenEncodings } from '../core/metrics/tokenEncoding.js';

// Output style enum
export const repomixOutputStyleSchema = z.enum(['xml', 'markdown', 'json', 'plain']);
@@ -122,10 +122,7 @@ export const repomixConfigDefaultSchema = z.object({
enableSecurityCheck: z.boolean().default(true),
}),
tokenCount: z.object({
encoding: z
.string()
.default('o200k_base')
.transform((val) => val as TiktokenEncoding),
encoding: z.enum(tokenEncodings).default('o200k_base'),
}),
});

32 changes: 23 additions & 9 deletions src/core/metrics/TokenCounter.ts
@@ -1,19 +1,32 @@
import { get_encoding, type Tiktoken, type TiktokenEncoding } from 'tiktoken';
import { GptEncoding } from 'gpt-tokenizer/GptEncoding';
import { resolveEncodingAsync } from 'gpt-tokenizer/resolveEncodingAsync';
import { logger } from '../../shared/logger.js';
import type { TokenEncoding } from './tokenEncoding.js';
yamadashy marked this conversation as resolved.

export class TokenCounter {
private encoding: Tiktoken;
private encoding: GptEncoding;

constructor(encodingName: TiktokenEncoding) {
private constructor(encoding: GptEncoding) {
this.encoding = encoding;
}

/**
* Create a TokenCounter instance asynchronously.
* Uses dynamic import to load only the required BPE encoding data,
* avoiding the cost of loading all encodings (~4MB) on every worker.
*/
public static async create(encodingName: TokenEncoding): Promise<TokenCounter> {
const startTime = process.hrtime.bigint();

// Setup encoding with the specified model
this.encoding = get_encoding(encodingName);
const ranks = await resolveEncodingAsync(encodingName);
const encoding = GptEncoding.getEncodingApi(encodingName, () => ranks);

const endTime = process.hrtime.bigint();
const initTime = Number(endTime - startTime) / 1e6; // Convert to milliseconds

logger.debug(`TokenCounter initialization took ${initTime.toFixed(2)}ms`);

return new TokenCounter(encoding);
}

public countTokens(content: string, filePath?: string): number {
@@ -23,7 +36,7 @@ export class TokenCounter {
// This treats special tokens as ordinary text rather than control tokens,
// which is appropriate for general code/text analysis where we're not
// actually sending the content to an LLM API.
return this.encoding.encode(content, [], []).length;
return this.encoding.encode(content, { disallowedSpecial: new Set() }).length;
} catch (error) {
let message = '';
if (error instanceof Error) {
@@ -42,7 +55,8 @@
}
}

public free(): void {
this.encoding.free();
}
// No-op retained for public API backward compatibility.
// gpt-tokenizer is pure JavaScript — memory is managed by GC,
// unlike tiktoken which required explicit WASM resource cleanup.
public free(): void {}
}
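The `private constructor` plus static async `create` pattern introduced above can be sketched in isolation. This is a hedged illustration, not the real implementation: `LazyCounter` and `loadRanks` are hypothetical stand-ins for `TokenCounter` and gpt-tokenizer's `resolveEncodingAsync`, which in the real code dynamically imports per-encoding BPE data.

```typescript
// Sketch of the private-constructor / static-async-factory pattern.
// Async work (loading encoding data) happens in create(), never in the
// constructor, so callers cannot obtain a half-initialized instance.
class LazyCounter {
  private constructor(private readonly ranks: Map<string, number>) {}

  static async create(encodingName: string): Promise<LazyCounter> {
    const ranks = await loadRanks(encodingName);
    return new LazyCounter(ranks);
  }

  rankCount(): number {
    return this.ranks.size;
  }
}

// Hypothetical loader: returns a tiny fake table. The real o200k_base
// BPE table has on the order of 200k entries and is loaded on demand.
async function loadRanks(name: string): Promise<Map<string, number>> {
  return new Map([
    [`${name}:a`, 0],
    [`${name}:b`, 1],
  ]);
}

LazyCounter.create('o200k_base').then((counter) => {
  console.log(counter.rankCount()); // 2
});
```

Because `create` is the only way to construct an instance, the expensive load is impossible to skip accidentally, which is the property the worker-level caching in `tokenCounterFactory.ts` relies on.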
4 changes: 2 additions & 2 deletions src/core/metrics/calculateOutputMetrics.ts
@@ -1,14 +1,14 @@
import type { TiktokenEncoding } from 'tiktoken';
import { logger } from '../../shared/logger.js';
import type { TaskRunner } from '../../shared/processConcurrency.js';
import type { TokenEncoding } from './tokenEncoding.js';
import type { TokenCountTask } from './workers/calculateMetricsWorker.js';

const CHUNK_SIZE = 1000;
const MIN_CONTENT_LENGTH_FOR_PARALLEL = 1_000_000; // 1000KB

export const calculateOutputMetrics = async (
content: string,
encoding: TiktokenEncoding,
encoding: TokenEncoding,
path: string | undefined,
deps: { taskRunner: TaskRunner<TokenCountTask, number> },
): Promise<number> => {
4 changes: 2 additions & 2 deletions src/core/metrics/calculateSelectiveFileMetrics.ts
@@ -1,16 +1,16 @@
import pc from 'picocolors';
import type { TiktokenEncoding } from 'tiktoken';
import { logger } from '../../shared/logger.js';
import type { TaskRunner } from '../../shared/processConcurrency.js';
import type { RepomixProgressCallback } from '../../shared/types.js';
import type { ProcessedFile } from '../file/fileTypes.js';
import type { TokenEncoding } from './tokenEncoding.js';
import type { TokenCountTask } from './workers/calculateMetricsWorker.js';
import type { FileMetrics } from './workers/types.js';

export const calculateSelectiveFileMetrics = async (
processedFiles: ProcessedFile[],
targetFilePaths: string[],
tokenCounterEncoding: TiktokenEncoding,
tokenCounterEncoding: TokenEncoding,
progressCallback: RepomixProgressCallback,
deps: { taskRunner: TaskRunner<TokenCountTask, number> },
): Promise<FileMetrics[]> => {
22 changes: 12 additions & 10 deletions src/core/metrics/tokenCounterFactory.ts
@@ -1,31 +1,33 @@
import type { TiktokenEncoding } from 'tiktoken';
import { logger } from '../../shared/logger.js';
import { TokenCounter } from './TokenCounter.js';
import type { TokenEncoding } from './tokenEncoding.js';

// Worker-level cache for TokenCounter instances by encoding
const tokenCounters = new Map<TiktokenEncoding, TokenCounter>();
const tokenCounters = new Map<TokenEncoding, TokenCounter>();

/**
* Get or create a TokenCounter instance for the given encoding.
* This ensures only one TokenCounter exists per encoding per worker thread to optimize memory usage.
*/
export const getTokenCounter = (encoding: TiktokenEncoding): TokenCounter => {
export const getTokenCounter = async (encoding: TokenEncoding): Promise<TokenCounter> => {
let tokenCounter = tokenCounters.get(encoding);
if (!tokenCounter) {
tokenCounter = new TokenCounter(encoding);
tokenCounters.set(encoding, tokenCounter);
tokenCounter = await TokenCounter.create(encoding);
// Guard against concurrent calls: only set if no other call populated the cache
if (!tokenCounters.has(encoding)) {
tokenCounters.set(encoding, tokenCounter);
} else {
tokenCounter = tokenCounters.get(encoding)!;
}
}
return tokenCounter;
yamadashy marked this conversation as resolved.
};

/**
* Free all TokenCounter resources and clear the cache.
* Clear all TokenCounter instances from the cache.
* This should be called when the worker is terminating.
*/
export const freeTokenCounters = (): void => {
for (const [encoding, tokenCounter] of tokenCounters.entries()) {
tokenCounter.free();
logger.debug(`Freed TokenCounter resources for encoding: ${encoding}`);
}
tokenCounters.clear();
logger.debug('Cleared TokenCounter cache');
};
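The concurrency guard in `getTokenCounter` can be demonstrated with a self-contained sketch. Note this uses hypothetical names (`Counter`, `createCounter`, `getCounter`) in place of the real factory: two concurrent callers may both invoke the async factory, but both resolve to the single cached instance, and discarding the loser is safe precisely because `free()` is now a no-op.

```typescript
// Sketch of the async get-or-create cache with a post-await guard.
class Counter {
  constructor(public readonly encoding: string) {}
}

const cache = new Map<string, Counter>();

async function createCounter(encoding: string): Promise<Counter> {
  return new Counter(encoding); // stands in for TokenCounter.create
}

async function getCounter(encoding: string): Promise<Counter> {
  let counter = cache.get(encoding);
  if (!counter) {
    counter = await createCounter(encoding);
    // While we awaited, a concurrent call may have populated the cache.
    const existing = cache.get(encoding);
    if (existing) {
      counter = existing; // drop ours; no cleanup needed (free is a no-op)
    } else {
      cache.set(encoding, counter);
    }
  }
  return counter;
}

async function demo(): Promise<void> {
  // Both calls start before either has written to the cache,
  // yet both resolve to the same instance.
  const [a, b] = await Promise.all([
    getCounter('o200k_base'),
    getCounter('o200k_base'),
  ]);
  console.log(a === b); // true
}
demo();
```

The factory may run twice under a race, so this guards cache consistency rather than deduplicating work; a promise-keyed cache (`Map<string, Promise<Counter>>`) would also collapse the duplicate load if that ever mattered.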
14 changes: 14 additions & 0 deletions src/core/metrics/tokenEncoding.ts
@@ -0,0 +1,14 @@
/**
* Supported token encoding names.
* These match the encoding names supported by gpt-tokenizer.
*/
export const tokenEncodings = [
'o200k_base',
'o200k_harmony',
'cl100k_base',
'p50k_base',
'p50k_edit',
'r50k_base',
] as const;

export type TokenEncoding = (typeof tokenEncodings)[number];
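The `as const` tuple above serves double duty: it is a runtime whitelist (fed to `z.enum` in `configSchema.ts`) and the source of the compile-time `TokenEncoding` union. A minimal sketch of that pattern, using a hypothetical `isTokenEncoding` type guard in place of the zod schema:

```typescript
// A readonly tuple of literal strings...
const tokenEncodings = [
  'o200k_base',
  'o200k_harmony',
  'cl100k_base',
  'p50k_base',
  'p50k_edit',
  'r50k_base',
] as const;

// ...yields a union type via indexed access:
// 'o200k_base' | 'o200k_harmony' | 'cl100k_base' | ...
type TokenEncoding = (typeof tokenEncodings)[number];

// Runtime check that narrows string -> TokenEncoding for the compiler.
function isTokenEncoding(value: string): value is TokenEncoding {
  return (tokenEncodings as readonly string[]).includes(value);
}

console.log(isTokenEncoding('o200k_base')); // true
console.log(isTokenEncoding('gpt2'));       // false
```

Keeping the list in one module means adding an encoding updates the type, the validator, and any exhaustive switch in a single place.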
12 changes: 6 additions & 6 deletions src/core/metrics/workers/calculateMetricsWorker.ts
@@ -1,13 +1,13 @@
import type { TiktokenEncoding } from 'tiktoken';
import { logger, setLogLevelByWorkerData } from '../../../shared/logger.js';
import { freeTokenCounters, getTokenCounter } from '../tokenCounterFactory.js';
import type { TokenEncoding } from '../tokenEncoding.js';

/**
* Simple token counting worker for metrics calculation.
*
* This worker provides a focused interface for counting tokens from text content,
* using the Tiktoken encoding. All complex metric calculation logic is handled
* by the calling side to maintain separation of concerns.
* This worker provides a focused interface for counting tokens from text content.
* All complex metric calculation logic is handled by the calling side to maintain
* separation of concerns.
*/

// Initialize logger configuration from workerData at module load time
@@ -16,15 +16,15 @@ setLogLevelByWorkerData();

export interface TokenCountTask {
content: string;
encoding: TiktokenEncoding;
encoding: TokenEncoding;
path?: string;
}

export const countTokens = async (task: TokenCountTask): Promise<number> => {
const processStartAt = process.hrtime.bigint();

try {
const counter = getTokenCounter(task.encoding);
const counter = await getTokenCounter(task.encoding);
const tokenCount = counter.countTokens(task.content, task.path);

logger.trace(`Counted tokens. Count: ${tokenCount}. Took: ${getProcessDuration(processStartAt)}ms`);