39 changes: 26 additions & 13 deletions src/core/file/fileRead.ts
@@ -1,10 +1,20 @@
import * as fs from 'node:fs/promises';
-import iconv from 'iconv-lite';
import isBinaryPath from 'is-binary-path';
import { isBinaryFile } from 'isbinaryfile';
-import jschardet from 'jschardet';
import { logger } from '../../shared/logger.js';

+// Lazy-load encoding detection libraries to avoid their ~25ms combined import cost.
+// The fast UTF-8 path (covers ~99% of source code files) never needs these;
+// they are only loaded when a file fails UTF-8 decoding.
+let _jschardet: typeof import('jschardet') | undefined;
+let _iconv: typeof import('iconv-lite') | undefined;
+const getEncodingDeps = async () => {
+  if (!_jschardet || !_iconv) {
+    [_jschardet, _iconv] = await Promise.all([import('jschardet'), import('iconv-lite')]);
+  }
+  return { jschardet: _jschardet, iconv: _iconv };
+};
Comment on lines +9 to +16 (Contributor, severity: medium)

The current implementation of getEncodingDeps has a potential race condition. If multiple concurrent calls to readRawFile trigger the slow path (non-UTF-8 files) before the first import() completes, the modules will be imported multiple times. Additionally, the return type of the function includes undefined for both dependencies, which will cause TypeScript errors or runtime crashes when calling encodingDeps.jschardet.detect() at line 73.

Using a single promise to manage the lazy loading ensures that the modules are only loaded once and provides a clean, non-nullable return type for the caller.

let encodingDepsPromise: Promise<{
  jschardet: typeof import('jschardet');
  iconv: typeof import('iconv-lite');
}> | undefined;

const getEncodingDeps = async () => {
  if (!encodingDepsPromise) {
    encodingDepsPromise = Promise.all([
      import('jschardet'),
      import('iconv-lite'),
    ]).then(([jschardet, iconv]) => ({
      jschardet: (jschardet as any).default || jschardet,
      iconv: (iconv as any).default || iconv,
    }));
  }
  return encodingDepsPromise;
};
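The benefit of the suggested pattern can be sketched in isolation: memoizing the promise, rather than the resolved modules, means the initializer runs exactly once even when several callers race into the slow path before the first load resolves. All names below are hypothetical stand-ins, not the PR's code.

```typescript
// Hypothetical stand-in for the dynamic import()s; counts initializations.
let initCount = 0;
let depsPromise: Promise<{ ready: boolean }> | undefined;

const loadDeps = (): Promise<{ ready: boolean }> => {
  if (!depsPromise) {
    // The assignment happens synchronously, before any await, so a second
    // caller arriving while the load is in flight reuses the same promise.
    depsPromise = (async () => {
      initCount += 1;
      return { ready: true };
    })();
  }
  return depsPromise;
};

const demo = async (): Promise<number> => {
  // Three concurrent callers race into loadDeps before anything resolves.
  await Promise.all([loadDeps(), loadDeps(), loadDeps()]);
  return initCount;
};
```

With the original nullable-variable check, all three concurrent calls would observe `undefined` and each trigger its own import; here `demo()` resolves to 1.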


export type FileSkipReason = 'binary-extension' | 'binary-content' | 'size-limit' | 'encoding-error';

export interface FileReadResult {
@@ -20,25 +30,26 @@ export interface FileReadResult {
*/
export const readRawFile = async (filePath: string, maxFileSize: number): Promise<FileReadResult> => {
  try {
-    // Check binary extension first (no I/O needed) to skip stat + read for binary files
+    // Check binary extension first (no I/O needed) to skip read for binary files
    if (isBinaryPath(filePath)) {
      logger.debug(`Skipping binary file: ${filePath}`);
      return { content: null, skippedReason: 'binary-extension' };
    }

-    const stats = await fs.stat(filePath);
+    logger.trace(`Reading file: ${filePath}`);

+    // Read the file directly and check size afterward, avoiding a separate stat() syscall.
+    // This halves the number of I/O operations per file.
+    // Files exceeding maxFileSize are rare, so the occasional oversized read is acceptable.
+    const buffer = await fs.readFile(filePath);
Contributor comment (severity: medium)

Removing the fs.stat check before fs.readFile introduces a risk of high memory consumption. While it reduces the number of syscalls, any file exceeding maxFileSize (e.g., a large log file, dataset, or binary not caught by extension checks) will now be fully loaded into memory before the size check occurs. In extreme cases, this could lead to OutOfMemoryError or process crashes if the file is very large (e.g., several GBs).

Consider if the performance gain of one less stat syscall (which is metadata-only and very fast) outweighs the safety of the early size check, especially for a tool that might be run on diverse and potentially large repositories.


-    if (stats.size > maxFileSize) {
-      const sizeKB = (stats.size / 1024).toFixed(1);
+    if (buffer.length > maxFileSize) {
+      const sizeKB = (buffer.length / 1024).toFixed(1);
      const maxSizeKB = (maxFileSize / 1024).toFixed(1);
      logger.trace(`File exceeds size limit: ${sizeKB}KB > ${maxSizeKB}KB (${filePath})`);
      return { content: null, skippedReason: 'size-limit' };
    }

-    logger.trace(`Reading file: ${filePath}`);
-
-    const buffer = await fs.readFile(filePath);
-
    if (await isBinaryFile(buffer)) {
      logger.debug(`Skipping binary file (content check): ${filePath}`);
      return { content: null, skippedReason: 'binary-content' };
@@ -58,9 +69,11 @@ export const readRawFile = async (filePath: string, maxFileSize: number): Promise<FileReadResult> => {
    }

    // Slow path: Detect encoding with jschardet for non-UTF-8 files (e.g., Shift-JIS, EUC-KR)
-    const { encoding: detectedEncoding } = jschardet.detect(buffer) ?? {};
-    const encoding = detectedEncoding && iconv.encodingExists(detectedEncoding) ? detectedEncoding : 'utf-8';
-    const content = iconv.decode(buffer, encoding, { stripBOM: true });
+    const encodingDeps = await getEncodingDeps();
+    const { encoding: detectedEncoding } = encodingDeps.jschardet.detect(buffer) ?? {};
+    const encoding =
+      detectedEncoding && encodingDeps.iconv.encodingExists(detectedEncoding) ? detectedEncoding : 'utf-8';
+    const content = encodingDeps.iconv.decode(buffer, encoding, { stripBOM: true });

    if (content.includes('\uFFFD')) {
      logger.debug(`Skipping file due to encoding errors (detected: ${encoding}): ${filePath}`);
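The `'\uFFFD'` guard works because decoders substitute U+FFFD, the Unicode replacement character, for byte sequences they cannot map; its presence after decoding signals a mis-detected encoding. A minimal illustration, using `Buffer#toString` rather than iconv-lite (which by default uses the same replacement character):

```typescript
// Buffer#toString('utf-8') replaces undecodable bytes with U+FFFD, so the
// replacement character's presence signals that the bytes did not match the
// assumed encoding.
const decodesCleanly = (buffer: Buffer): boolean =>
  !buffer.toString('utf-8').includes('\uFFFD');
```

One caveat, which applies to the PR's check as well: a file that legitimately contains U+FFFD in its text will also be skipped as an encoding error.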