Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
201 changes: 180 additions & 21 deletions src/core/file/fileSearch.ts
Original file line number Diff line number Diff line change
@@ -1,7 +1,12 @@
import { execFile } from 'node:child_process';
import type { Stats } from 'node:fs';
import { lstatSync } from 'node:fs';
import fs from 'node:fs/promises';
import path from 'node:path';
import { promisify } from 'node:util';
import { type Options as GlobbyOptions, globby } from 'globby';
import ignore from 'ignore';
import { Minimatch } from 'minimatch';
import type { RepomixConfigMerged } from '../../config/configSchema.js';
import { defaultIgnoreList } from '../../config/defaultIgnore.js';
import { RepomixError } from '../../shared/errorHandle.js';
Expand All @@ -10,6 +15,8 @@ import { sortPaths } from './filePathSort.js';

import { checkDirectoryPermissions, PermissionError } from './permissionCheck.js';

const execFileAsync = promisify(execFile);

export interface FileSearchResult {
filePaths: string[];
emptyDirPaths: string[];
Expand Down Expand Up @@ -85,6 +92,113 @@ export const normalizeGlobPattern = (pattern: string): string => {
return pattern;
};

/**
* Fast file enumeration using `git ls-files` for git repositories.
*
* `git ls-files --cached --others --exclude-standard` reads from the
* pre-built git index (~5ms) instead of walking the filesystem (~250ms
* with globby). The result is then post-filtered through the `ignore`
* package (for default/custom/repomixignore patterns) and `picomatch`
* (for include patterns) to produce the same file set as globby.
*
* Returns `null` when the fast path is not applicable (not a git repo,
* `useGitignore` disabled, or git command failure), signalling the
* caller to fall back to globby.
*/
Comment on lines +95 to +107
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Docstring says picomatch, but the implementation uses minimatch.

Small inconsistency in the new docstring — line 101 references picomatch, while the actual matcher below (line 195) is new Minimatch(...). Please update the comment to avoid future confusion.

- * package (for default/custom/repomixignore patterns) and `picomatch`
+ * package (for default/custom/repomixignore patterns) and `minimatch`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/**
* Fast file enumeration using `git ls-files` for git repositories.
*
* `git ls-files --cached --others --exclude-standard` reads from the
* pre-built git index (~5ms) instead of walking the filesystem (~250ms
* with globby). The result is then post-filtered through the `ignore`
* package (for default/custom/repomixignore patterns) and `picomatch`
* (for include patterns) to produce the same file set as globby.
*
* Returns `null` when the fast path is not applicable (not a git repo,
* `useGitignore` disabled, or git command failure), signalling the
* caller to fall back to globby.
*/
/**
* Fast file enumeration using `git ls-files` for git repositories.
*
* `git ls-files --cached --others --exclude-standard` reads from the
* pre-built git index (~5ms) instead of walking the filesystem (~250ms
* with globby). The result is then post-filtered through the `ignore`
* package (for default/custom/repomixignore patterns) and `minimatch`
* (for include patterns) to produce the same file set as globby.
*
* Returns `null` when the fast path is not applicable (not a git repo,
* `useGitignore` disabled, or git command failure), signalling the
* caller to fall back to globby.
*/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/file/fileSearch.ts` around lines 95 - 107, The docstring in
src/core/file/fileSearch.ts incorrectly mentions "picomatch" while the
implementation uses Minimatch; update the comment so it references the actual
matcher (minimatch/Minimatch) used by the code instead of picomatch to avoid
confusion, e.g., replace the "picomatch (for include patterns)" text with
"minimatch (Minimatch) (for include patterns)" and ensure any wording aligns
with the matcher instantiation found in the file (new Minimatch(...)).

const searchFilesGitFastPath = async (
rootDir: string,
includePatterns: string[],
adjustedIgnorePatterns: string[],
ignoreFilePatterns: string[],
): Promise<string[] | null> => {
// Run git ls-files to get all non-gitignored files.
// --cached: tracked files in the index
// --others: untracked files not in .gitignore
// --exclude-standard: apply .gitignore + .git/info/exclude + global gitignore
// -z: NUL-separated output (handles filenames with special chars)
//
// Uses async execFile to keep the event loop free during the git execution,
// allowing concurrent async work (metrics warmup, prefetch sort data, git
// diff/log subprocesses) to make progress.
let stdout: string;
try {
const result = await execFileAsync('git', ['ls-files', '--cached', '--others', '--exclude-standard', '-z'], {
cwd: rootDir,
maxBuffer: 100 * 1024 * 1024,
encoding: 'utf-8',
});
stdout = result.stdout;
} catch {
logger.trace('git ls-files failed, falling back to globby');
return null;
}

// Deduplicate: git ls-files can list a file multiple times during
// merge conflicts (one entry per conflict stage).
let files = [...new Set(stdout.split('\0').filter(Boolean))];
logger.trace(`git ls-files returned ${files.length} files`);

// Filter out symlinks and non-regular files to match globby's behavior
// (followSymbolicLinks:false + onlyFiles:true reports symlinks as non-files).
files = files.filter((f) => {
try {
return lstatSync(path.join(rootDir, f)).isFile();
} catch {
return false;
}
});
Comment on lines +143 to +149
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using lstatSync in a loop for every file returned by git ls-files can be a significant performance bottleneck, especially in large repositories with thousands of files. This synchronous I/O blocks the Node.js event loop, which partially defeats the purpose of using async execFile to keep the event loop free. Consider using fs.lstat (async) with Promise.all to perform these checks concurrently.

Suggested change
files = files.filter((f) => {
try {
return lstatSync(path.join(rootDir, f)).isFile();
} catch {
return false;
}
});
const fileStats = await Promise.all(
files.map(async (f) => {
try {
const stats = await fs.lstat(path.join(rootDir, f));
return stats.isFile() ? f : null;
} catch {
return null;
}
}),
);
files = fileStats.filter((f): f is string => f !== null);

Comment on lines +141 to +149
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Sync lstatSync in a per-file loop can block the event loop.

This filter calls lstatSync once per candidate file on the main thread. For a repo with, say, 10k tracked files that's 10k synchronous filesystem syscalls executed back-to-back, which partially defeats this PR's stated goal of keeping the event loop free so getGitDiffs/getGitLogs subprocesses and metricsWarmupPromise can make progress concurrently with file enumeration.

Using async fs.lstat with bounded concurrency (or Promise.all for moderate sizes) lets the concurrent git/metrics work actually be scheduled while stat syscalls are in flight.

♻️ Proposed async variant
-  // Filter out symlinks and non-regular files to match globby's behavior
-  // (followSymbolicLinks:false + onlyFiles:true reports symlinks as non-files).
-  files = files.filter((f) => {
-    try {
-      return lstatSync(path.join(rootDir, f)).isFile();
-    } catch {
-      return false;
-    }
-  });
+  // Filter out symlinks and non-regular files to match globby's behavior
+  // (followSymbolicLinks:false + onlyFiles:true reports symlinks as non-files).
+  const isRegularFile = await Promise.all(
+    files.map(async (f) => {
+      try {
+        return (await fs.lstat(path.join(rootDir, f))).isFile();
+      } catch {
+        return false;
+      }
+    }),
+  );
+  files = files.filter((_, i) => isRegularFile[i]);

The corresponding lstatSync / node:fs import becomes unused; drop it and rely on fs.lstat from node:fs/promises.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/file/fileSearch.ts` around lines 141 - 149, The synchronous
lstatSync loop in src/core/file/fileSearch.ts blocks the event loop for large
repos; replace the files = files.filter(...) block with an async filtering step
that uses fs.promises.lstat (or node:fs/promises.lstat) and bounded concurrency
(e.g., a small concurrency pool or Promise.allSettled for moderate sizes) to
test isFile() for each path, await the results, and then produce the filtered
files array; also remove the now-unused lstatSync import. Target the code around
the files variable and the current filter callback to perform the async lstat
checks and handle errors by excluding non-files.


// Build an ignore filter combining default + custom patterns and
// patterns from .repomixignore / .ignore files.
const ig = ignore();
ig.add(adjustedIgnorePatterns);

Comment on lines +153 to +155
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Git fast path over-filters files: ignore package treats slash-less patterns as matching at any depth, unlike globby

The ignore package follows gitignore semantics where patterns without a / character (like *.pid, .env, *.tgz, *.cache) match against the filename at any directory depth. Globby, in contrast, treats these as glob patterns that only match at the root level. This causes the git fast path to silently exclude files that the globby fallback would include.

Confirmed behavioral difference with test

globby with ignore: ['*.pid', '.env']:

  • packages/app/.env → INCLUDED
  • subdir/foo.pid → INCLUDED

ignore package with same patterns:

  • packages/app/.env → EXCLUDED
  • subdir/foo.pid → EXCLUDED

Affected default ignore patterns (from src/config/defaultIgnore.ts) include: .env, *.pid, *.seed, *.pid.lock, *.tgz, *.cache, .eslintcache, .node_repl_history, .lock-wscript, .hgignore. Any user-defined custom patterns without / are also affected. Since the git fast path is the default for all git repos (config.ignore.useGitignore && !explicitFiles at line 307), this silently changes which files are included in the output — for example, packages/app/.env in a monorepo that is tracked by git would be excluded by the fast path but included by globby.

Prompt for agents
The root cause is that rawIgnorePatterns (from getIgnorePatterns) are designed for globby's glob-based matching, but the ignore package uses gitignore semantics where slash-less patterns match at any depth. To fix this, the patterns passed to ig.add() need to be transformed so that slash-less patterns are anchored to the root. In gitignore semantics, prefixing a pattern with / anchors it to the root directory. So for each pattern in rawIgnorePatterns that does not contain a / character (e.g. *.pid, .env, *.tgz, *.cache), prefix it with / before adding it to the ignore instance. Patterns that already contain / (like dist/**, **/node_modules/**) should be left as-is since they already have correct path-anchoring behavior in both globby and the ignore package. The transformation should happen in searchFilesGitFastPath around line 153-155 where ig.add(adjustedIgnorePatterns) is called, or alternatively in prepareIgnoreContext when building rawIgnorePatterns.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

// Read ignore file patterns (e.g., **/.repomixignore, **/.ignore).
// Find matching files from the git output, read their patterns,
// and apply them scoped to the file's directory.
for (const filePattern of ignoreFilePatterns) {
const fileName = filePattern.replace(/^\*\*\//, '');
const matchingFiles = files.filter((f) => f === fileName || f.endsWith(`/${fileName}`));
Comment on lines +159 to +161
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Filtering the entire files array inside a loop for each ignore file pattern is inefficient (O(N * M)). Since ignoreFilePatterns usually contains a small number of fixed filenames (like .repomixignore), it is more efficient to perform a single pass over the files array to identify all relevant ignore files.

  const ignoreFileNames = ignoreFilePatterns.map((p) => p.replace(/^\*\*\//, ''));
  const ignoreFilesToProcess = files.filter((f) => {
    const base = path.basename(f);
    return ignoreFileNames.includes(base);
  });

  for (const ignoreFile of ignoreFilesToProcess) {


Comment on lines +160 to +162
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Git fast path misses .repomixignore/.ignore files that are gitignored but exist on disk

The git fast path discovers .repomixignore and .ignore files by searching within the git ls-files output (src/core/file/fileSearch.ts:161-162). However, git ls-files --cached --others --exclude-standard excludes files that are in .gitignore. If a .repomixignore file exists on disk but is gitignored (and untracked), it won't appear in the git output, so its patterns won't be applied. Globby's ignoreFiles option finds these files via filesystem traversal regardless of gitignore status. This means the git fast path could include files that should have been excluded by a gitignored .repomixignore.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

for (const ignoreFile of matchingFiles) {
try {
const content = await fs.readFile(path.join(rootDir, ignoreFile), 'utf-8');
const patterns = parseIgnoreContent(content);
const dir = path.dirname(ignoreFile);

if (dir === '.') {
// Root-level ignore file: patterns apply from root
ig.add(patterns);
} else {
// Nested ignore file: scope patterns to its directory.
// Prefix each pattern with the directory so the global
// ignore instance checks correctly against root-relative paths.
ig.add(
patterns.map((p) => {
if (p.startsWith('!')) return `!${dir}/${p.slice(1)}`;
if (p.startsWith('/')) return `${dir}${p}`;
return `${dir}/${p}`;
}),
);
}
} catch {
logger.trace(`Could not read ignore file: ${ignoreFile}`);
}
}
}

files = ig.filter(files);

// Apply include patterns (minimatch, same glob semantics as globby)
const isDefaultInclude = includePatterns.length === 1 && includePatterns[0] === '**/*';
if (!isDefaultInclude && includePatterns.length > 0) {
const matchers = includePatterns.map((p) => new Minimatch(p, { dot: true }));
files = files.filter((f) => matchers.some((m) => m.match(f)));
Comment on lines +195 to +196
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a potential regression here compared to globby semantics. globby (and fast-glob) automatically expands directory names in include patterns (e.g., src becomes src/**/*), whereas Minimatch performs literal glob matching. If a user provides a directory name in the include config, the git fast path will fail to match any files within that directory. Additionally, the PR description mentions using picomatch, but the implementation uses Minimatch. Consider using picomatch with appropriate options or expanding directory patterns to maintain consistency with the globby fallback.

}

return files;
};

// Get all file paths considering the config
export const searchFiles = async (
rootDir: string,
Expand Down Expand Up @@ -139,7 +253,10 @@ export const searchFiles = async (
}

try {
const { adjustedIgnorePatterns, ignoreFilePatterns } = await prepareIgnoreContext(rootDir, config);
const { adjustedIgnorePatterns, rawIgnorePatterns, ignoreFilePatterns } = await prepareIgnoreContext(
rootDir,
config,
);

logger.trace('Ignore patterns:', adjustedIgnorePatterns);
logger.trace('Ignore file patterns:', ignoreFilePatterns);
Expand Down Expand Up @@ -179,26 +296,50 @@ export const searchFiles = async (
logger.trace('Ignore patterns:', adjustedIgnorePatterns);
logger.trace('Ignore file patterns (for globby):', ignoreFilePatterns);

logger.debug('[globby] Starting file search...');
const globbyStartTime = Date.now();

const filePaths = await globby(includePatterns, {
...createBaseGlobbyOptions(rootDir, config, adjustedIgnorePatterns, ignoreFilePatterns),
onlyFiles: true,
}).catch((error: unknown) => {
// Handle EPERM errors specifically
const code = (error as NodeJS.ErrnoException | { code?: string })?.code;
if (code === 'EPERM' || code === 'EACCES') {
throw new PermissionError(
`Permission denied while scanning directory. Please check folder access permissions for your terminal app. path: ${rootDir}`,
rootDir,
);
// Try the git ls-files fast path first. It reads from the pre-built git
// index (~5ms) instead of walking the filesystem (~250ms with globby).
// Falls back to globby when:
// - not in a git repo
// - useGitignore is disabled (git ls-files relies on .gitignore)
// - explicit files are provided (stdin mode uses exact paths)
// - git command fails for any reason
const canUseGitFastPath = config.ignore.useGitignore && !explicitFiles;
let filePaths: string[] | null = null;

if (canUseGitFastPath) {
const gitStartTime = Date.now();
// Use raw (unescaped) include patterns for the git fast path.
// escapeGlobPattern transforms `src/(foo)` → `src/\(foo\)` for globby,
// but Minimatch needs literal patterns that match actual file paths.
const rawIncludePatterns = config.include.length > 0 ? config.include : ['**/*'];
filePaths = await searchFilesGitFastPath(rootDir, rawIncludePatterns, rawIgnorePatterns, ignoreFilePatterns);
if (filePaths !== null) {
const gitElapsedTime = Date.now() - gitStartTime;
logger.debug(`[git ls-files] Completed in ${gitElapsedTime}ms, found ${filePaths.length} files`);
}
throw error;
});
}

if (filePaths === null) {
logger.debug('[globby] Starting file search...');
const globbyStartTime = Date.now();

filePaths = await globby(includePatterns, {
...createBaseGlobbyOptions(rootDir, config, adjustedIgnorePatterns, ignoreFilePatterns),
onlyFiles: true,
}).catch((error: unknown) => {
const code = (error as NodeJS.ErrnoException | { code?: string })?.code;
if (code === 'EPERM' || code === 'EACCES') {
throw new PermissionError(
`Permission denied while scanning directory. Please check folder access permissions for your terminal app. path: ${rootDir}`,
rootDir,
);
}
throw error;
});

const globbyElapsedTime = Date.now() - globbyStartTime;
logger.debug(`[globby] Completed in ${globbyElapsedTime}ms, found ${filePaths.length} files`);
const globbyElapsedTime = Date.now() - globbyStartTime;
logger.debug(`[globby] Completed in ${globbyElapsedTime}ms, found ${filePaths.length} files`);
}

let emptyDirPaths: string[] = [];
if (config.output.includeEmptyDirectories) {
Expand Down Expand Up @@ -265,7 +406,11 @@ export const parseIgnoreContent = (content: string): string[] => {
const prepareIgnoreContext = async (
rootDir: string,
config: RepomixConfigMerged,
): Promise<{ adjustedIgnorePatterns: string[]; ignoreFilePatterns: string[] }> => {
): Promise<{
adjustedIgnorePatterns: string[];
rawIgnorePatterns: string[];
ignoreFilePatterns: string[];
}> => {
const [ignorePatterns, ignoreFilePatterns] = await Promise.all([
getIgnorePatterns(rootDir, config),
getIgnoreFilePatterns(config),
Expand All @@ -289,7 +434,21 @@ const prepareIgnoreContext = async (
}
}

return { adjustedIgnorePatterns, ignoreFilePatterns };
// Raw (un-normalized) patterns are used by the git fast path where
// the `ignore` package applies gitignore semantics directly. The
// normalizeGlobPattern transform (e.g., `**/foo` → `**/foo/**`) is
// designed for globby and would break `ignore`-based matching for
// file-level patterns like `**/package-lock.json`.
const rawIgnorePatterns = [...ignorePatterns];
if (isWorktree) {
const gitIndex = rawIgnorePatterns.indexOf('.git/**');
if (gitIndex !== -1) {
rawIgnorePatterns.splice(gitIndex, 1);
rawIgnorePatterns.push('.git');
}
}

return { adjustedIgnorePatterns, rawIgnorePatterns, ignoreFilePatterns };
};

/**
Expand Down
19 changes: 13 additions & 6 deletions src/core/packager.ts
Original file line number Diff line number Diff line change
Expand Up @@ -121,10 +121,17 @@ export const pack = async (
);

try {
// Run file collection and git operations in parallel since they are independent:
// - collectFiles reads file contents from disk
// - getGitDiffs/getGitLogs spawn git subprocesses
// Neither depends on the other's results.
// Start git operations early (before file collection) so their subprocesses
// run concurrently with collectFiles. Since searchFiles uses async execFile
// for `git ls-files` (non-blocking), the event loop is free to make progress
// on these git diff/log subprocesses during any I/O wait.
const gitDiffPromise = deps.getGitDiffs(rootDirs, config);
const gitLogPromise = deps.getGitLogs(rootDirs, config);
// Suppress unhandled rejection if collectFiles throws before Promise.all
// consumes these promises. The original promises are still awaited below.
Promise.resolve(gitDiffPromise).catch(() => {});
Promise.resolve(gitLogPromise).catch(() => {});
Comment on lines +132 to +133
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of Promise.resolve() here is redundant because gitDiffPromise and gitLogPromise are already promises returned by async functions. You can attach the .catch() directly to the promises to suppress the unhandled rejection warning.

Suggested change
Promise.resolve(gitDiffPromise).catch(() => {});
Promise.resolve(gitLogPromise).catch(() => {});
gitDiffPromise.catch(() => {});
gitLogPromise.catch(() => {});


progressCallback('Collecting files...');
const [collectResults, gitDiffResult, gitLogResult] = await Promise.all([
withMemoryLogging(
Expand All @@ -136,8 +143,8 @@ export const pack = async (
),
),
),
deps.getGitDiffs(rootDirs, config),
deps.getGitLogs(rootDirs, config),
gitDiffPromise,
gitLogPromise,
]);

const rawFiles = collectResults.flatMap((curr) => curr.rawFiles);
Expand Down
Loading