perf(core): Optimize file collection with UTF-8 fast path and promise pool #1155

Merged
yamadashy merged 3 commits into main from perf/utf8-fast-path-file-collect
Feb 17, 2026

@yamadashy
Owner

@yamadashy yamadashy commented Feb 17, 2026

Optimize file collection performance with two key changes:

UTF-8 fast path for encoding detection (fileRead.ts)

  • Try TextDecoder('utf-8', { fatal: true }) before falling back to jschardet.detect()
  • Since the vast majority of source files are UTF-8, this skips the expensive charset detection library in most cases
  • Handles UTF-8 BOM stripping in the fast path
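
A minimal sketch of the fast path described above, not the actual fileRead.ts code. `decodeWithFastPath`, `DecodeResult`, and the `slowPath` parameter are hypothetical names; `slowPath` stands in for the jschardet-based detection so the example stays self-contained.

```typescript
// Hypothetical result shape for this sketch only.
type DecodeResult = { content: string; usedFastPath: boolean };

// fatal: true makes decode() throw a TypeError on invalid UTF-8
// instead of silently inserting U+FFFD replacement characters.
const utf8Decoder = new TextDecoder('utf-8', { fatal: true });

function decodeWithFastPath(
  buffer: Uint8Array,
  slowPath: (buffer: Uint8Array) => string, // assumption: wraps charset detection
): DecodeResult {
  try {
    let content = utf8Decoder.decode(buffer);
    // Defensive BOM strip: drop a leading U+FEFF if the decoder left one in.
    if (content.charCodeAt(0) === 0xfeff) {
      content = content.slice(1);
    }
    return { content, usedFastPath: true };
  } catch {
    // Not valid UTF-8: fall back to the expensive detection path.
    return { content: slowPath(buffer), usedFastPath: false };
  }
}
```

The design choice is that strict decoding doubles as validation: a successful decode proves the bytes are valid UTF-8, so no separate detection pass is needed for those files.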

Replace worker threads with promise pool for file collection (fileCollect.ts)

  • File collection is I/O-bound (reading files from disk), not CPU-bound
  • Worker thread overhead (pool initialization ~90ms + IPC serialization) exceeds the actual work
  • Simple promise pool with concurrency limit of 50 provides better throughput with zero overhead
  • Removed the now-unused fileCollectWorker.ts and related worker infrastructure
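
The promise pool idea can be sketched as follows. The name `promisePool` appears in this PR, but the body below is an assumption about its general shape, not the code from fileCollect.ts.

```typescript
// Runs fn over items with at most `concurrency` calls in flight.
// Results keep the order of the input items.
async function promisePool<T, R>(
  items: readonly T[],
  concurrency: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let nextIndex = 0;

  // Each "worker" is just an async loop on the main thread that keeps
  // claiming the next unprocessed index until none remain.
  const worker = async (): Promise<void> => {
    while (nextIndex < items.length) {
      const i = nextIndex++; // synchronous claim, so no two workers share an index
      results[i] = await fn(items[i], i);
    }
  };

  // With zero items this creates zero workers and resolves immediately.
  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A call such as `promisePool(filePaths, 50, readOne)` would then keep up to 50 reads in flight, where `readOne` is a hypothetical per-file reader.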

Note: CPU-bound workers (fileProcess, securityCheck, calculateMetrics) are intentionally kept as worker threads since they benefit from true parallelism.

Checklist

  • Run npm run test
  • Run npm run lint

@gemini-code-assist
Contributor

Summary of Changes

Hello @yamadashy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the file collection process by introducing a UTF-8 fast path for encoding detection and replacing worker threads with a more efficient promise pool for I/O-bound tasks. These changes aim to reduce overhead and improve performance during file processing, particularly for large codebases, while retaining worker threads for truly CPU-bound operations.

Highlights

  • UTF-8 Fast Path for Encoding Detection: Implemented a fast path in fileRead.ts to attempt UTF-8 decoding first, significantly reducing the need for the more expensive jschardet.detect() for the majority of source files. This also includes stripping UTF-8 BOM.
  • Replaced Worker Threads with Promise Pool for File Collection: The fileCollect process, being I/O-bound rather than CPU-bound, no longer uses worker threads. It has been refactored to use a simple promise pool with a concurrency limit of 50, eliminating worker overhead and improving throughput.
  • Removed File Collection Worker Infrastructure: The dedicated fileCollectWorker.ts file and all related worker infrastructure (imports, type definitions, and worker path configurations) have been removed, simplifying the codebase.

Changelog
  • src/core/file/fileCollect.ts
    • Removed dependency on initTaskRunner and fileCollectWorker.
    • Introduced a promisePool utility for concurrent file reading.
    • Integrated readRawFile directly for file content retrieval.
    • Updated SkippedFileInfo interface to be defined locally.
    • Adjusted file collection logic to use the promise pool and direct readRawFile calls.
  • src/core/file/fileRead.ts
    • Implemented a UTF-8 fast path using TextDecoder with fatal: true for initial decoding attempts.
    • Added logic to strip UTF-8 BOM in the fast path.
    • Simplified the fallback encoding detection and error handling, removing redundant UTF-8 specific error checks.
  • src/core/file/workers/fileCollectWorker.ts
    • Removed the entire worker file as file collection is now handled on the main thread.
  • src/shared/processConcurrency.ts
    • Removed fileCollect from the WorkerType enum.
    • Removed the case for fileCollect in getWorkerPath function.
  • src/shared/unifiedWorker.ts
    • Removed fileCollect from the WorkerType type definition.
    • Removed the fileCollect case from loadWorkerHandler.
    • Removed the logic to infer fileCollect worker type from tasks.
  • tests/core/file/fileCollect.test.ts
    • Removed mocks for fs/promises, is-binary-path, isbinaryfile, jschardet, iconv-lite, and logger.
    • Replaced mockInitTaskRunner with a mock for readRawFile.
    • Updated test assertions to reflect direct calls to readRawFile and the absence of worker-related mocks.
  • tests/integration-tests/packager.test.ts
    • Removed imports and mocks related to fileCollectWorker and WorkerOptions.
    • Adjusted the collectFiles dependency injection to use readRawFile directly instead of initTaskRunner.
  • tests/shared/processConcurrency.test.ts
    • Updated createWorkerPool test to use fileProcess worker type instead of fileCollect.
  • tests/shared/unifiedWorker.test.ts
    • Removed mocks for fileCollectWorker.js.
    • Removed the test case for inferring fileCollect from tasks.
Activity
  • No human activity has been recorded on this pull request yet.

@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The PR removes the fileCollectWorker and replaces worker-based file collection with an in-process promise pool concurrency model. It introduces a SkippedFileInfo type, changes dependency injection from initTaskRunner to readRawFile, optimizes UTF-8 decoding in file reads, and updates test mocks to reflect the new architecture.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| File Collection System: src/core/file/fileCollect.ts | Replaces worker pool with generic promise pool, introduces SkippedFileInfo type, changes dependency from initTaskRunner to readRawFile, adds progress tracking and per-file logging, aggregates results into rawFiles and skippedFiles. |
| File Reading Optimization: src/core/file/fileRead.ts | Adds fast-path UTF-8 decoding with TextDecoder before falling back to jschardet, removes post-detection UTF-8 re-decode fallback, simplifies handling of invalid sequences by logging and skipping rather than retrying. |
| Worker Infrastructure Removal: src/core/file/workers/fileCollectWorker.ts, src/shared/processConcurrency.ts, src/shared/unifiedWorker.ts | Removes fileCollectWorker file entirely (56 lines), eliminates fileCollect case from worker path routing, removes fileCollect from WorkerType union and related handler logic. |
| Core Test Refactoring: tests/core/file/fileCollect.test.ts | Replaces mocks for fs/promises, jschardet, iconv-lite, and logger with new mockReadRawFile dependency injection mechanism, updates binary/size/error handling tests to use new skip reason patterns. |
| Integration & Shared Test Updates: tests/integration-tests/packager.test.ts, tests/shared/processConcurrency.test.ts, tests/shared/unifiedWorker.test.ts | Updates integration test to use readRawFile instead of initTaskRunner, changes worker type expectations from fileCollect to fileProcess, removes fileCollect mock and inference test. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: UTF-8 fast path optimization and replacement of worker threads with a promise pool for file collection. |
| Description check | ✅ Passed | The description provides clear context on the optimization rationale, explains both UTF-8 fast path and promise pool changes, and completes the checklist. It follows the template structure with summary and checklist. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@codecov

codecov bot commented Feb 17, 2026

Codecov Report

❌ Patch coverage is 97.36842% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 87.13%. Comparing base (05af605) to head (05f11f4).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/core/file/fileRead.ts | 80.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1155      +/-   ##
==========================================
- Coverage   87.19%   87.13%   -0.07%     
==========================================
  Files         116      115       -1     
  Lines        4390     4377      -13     
  Branches     1022     1016       -6     
==========================================
- Hits         3828     3814      -14     
- Misses        562      563       +1     

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
src/core/file/fileRead.ts (1)

50-50: Consider hoisting TextDecoder to module scope.

new TextDecoder('utf-8', { fatal: true }) is instantiated on every call. TextDecoder instances are reusable and stateless for decode() with default stream option (false). A module-level constant would avoid repeated allocation on hot paths.

♻️ Suggested change

Add at module scope:

+const utf8Decoder = new TextDecoder('utf-8', { fatal: true });

Then on line 50:

-      let content = new TextDecoder('utf-8', { fatal: true }).decode(buffer);
+      let content = utf8Decoder.decode(buffer);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/file/fileRead.ts` at line 50, Hoist the TextDecoder instantiation to
module scope to avoid recreating it on every call: create a module-level
constant (e.g., UTF8_DECODER) initialized with new TextDecoder('utf-8', { fatal:
true }) and replace the inline instantiation on the line that decodes buffer
(where content is assigned) with UTF8_DECODER.decode(buffer); ensure you keep
the same options and usage so only the construction moves to module scope while
decode(buffer) remains in the function.
src/core/file/fileCollect.ts (1)

69-74: Files with content: null and no skippedReason are silently dropped.

If readRawFile ever returns { content: null } without a skippedReason (e.g., from a future code path or unexpected state), the file would be silently omitted from both rawFiles and skippedFiles. Currently readRawFile always sets skippedReason when content is null, so this isn't a bug today, but the aggregation logic is fragile.

🛡️ Defensive alternative
   for (const { filePath, result } of results) {
     if (result.content !== null) {
       rawFiles.push({ path: filePath, content: result.content });
-    } else if (result.skippedReason) {
+    } else {
-      skippedFiles.push({ path: filePath, reason: result.skippedReason });
+      skippedFiles.push({ path: filePath, reason: result.skippedReason ?? 'encoding-error' });
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/file/fileCollect.ts` around lines 69 - 74, The loop in
fileCollect.ts that processes results currently only handles result.content !==
null and result.skippedReason, so any entry where content === null but
skippedReason is missing will be dropped; update the aggregation in the for
(const { filePath, result } of results) loop to defensively treat content ===
null with no skippedReason as a skipped file (push into skippedFiles) and
include a default reason like "unknown" or "no reason provided" so rawFiles and
skippedFiles remain exhaustive; reference the result object and the
rawFiles/skippedFiles arrays when adding this fallback handling.

@claude
Contributor

claude bot commented Feb 17, 2026

PR Review

Overall: Well-motivated performance optimization. The rationale for both changes is sound — file collection is I/O-bound and the UTF-8 fast path addresses a real hot path.

Highlights

  • UTF-8 fast path in fileRead.ts is a good optimization. TextDecoder('utf-8', { fatal: true }) correctly validates UTF-8 before falling through to the expensive jschardet.detect() path. The BOM handling is correct.
  • Worker → promise pool migration is well-justified for I/O-bound work. The concurrency limit of 50 is reasonable for balancing FD pressure and throughput.
  • Clean removal of fileCollectWorker.ts and all related worker infrastructure across processConcurrency.ts, unifiedWorker.ts, and tests.
  • Tests refactored cleanly — mocking at the readRawFile boundary is the right level of abstraction now that the worker indirection is gone.
  • Dependency injection pattern preserved via deps parameter, consistent with project conventions.

Issues

1. promisePool error propagation — low risk but worth noting

If fn throws, Promise.all rejects on the first error, but other in-flight workers continue running in the background until they complete or throw independently. In practice this is mitigated because readRawFile has a top-level try/catch that never throws, so the pool will always complete normally. However, as a general-purpose utility, promisePool doesn't handle errors gracefully — a thrown error could leave workers orphaned.

If this utility is intended for reuse beyond collectFiles, consider adding error handling (e.g., AbortController signal or at minimum documenting the contract that fn must not throw). For this PR's scope it's fine.
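
To illustrate the concern, here is one possible fail-fast variant, assuming a pool shaped like the one in this PR. It uses a shared `failed` flag rather than the AbortController approach mentioned above, and `promisePoolFailFast` is a hypothetical name.

```typescript
// Variant that stops claiming new items once any call has thrown.
// Calls already in flight still run to completion; they are not cancelled.
async function promisePoolFailFast<T, R>(
  items: readonly T[],
  concurrency: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let nextIndex = 0;
  let failed = false;

  const worker = async (): Promise<void> => {
    while (!failed && nextIndex < items.length) {
      const i = nextIndex++;
      try {
        results[i] = await fn(items[i], i);
      } catch (error) {
        failed = true; // signal sibling workers to stop picking up work
        throw error; // Promise.all below rejects with the first error
      }
    }
  };

  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Without the flag, the remaining workers would keep draining the queue in the background after `Promise.all` has already rejected, which is the orphaned-worker behavior described in this issue.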

2. Removed error logging in collectFiles

The old code had logger.error('Error during file collection:', error) in a catch block before rethrowing. The new code has no top-level error handling. Since readRawFile catches its own errors this is functionally equivalent, but if a future change introduces a throwing code path in the pool callback (e.g., a bug in path.resolve), the error would propagate without any logging at the collection level. Minor concern.

3. completedTasks counter accuracy under concurrency

The completedTasks++ counter is used for progress display. Since JavaScript is single-threaded and the increment happens synchronously before any await, this is safe — no two workers will read the same value. Just flagging that this correctness depends on the single-threaded nature of Node.js, which holds here.
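
A small sketch of why the counter is safe, under the assumption that the increment and the progress report run in the same synchronous stretch of code. `runWithProgress` and `onProgress` are hypothetical names, not identifiers from this PR.

```typescript
// Runs all tasks concurrently and reports progress after each one finishes.
async function runWithProgress(
  tasks: ReadonlyArray<() => Promise<void>>,
  onProgress: (done: number, total: number) => void,
): Promise<void> {
  let completedTasks = 0;
  await Promise.all(
    tasks.map(async (task) => {
      await task();
      // No await between the increment and the read: on a single-threaded
      // event loop no other task can interleave here, so every callback
      // observes a distinct value.
      completedTasks++;
      onProgress(completedTasks, tasks.length);
    }),
  );
}
```

If an `await` were placed between the increment and the callback, two tasks could report the same count, so the guarantee depends on keeping that stretch synchronous.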

Test Coverage

  • The new tests cover: normal collection, binary skipping, size limits, custom maxFileSize, encoding errors, and progress callbacks.
  • The old test "should use worker_threads runtime" was replaced with "should call progressCallback for each file", which tests the new architecture appropriately.
  • Integration test updated correctly to inject readRawFile directly.
  • One gap: no test for empty filePaths input (0 files). The promisePool handles this correctly (Math.min(concurrency, 0) → 0 workers → immediate return), but a test would document the behavior.

Premortem Analysis

| Scenario | Risk | Mitigation |
| --- | --- | --- |
| FD exhaustion with 50 concurrent reads | Low | 50 is well within typical OS limits (1024+); OS-level I/O scheduling handles this |
| Regression for non-UTF-8 files (Shift-JIS, etc.) | Low | Fast path correctly falls through to jschardet on TextDecoder failure |
| Breaking bundled environment | Low | fileCollect was removed from getWorkerPath and unifiedWorker; no runtime path references remain |
| TextDecoder not available in some environments | Very Low | Available in Node.js since v8.3.0; well within Repomix's support matrix |
| Performance regression if most files are non-UTF-8 | Very Low | For such codebases, every file gets decoded twice (TextDecoder + jschardet). Unlikely in practice |

Verdict: Looks good to merge. The changes are well-scoped, the rationale is clear, and the code is clean.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Contributor

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant performance optimizations for file collection by switching from worker threads to a promise pool and adding a UTF-8 fast path. However, a high-severity Path Traversal vulnerability was identified in the fileCollect.ts module due to a lack of validation for file paths, which could allow unauthorized file access. Additionally, there's a suggestion to improve general error handling to prevent potential unhandled promise rejections.

…detection

Previously, every file went through jschardet.detect() which scans the entire
buffer through multiple encoding probers (MBCS, SBCS, Latin1) with frequency
table lookups — the most expensive CPU operation in file collection.

Since ~99% of source code files are UTF-8, we now try TextDecoder('utf-8',
{ fatal: true }) first. If it succeeds, jschardet and iconv are skipped entirely.
Non-UTF-8 files (e.g., Shift-JIS, EUC-KR) fall back to the original detection path.

Additionally, set concurrentTasksPerWorker=3 for fileCollect workers to better
overlap I/O waits within each worker thread.

Benchmark results (838 files, 10 CPU cores):
- Before: ~616ms
- After:  ~108ms (5.7x faster)

After the UTF-8 fast path optimization eliminated the CPU-heavy jschardet
bottleneck, file collection became I/O-bound. Worker threads now add pure
overhead (Tinypool init, structured clone, IPC) without benefit.

Benchmark (954 files, M2 Pro 10-core):
- Worker Threads: ~108ms → Promise Pool (c=50): ~37ms (2.9x faster)

Changes:
- Replace Tinypool worker dispatch with a simple promise pool (c=50)
- Inject readRawFile via deps for testability
- Remove unused concurrentTasksPerWorker from WorkerOptions
- Simplify tests to use readRawFile mock instead of 5+ module mocks

File collection was replaced with a promise pool approach in 96ff05dc,
but the worker-related code remained. This removes the now-unused
fileCollectWorker and all references to it from the worker system.
@yamadashy yamadashy force-pushed the perf/utf8-fast-path-file-collect branch from 80ad93d to 05f11f4 Compare February 17, 2026 14:09
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Feb 17, 2026

Deploying repomix with Cloudflare Pages

Latest commit: 05f11f4
Status: ✅  Deploy successful!
Preview URL: https://9b291a4e.repomix.pages.dev
Branch Preview URL: https://perf-utf8-fast-path-file-col.repomix.pages.dev

@claude
Contributor

claude bot commented Feb 17, 2026

PR Review (Follow-up)

A previous review from Claude already covered the main points well. This follow-up focuses on evaluating AI bot suggestions and adding incremental observations.

AI Bot Comment Evaluation

CodeRabbit — Hoist TextDecoder to module scope (fileRead.ts:50)

Priority: Recommended

Valid suggestion. TextDecoder instances are stateless and reusable. Hoisting to module scope avoids repeated allocation on what is now the hot path for ~99% of files. The performance gain per call is small (object allocation), but across thousands of files it adds up. Worth doing if convenient, not blocking.

CodeRabbit — Defensive handling for content: null without skippedReason (fileCollect.ts:69-74)

Priority: Not needed

The FileReadResult type contract in fileRead.ts ensures that every content: null return also sets skippedReason. All 4 return paths with content: null in readRawFile set a skippedReason, and the catch-all at line 72 also sets 'encoding-error'. Adding a defensive fallback would mask future bugs rather than surfacing them. The current else if is the correct behavior — if a code path somehow returns { content: null } without a reason, it should be investigated, not papered over.
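
One way such a contract could be expressed at the type level is a discriminated union, sketched below. The actual FileReadResult definition is not shown in this thread, so the shapes here are assumptions; of the skip reasons, only 'encoding-error' appears in the discussion and the others are hypothetical.

```typescript
// Assumption: only 'encoding-error' is named in this thread.
type SkipReason = 'binary' | 'too-large' | 'encoding-error';

// A null content is only representable together with a reason,
// so "content: null without skippedReason" cannot type-check.
type FileReadResult =
  | { content: string }
  | { content: null; skippedReason: SkipReason };

type RawFile = { path: string; content: string };
type SkippedFileInfo = { path: string; reason: SkipReason };

function partitionResults(
  results: ReadonlyArray<{ filePath: string; result: FileReadResult }>,
): { rawFiles: RawFile[]; skippedFiles: SkippedFileInfo[] } {
  const rawFiles: RawFile[] = [];
  const skippedFiles: SkippedFileInfo[] = [];
  for (const { filePath, result } of results) {
    if (result.content !== null) {
      rawFiles.push({ path: filePath, content: result.content });
    } else {
      // Narrowed to the skipped variant: skippedReason is guaranteed here.
      skippedFiles.push({ path: filePath, reason: result.skippedReason });
    }
  }
  return { rawFiles, skippedFiles };
}
```

With this shape the compiler, rather than a runtime fallback, rules out the silently dropped case, which fits the argument that a missing reason should surface as a bug instead of being papered over.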

Gemini — Path Traversal vulnerability in fileCollect.ts

Priority: Not needed (for this PR)

The path.resolve(rootDir, filePath) pattern is pre-existing — the deleted fileCollectWorker.ts had the identical logic at line 28. This PR is a 1:1 behavioral migration from workers to a promise pool. While path traversal via stdin is a valid concern for the broader codebase, it's out of scope for this performance PR and should be tracked as a separate issue if desired.
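
For a follow-up issue, a containment check could look roughly like the sketch below. It is not part of this PR or of Repomix; `resolveInsideRoot` is a hypothetical helper built only on the standard `node:path` functions.

```typescript
import * as path from 'node:path';

// Resolves filePath against rootDir and returns null when the result
// would land outside rootDir (for example via "../" segments).
function resolveInsideRoot(rootDir: string, filePath: string): string | null {
  const root = path.resolve(rootDir);
  const full = path.resolve(root, filePath);
  const rel = path.relative(root, full);
  const escapes =
    rel === '..' ||
    rel.startsWith(`..${path.sep}`) ||
    path.isAbsolute(rel); // e.g. a different drive on Windows
  return escapes ? null : full;
}
```

This is a lexical check only; it does not follow symlinks, so a real fix would need to decide whether symlinked paths should also be resolved before comparison.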

One Additional Observation

The promisePool result array is pre-allocated with undefined slots
const results: R[] = Array.from({ length: items.length });

This creates an array of undefined values typed as R[]. If a worker throws before assigning to results[i], the corresponding slot remains undefined but typed as R. This is safe in practice because Promise.all would reject on the first error, so the results array is never consumed. But if the pool were ever changed to use Promise.allSettled, the undefined slots could leak through. A minor type-safety note, not actionable for this PR.

Verdict: Agree with the previous review — looks good to merge.

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

@yamadashy yamadashy merged commit dfe23c3 into main Feb 17, 2026
57 checks passed
@yamadashy yamadashy deleted the perf/utf8-fast-path-file-collect branch February 17, 2026 15:09