perf(core): Optimize file collection with UTF-8 fast path and promise pool #1155
Conversation
Summary of Changes

Hello @yamadashy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly optimizes the file collection process by introducing a UTF-8 fast path for encoding detection and replacing worker threads with a more efficient promise pool for I/O-bound tasks. These changes aim to reduce overhead and improve performance during file processing, particularly for large codebases, while retaining worker threads for truly CPU-bound operations.
📝 Walkthrough

The PR removes the fileCollectWorker and replaces worker-based file collection with an in-process promise pool concurrency model. It introduces a UTF-8 fast path for encoding detection before falling back to full charset detection.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Codecov Report

❌ Patch coverage dipped slightly: overall coverage moves from 87.19% to 87.13% (−0.07%).

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #1155    +/-   ##
========================================
- Coverage   87.19%   87.13%   -0.07%
========================================
  Files         116      115       -1
  Lines        4390     4377      -13
  Branches     1022     1016       -6
========================================
- Hits         3828     3814      -14
- Misses        562      563       +1
```

☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (2)
src/core/file/fileRead.ts (1)
50-50: Consider hoisting `TextDecoder` to module scope.

`new TextDecoder('utf-8', { fatal: true })` is instantiated on every call. `TextDecoder` instances are reusable and stateless for `decode()` with the default stream option (`false`). A module-level constant would avoid repeated allocation on hot paths.

♻️ Suggested change — add at module scope:

```diff
+const utf8Decoder = new TextDecoder('utf-8', { fatal: true });
```

Then on line 50:

```diff
-let content = new TextDecoder('utf-8', { fatal: true }).decode(buffer);
+let content = utf8Decoder.decode(buffer);
```

src/core/file/fileCollect.ts (1)
69-74: Files with `content: null` and no `skippedReason` are silently dropped.

If `readRawFile` ever returns `{ content: null }` without a `skippedReason` (e.g., from a future code path or unexpected state), the file would be silently omitted from both `rawFiles` and `skippedFiles`. Currently `readRawFile` always sets `skippedReason` when `content` is null, so this isn't a bug today, but the aggregation logic is fragile.

🛡️ Defensive alternative:

```diff
 for (const { filePath, result } of results) {
   if (result.content !== null) {
     rawFiles.push({ path: filePath, content: result.content });
-  } else if (result.skippedReason) {
-    skippedFiles.push({ path: filePath, reason: result.skippedReason });
+  } else {
+    skippedFiles.push({ path: filePath, reason: result.skippedReason ?? 'encoding-error' });
   }
 }
```
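The defensive shape this comment suggests can be sketched as follows. This is a simplified illustration, not Repomix's actual code: the `ReadResult` interface, the `aggregate` helper, and the `'unknown'` fallback reason are stand-in names for this example only.

```typescript
// Simplified stand-in for the result shape readRawFile produces.
interface ReadResult {
  content: string | null;
  skippedReason?: string;
}

// Aggregate read results so that every input file ends up in exactly one
// of the two output lists, even if a null-content entry lacks a reason.
function aggregate(results: { filePath: string; result: ReadResult }[]) {
  const rawFiles: { path: string; content: string }[] = [];
  const skippedFiles: { path: string; reason: string }[] = [];
  for (const { filePath, result } of results) {
    if (result.content !== null) {
      rawFiles.push({ path: filePath, content: result.content });
    } else {
      // Record every content === null entry, falling back to a default
      // reason instead of silently dropping the file.
      skippedFiles.push({ path: filePath, reason: result.skippedReason ?? 'unknown' });
    }
  }
  return { rawFiles, skippedFiles };
}
```

With this shape, `rawFiles.length + skippedFiles.length` always equals the number of inputs, which makes the aggregation exhaustive by construction.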
PR Review

Overall: Well-motivated performance optimization. The rationale for both changes is sound — file collection is I/O-bound and the UTF-8 fast path addresses a real hot path.

Highlights

Issues

1.
| Scenario | Risk | Mitigation |
|---|---|---|
| FD exhaustion with 50 concurrent reads | Low | 50 is well within typical OS limits (1024+); OS-level I/O scheduling handles this |
| Regression for non-UTF-8 files (Shift-JIS, etc.) | Low | Fast path correctly falls through to jschardet on TextDecoder failure |
| Breaking bundled environment | Low | fileCollect was removed from getWorkerPath and unifiedWorker; no runtime path references remain |
| TextDecoder not available in some environments | Very Low | Available in Node.js since v8.3.0; well within Repomix's support matrix |
| Performance regression if most files are non-UTF-8 | Very Low | For such codebases, every file gets decoded twice (TextDecoder + jschardet). Unlikely in practice |
Verdict: Looks good to merge. The changes are well-scoped, the rationale is clear, and the code is clean.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Code Review
This pull request introduces significant performance optimizations for file collection by switching from worker threads to a promise pool and adding a UTF-8 fast path. However, a high-severity Path Traversal vulnerability was identified in the fileCollect.ts module due to a lack of validation for file paths, which could allow unauthorized file access. Additionally, there's a suggestion to improve general error handling to prevent potential unhandled promise rejections.
…detection
Previously, every file went through jschardet.detect() which scans the entire
buffer through multiple encoding probers (MBCS, SBCS, Latin1) with frequency
table lookups — the most expensive CPU operation in file collection.
Since ~99% of source code files are UTF-8, we now try TextDecoder('utf-8',
{ fatal: true }) first. If it succeeds, jschardet and iconv are skipped entirely.
Non-UTF-8 files (e.g., Shift-JIS, EUC-KR) fall back to the original detection path.
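The fast path described above can be sketched as follows. This is a minimal illustration under the commit's stated approach, not Repomix's actual code: `decodeWithFastPath` and the `detectFallback` parameter are hypothetical names, and the fallback stub stands in for the real jschardet/iconv path.

```typescript
// Module-level decoder: decode() without { stream: true } is stateless,
// so one instance can be reused for every file read.
const utf8Decoder = new TextDecoder('utf-8', { fatal: true });

// Try strict UTF-8 first; with fatal: true, decode() throws on any invalid
// byte sequence, so a successful return guarantees the buffer was UTF-8.
function decodeWithFastPath(
  buffer: Uint8Array,
  detectFallback: (buf: Uint8Array) => string,
): string {
  try {
    return utf8Decoder.decode(buffer);
  } catch {
    // Non-UTF-8 content (e.g., Shift-JIS, EUC-KR) lands here and goes
    // through full charset detection instead.
    return detectFallback(buffer);
  }
}
```

Because ~99% of source files are valid UTF-8, the expensive fallback almost never runs, which is what produces the benchmark improvement below.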
Additionally, set concurrentTasksPerWorker=3 for fileCollect workers to better
overlap I/O waits within each worker thread.
Benchmark results (838 files, 10 CPU cores):
- Before: ~616ms
- After: ~108ms (5.7x faster)
After the UTF-8 fast path optimization eliminated the CPU-heavy jschardet bottleneck, file collection became I/O-bound. Worker threads now add pure overhead (Tinypool init, structured clone, IPC) without benefit.

Benchmark (954 files, M2 Pro 10-core):
- Worker Threads: ~108ms
- Promise Pool (c=50): ~37ms (2.9x faster)

Changes:
- Replace Tinypool worker dispatch with a simple promise pool (c=50)
- Inject readRawFile via deps for testability
- Remove unused concurrentTasksPerWorker from WorkerOptions
- Simplify tests to use readRawFile mock instead of 5+ module mocks
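A promise pool of the kind this commit describes can be sketched like this. It is an illustrative sketch, not the PR's actual implementation; `runWithConcurrency` is a hypothetical name.

```typescript
// Run task() over items with at most `concurrency` promises in flight.
// Results are returned in input order regardless of completion order.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Spawn up to `concurrency` workers; each repeatedly claims the next
  // unclaimed index. JS is single-threaded, so next++ cannot race.
  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    async () => {
      while (next < items.length) {
        const i = next++;
        results[i] = await task(items[i]);
      }
    },
  );
  await Promise.all(workers);
  return results;
}
```

For I/O-bound work like file reads, this overlaps waits on the event loop with none of the worker-thread costs (pool init, structured clone, IPC) that the commit message cites.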
File collection was replaced with a promise pool approach in 96ff05dc, but the worker-related code remained. This removes the now-unused fileCollectWorker and all references to it from the worker system.
Force-pushed from 80ad93d to 05f11f4.
Deploying repomix with Cloudflare Pages

| | |
|---|---|
| Latest commit: | 05f11f4 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://9b291a4e.repomix.pages.dev |
| Branch Preview URL: | https://perf-utf8-fast-path-file-col.repomix.pages.dev |
PR Review (Follow-up)

A previous review from Claude already covered the main points well. This follow-up focuses on evaluating AI bot suggestions and adding incremental observations.

AI Bot Comment Evaluation

- CodeRabbit — Hoist `TextDecoder` to module scope: valid suggestion.
- CodeRabbit — Defensive handling for `content: null` without a `skippedReason`:
- Gemini — Path Traversal vulnerability in `fileCollect.ts`:

One Additional Observation
Optimize file collection performance with two key changes:

1. UTF-8 fast path for encoding detection (`fileRead.ts`)
   - Try `TextDecoder('utf-8', { fatal: true })` before falling back to `jschardet.detect()`
2. Replace worker threads with promise pool for file collection (`fileCollect.ts`)
   - Remove `fileCollectWorker.ts` and related worker infrastructure

Note: CPU-bound workers (`fileProcess`, `securityCheck`, `calculateMetrics`) are intentionally kept as worker threads since they benefit from true parallelism.

Checklist

- `npm run test`
- `npm run lint`