Conversation
Add an archive entry filter that checks file extensions with isBinaryPath before writing to disk, avoiding unnecessary I/O for binary files (images, fonts, executables, etc.) that would be excluded later anyway. The filter strips the leading tar segment (e.g. "repo-branch/") since tar's filter callback receives paths before strip is applied. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughAdded binary file filtering to tar archive extraction by introducing a factory function that skips binary entries during download, integrated into the archive download pipeline through dependency injection with comprehensive test coverage. Changes
Sequence DiagramsequenceDiagram
participant Client as Client Code
participant Archive as downloadAndExtractArchive
participant Filter as createArchiveEntryFilter
participant TarExt as tarExtract
participant BinDetect as isBinaryPath
Client->>Archive: call with GitHub archive URL
Archive->>Filter: create filter with isBinaryPath dep
Filter-->>Archive: return filter function
Archive->>TarExt: call with filter in options
loop for each tar entry
TarExt->>Filter: call filter(entryPath)
Filter->>Filter: strip leading directory segment
alt is root directory
Filter-->>TarExt: return true (allow)
else check file type
Filter->>BinDetect: check if binary(stripped path)
alt binary file detected
BinDetect-->>Filter: true
Filter-->>TarExt: return false (skip)
else text file
BinDetect-->>Filter: false
Filter-->>TarExt: return true (allow)
end
end
end
TarExt-->>Archive: extraction complete
Archive-->>Client: resolved
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
⚡ Performance Benchmark
Details
|
Deploying repomix with
|
| Latest commit: |
55d3293
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://cc8392cd.repomix.pages.dev |
| Branch Preview URL: | https://perf-skip-binary-files-durin.repomix.pages.dev |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1392 +/- ##
=======================================
Coverage 87.39% 87.40%
=======================================
Files 115 116 +1
Lines 4378 4389 +11
Branches 1015 1018 +3
=======================================
+ Hits 3826 3836 +10
- Misses 552 553 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Code Review
This pull request introduces a binary file filter for GitHub archive extraction to optimize disk I/O by skipping non-text assets. The implementation includes a new filtering utility and its integration into the extraction pipeline. Review feedback suggests enhancing the filter's robustness by verifying entry types to ensure only regular files are evaluated and simplifying the callback function passed to the extraction library.
| return (entryPath: string): boolean => { | ||
| // Remove the leading directory segment that tar's strip:1 would remove | ||
| const strippedPath = entryPath.replace(/^[^/]+\//, ''); | ||
|
|
||
| if (!strippedPath) { | ||
| // Root directory entry — always allow | ||
| return true; | ||
| } | ||
|
|
||
| if (deps.isBinaryPath(strippedPath)) { | ||
| logger.trace(`Skipping binary file in archive: ${strippedPath}`); | ||
| return false; | ||
| } | ||
|
|
||
| return true; | ||
| }; |
There was a problem hiding this comment.
The filter should explicitly check the entry type to ensure it only applies to regular files. This prevents potential issues where directories or other entry types with names resembling binary extensions (e.g., a repository named test.zip) might be incorrectly skipped if the stripping logic fails or the trailing slash is missing. Using the entry object provided by tar is a more robust approach.
return (entryPath: string, entry?: any): boolean => {
// Only filter regular files; allow directories, symlinks, etc.
if (entry && entry.type !== 'File') {
return true;
}
// Remove the leading directory segment that tar's strip:1 would remove
const strippedPath = entryPath.replace(/^[^/]+\//, '');
if (!strippedPath) {
// Root directory entry — always allow
return true;
}
if (deps.isBinaryPath(strippedPath)) {
logger.trace(`Skipping binary file in archive: ${strippedPath}`);
return false;
}
return true;
};| const extractStream = deps.tarExtract({ | ||
| cwd: targetDirectory, | ||
| strip: 1, | ||
| filter: (entryPath: string) => entryFilter(entryPath), |
There was a problem hiding this comment.
The filter callback can be simplified by passing the entryFilter function directly. This also ensures that all arguments provided by the tar library (like the entry object) are correctly passed to the filter function.
| filter: (entryPath: string) => entryFilter(entryPath), | |
| filter: entryFilter, |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/core/git/gitHubArchive.ts (1)
171-176: Inline wrapper aroundentryFilteris unnecessary.You can pass
entryFilterdirectly to reduce noise and keep call contracts clearer in tests.Small cleanup
const entryFilter = deps.createArchiveEntryFilter(); const extractStream = deps.tarExtract({ cwd: targetDirectory, strip: 1, - filter: (entryPath: string) => entryFilter(entryPath), + filter: entryFilter, });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/core/git/gitHubArchive.ts` around lines 171 - 176, The inline wrapper around entryFilter is unnecessary; replace the filter: (entryPath: string) => entryFilter(entryPath) argument passed to deps.tarExtract with filter: entryFilter so the createArchiveEntryFilter result from deps.createArchiveEntryFilter() is passed directly; update the call in the function using entryFilter (the variables entryFilter and extractStream created where deps.createArchiveEntryFilter() and deps.tarExtract(...) are invoked) and ensure the filter type matches the tarExtract contract.tests/core/git/gitHubArchive.test.ts (1)
99-121: Assert the factory dependency is invoked, not just that a filter exists.Good coverage improvement on
filter: expect.any(Function). Please also assert the injected factory is called so DI wiring can’t regress silently.Proposed test tightening
await downloadGitHubArchive(mockRepoInfo, mockTargetDirectory, mockOptions, undefined, mockDeps); + expect(mockCreateArchiveEntryFilter).toHaveBeenCalledTimes(1); + // Verify tar extract was called with correct options including filter expect(mockTarExtract).toHaveBeenCalledWith({ cwd: mockTargetDirectory, strip: 1, filter: expect.any(Function), });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/git/gitHubArchive.test.ts` around lines 99 - 121, The test currently only asserts that tar.extract received a filter function; also assert that the injected factory from mockDeps was invoked so dependency injection doesn't regress: after calling downloadGitHubArchive(mockRepoInfo, mockTargetDirectory, mockOptions, undefined, mockDeps) add an expectation that the factory on the deps object (e.g., mockDeps.createTarExtract or the specific factory mock used to produce mockTarExtract) was called (and optionally called with expected options), alongside the existing expect(mockTarExtract).toHaveBeenCalledWith(...) and filter assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/core/git/gitHubArchive.ts`:
- Around line 171-176: The inline wrapper around entryFilter is unnecessary;
replace the filter: (entryPath: string) => entryFilter(entryPath) argument
passed to deps.tarExtract with filter: entryFilter so the
createArchiveEntryFilter result from deps.createArchiveEntryFilter() is passed
directly; update the call in the function using entryFilter (the variables
entryFilter and extractStream created where deps.createArchiveEntryFilter() and
deps.tarExtract(...) are invoked) and ensure the filter type matches the
tarExtract contract.
In `@tests/core/git/gitHubArchive.test.ts`:
- Around line 99-121: The test currently only asserts that tar.extract received
a filter function; also assert that the injected factory from mockDeps was
invoked so dependency injection doesn't regress: after calling
downloadGitHubArchive(mockRepoInfo, mockTargetDirectory, mockOptions, undefined,
mockDeps) add an expectation that the factory on the deps object (e.g.,
mockDeps.createTarExtract or the specific factory mock used to produce
mockTarExtract) was called (and optionally called with expected options),
alongside the existing expect(mockTarExtract).toHaveBeenCalledWith(...) and
filter assertion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 63d8e942-bf30-4dcb-802e-b3c15bb19241
📒 Files selected for processing (4)
src/core/git/archiveEntryFilter.tssrc/core/git/gitHubArchive.tstests/core/git/archiveEntryFilter.test.tstests/core/git/gitHubArchive.test.ts
Skip binary files (images, fonts, executables, archives, etc.) during GitHub archive tar extraction by checking file extensions with
isBinaryPathbefore writing to disk.This avoids unnecessary disk I/O for files that would be excluded later in
readRawFileanyway. Particularly effective for repositories with many binary assets.Changes
src/core/git/archiveEntryFilter.ts— New module. Creates a filter function that strips the leading tar segment (repo-branch/) and checksisBinaryPathsrc/core/git/gitHubArchive.ts— AddedcreateArchiveEntryFilterto deps and passesfilteroption totarExtracttests/core/git/archiveEntryFilter.test.ts— 9 test cases covering text files, images, fonts, executables, root entries, nested paths, and DItests/core/git/gitHubArchive.test.ts— Updated mock deps and assertions for the new filterChecklist
npm run testnpm run lint