
perf(core): Skip binary files during GitHub archive tar extraction #1392

Merged
yamadashy merged 1 commit into main from perf/skip-binary-files-during-archive-extraction
Apr 4, 2026

Conversation

Owner

@yamadashy yamadashy commented Apr 4, 2026

Skip binary files (images, fonts, executables, archives, etc.) during GitHub archive tar extraction by checking file extensions with isBinaryPath before writing to disk.

Before: HTTP → gunzip → tar extract (all files) → globby → readFile → isBinaryPath exclusion
After:  HTTP → gunzip → tar extract (skip binary) → globby → readFile

This avoids unnecessary disk I/O for files that would be excluded later in readRawFile anyway. Particularly effective for repositories with many binary assets.

Changes

  • src/core/git/archiveEntryFilter.ts — New module. Creates a filter function that strips the leading tar segment (repo-branch/) and checks isBinaryPath
  • src/core/git/gitHubArchive.ts — Added createArchiveEntryFilter to deps and passes filter option to tarExtract
  • tests/core/git/archiveEntryFilter.test.ts — 9 test cases covering text files, images, fonts, executables, root entries, nested paths, and DI
  • tests/core/git/gitHubArchive.test.ts — Updated mock deps and assertions for the new filter
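
The filter described in the list above can be sketched roughly as follows. This is a self-contained approximation, not the merged code: `SAMPLE_BINARY_EXTENSIONS` and `isBinaryPathStandIn` are hypothetical stand-ins for the `is-binary-path` package, and the `Sketch` suffix marks the factory as illustrative.

```typescript
// Hypothetical stand-in for the is-binary-path package: a tiny extension set.
const SAMPLE_BINARY_EXTENSIONS = new Set(["png", "jpg", "gif", "woff2", "exe", "zip"]);

const isBinaryPathStandIn = (filePath: string): boolean => {
  const dot = filePath.lastIndexOf(".");
  const slash = filePath.lastIndexOf("/");
  // No extension in the final path segment means "not binary" for this sketch.
  if (dot === -1 || dot < slash) return false;
  return SAMPLE_BINARY_EXTENSIONS.has(filePath.slice(dot + 1).toLowerCase());
};

interface ArchiveEntryFilterDeps {
  isBinaryPath: (filePath: string) => boolean;
}

// Factory with an injectable dependency, so tests can swap the detector.
const createArchiveEntryFilterSketch =
  (deps: ArchiveEntryFilterDeps = { isBinaryPath: isBinaryPathStandIn }) =>
  (entryPath: string): boolean => {
    // Drop the leading "repo-branch/" segment before checking the extension.
    const strippedPath = entryPath.replace(/^[^/]+\//, "");
    if (!strippedPath) return true; // root directory entry
    return !deps.isBinaryPath(strippedPath);
  };

const sketchFilter = createArchiveEntryFilterSketch();
console.log(sketchFilter("repomix-main/")); // true
console.log(sketchFilter("repomix-main/src/index.ts")); // true
console.log(sketchFilter("repomix-main/website/logo.png")); // false

// Dependency injection: a detector that never reports binary lets everything through.
const allowEverything = createArchiveEntryFilterSketch({ isBinaryPath: () => false });
console.log(allowEverything("repomix-main/website/logo.png")); // true
```

Injecting the detector keeps the path-stripping logic testable without depending on the real extension list.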

Checklist

  • Run npm run test
  • Run npm run lint


Add an archive entry filter that checks file extensions with isBinaryPath
before writing to disk, avoiding unnecessary I/O for binary files (images,
fonts, executables, etc.) that would be excluded later anyway.

The filter strips the leading tar segment (e.g. "repo-branch/") since tar's
filter callback receives paths before strip is applied.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

coderabbitai bot commented Apr 4, 2026

Walkthrough

Added binary file filtering to tar archive extraction by introducing a factory function that skips binary entries during download, integrated into the archive download pipeline through dependency injection with comprehensive test coverage.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Archive Entry Filter Logic (src/core/git/archiveEntryFilter.ts) | New factory function createArchiveEntryFilter creates a filter predicate that strips leading tar path segments, detects binary files via is-binary-path, and returns false to skip binary entries while allowing text files and directories. |
| Archive Download Integration (src/core/git/gitHubArchive.ts) | Integrated the entry filter into downloadAndExtractArchive by adding createArchiveEntryFilter dependency to ArchiveDownloadDeps and passing the created filter to tarExtract via the filter option. |
| Test Coverage (tests/core/git/archiveEntryFilter.test.ts, tests/core/git/gitHubArchive.test.ts) | Added comprehensive test suite for filter behavior (text files, binary types, archives, directories, path stripping) and updated integration tests to verify filter is passed to tarExtract. |

Sequence Diagram

sequenceDiagram
    participant Client as Client Code
    participant Archive as downloadAndExtractArchive
    participant Filter as createArchiveEntryFilter
    participant TarExt as tarExtract
    participant BinDetect as isBinaryPath

    Client->>Archive: call with GitHub archive URL
    Archive->>Filter: create filter with isBinaryPath dep
    Filter-->>Archive: return filter function
    Archive->>TarExt: call with filter in options
    
    loop for each tar entry
        TarExt->>Filter: call filter(entryPath)
        Filter->>Filter: strip leading directory segment
        alt is root directory
            Filter-->>TarExt: return true (allow)
        else check file type
            Filter->>BinDetect: check if binary(stripped path)
            alt binary file detected
                BinDetect-->>Filter: true
                Filter-->>TarExt: return false (skip)
            else text file
                BinDetect-->>Filter: false
                Filter-->>TarExt: return true (allow)
            end
        end
    end
    
    TarExt-->>Archive: extraction complete
    Archive-->>Client: resolved
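
The per-entry loop in the diagram can be approximated without a real archive. In this sketch, `simulateExtract` is a hypothetical helper that imitates the extractor asking the filter about each unstripped entry path and then dropping the first path segment; `looksBinary` stands in for `isBinaryPath`, and the entry list is made up.

```typescript
type EntryFilter = (entryPath: string) => boolean;

// Stand-in for isBinaryPath, limited to a few extensions for illustration.
const looksBinary = (p: string): boolean => /\.(png|ico|woff2|zip)$/i.test(p);

const diagramFilter: EntryFilter = (entryPath) => {
  const stripped = entryPath.replace(/^[^/]+\//, "");
  return stripped === "" || !looksBinary(stripped);
};

// Hypothetical extraction loop: the filter sees the full entry path,
// and only surviving entries get their leading segment removed.
const simulateExtract = (entries: string[], keep: EntryFilter): string[] =>
  entries
    .filter((entryPath) => keep(entryPath))
    .map((entryPath) => entryPath.split("/").slice(1).join("/"))
    .filter((p) => p !== "");

const written = simulateExtract(
  [
    "repo-main/",
    "repo-main/README.md",
    "repo-main/docs/logo.png",
    "repo-main/src/",
    "repo-main/src/cli.ts",
  ],
  diagramFilter,
);
console.log(written); // ["README.md", "src/", "src/cli.ts"]
```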

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • yamadashy/repomix#1153: Extends tar.gz streaming extraction work by adding the entry filter and wiring it into the existing ArchiveDownloadDeps/tarExtract pipeline.
  • yamadashy/repomix#1006: Related binary detection refactor that migrates to is-binary-path across the codebase, which this PR also adopts in the new filter.
🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding logic to skip binary files during tar extraction as a performance optimization. |
| Description check | ✅ Passed | The description comprehensively covers the changes, provides before/after pipeline comparison, lists all modified files, includes testing details, and completes the required checklist. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Contributor

github-actions bot commented Apr 4, 2026

⚡ Performance Benchmark

Latest commit: 55d3293 perf(core): Skip binary files during GitHub archive tar extraction
Status: ✅ Benchmark complete!
Ubuntu: 1.53s (±0.02s) → 1.52s (±0.02s) · -0.01s (-0.4%)
macOS: 0.90s (±0.12s) → 0.89s (±0.05s) · -0.01s (-1.6%)
Windows: 1.89s (±0.05s) → 1.90s (±0.04s) · +0.01s (+0.4%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
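
For readers unfamiliar with the "median ± IQR" summary, here is a small sketch of one way to compute it. The quartile convention (medians of the lower and upper halves) is an assumption, since the benchmark workflow's exact method is not stated here, and the sample timings are made up.

```typescript
// Read an element or fail loudly, so empty inputs do not yield NaN silently.
const pick = (xs: number[], i: number): number => {
  const v = xs[i];
  if (v === undefined) throw new RangeError("index out of range");
  return v;
};

// Median of an already sorted array.
const medianOf = (sorted: number[]): number => {
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (pick(sorted, mid - 1) + pick(sorted, mid)) / 2
    : pick(sorted, mid);
};

// Assumed convention: Q1 and Q3 are the medians of the lower and upper halves.
const summarize = (samples: number[]): { median: number; iqr: number } => {
  const sorted = [...samples].sort((a, b) => a - b);
  const half = Math.floor(sorted.length / 2);
  const lower = sorted.slice(0, half);
  const upper = sorted.slice(sorted.length - half);
  return { median: medianOf(sorted), iqr: medianOf(upper) - medianOf(lower) };
};

// Hypothetical run times in milliseconds (not taken from the benchmark above).
const runsMs = [1500, 1520, 1530, 1510, 1540, 1530, 1550, 1520];
const summary = summarize(runsMs);
console.log(summary); // { median: 1525, iqr: 20 }
```

The median and IQR are less sensitive to a single slow run than a mean and standard deviation would be, which suits noisy CI runners.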

@cloudflare-workers-and-pages

Deploying repomix with Cloudflare Pages

Latest commit: 55d3293
Status: ✅  Deploy successful!
Preview URL: https://cc8392cd.repomix.pages.dev
Branch Preview URL: https://perf-skip-binary-files-durin.repomix.pages.dev



codecov bot commented Apr 4, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 87.40%. Comparing base (ff7db1b) to head (55d3293).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/core/git/gitHubArchive.ts | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1392   +/-   ##
=======================================
  Coverage   87.39%   87.40%           
=======================================
  Files         115      116    +1     
  Lines        4378     4389   +11     
  Branches     1015     1018    +3     
=======================================
+ Hits         3826     3836   +10     
- Misses        552      553    +1     
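
The percentages in the report follow from the hit and line counts in the diff. The sketch below redoes the arithmetic; treating patch coverage as added hits over added lines is an assumption that happens to match the +10 hits and +11 lines shown above.

```typescript
// Format a hits/lines ratio as a percentage with two decimals.
const pct = (hits: number, lines: number): string =>
  ((hits / lines) * 100).toFixed(2) + "%";

const baseCoverage = pct(3826, 4378); // main
const headCoverage = pct(3836, 4389); // this PR
// Assumed definition: covered added lines divided by all added lines.
const patchCoverage = pct(3836 - 3826, 4389 - 4378);

console.log(baseCoverage); // 87.39%
console.log(headCoverage); // 87.40%
console.log(patchCoverage); // 90.91%
```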


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a binary file filter for GitHub archive extraction to optimize disk I/O by skipping non-text assets. The implementation includes a new filtering utility and its integration into the extraction pipeline. Review feedback suggests enhancing the filter's robustness by verifying entry types to ensure only regular files are evaluated and simplifying the callback function passed to the extraction library.

Comment on lines +10 to +25
  return (entryPath: string): boolean => {
    // Remove the leading directory segment that tar's strip:1 would remove
    const strippedPath = entryPath.replace(/^[^/]+\//, '');

    if (!strippedPath) {
      // Root directory entry — always allow
      return true;
    }

    if (deps.isBinaryPath(strippedPath)) {
      logger.trace(`Skipping binary file in archive: ${strippedPath}`);
      return false;
    }

    return true;
  };

medium

The filter should explicitly check the entry type to ensure it only applies to regular files. This prevents potential issues where directories or other entry types with names resembling binary extensions (e.g., a repository named test.zip) might be incorrectly skipped if the stripping logic fails or the trailing slash is missing. Using the entry object provided by tar is a more robust approach.

  return (entryPath: string, entry?: any): boolean => {
    // Only filter regular files; allow directories, symlinks, etc.
    if (entry && entry.type !== 'File') {
      return true;
    }

    // Remove the leading directory segment that tar's strip:1 would remove
    const strippedPath = entryPath.replace(/^[^/]+\//, '');

    if (!strippedPath) {
      // Root directory entry — always allow
      return true;
    }

    if (deps.isBinaryPath(strippedPath)) {
      logger.trace(`Skipping binary file in archive: ${strippedPath}`);
      return false;
    }

    return true;
  };
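
The suggestion above types the second argument as `any`. A slightly stricter sketch could declare only the field the filter reads. `EntryLike` is a hypothetical minimal shape rather than tar's real entry type, and `looksBinaryPath` is a stand-in detector that deliberately ignores a trailing slash so the directory case raised in the comment is visible.

```typescript
// Hypothetical minimal shape: only the field this filter inspects.
interface EntryLike {
  type: string;
}

// Stand-in detector; tolerating a trailing slash is an assumption made
// here to reproduce the "directory named like an archive" scenario.
const looksBinaryPath = (p: string): boolean => /\.(png|zip|exe)\/?$/i.test(p);

const createTypedFilter =
  (isBinary: (p: string) => boolean) =>
  (entryPath: string, entry?: EntryLike): boolean => {
    // Only regular files are candidates for skipping.
    if (entry && entry.type !== "File") return true;

    const stripped = entryPath.replace(/^[^/]+\//, "");
    if (!stripped) return true;
    return !isBinary(stripped);
  };

const typedFilter = createTypedFilter(looksBinaryPath);

// Directory whose name ends in .zip: kept because of the type check.
console.log(typedFilter("repo-main/fixtures.zip/", { type: "Directory" })); // true
// Same path without entry information: the stand-in detector rejects it.
console.log(typedFilter("repo-main/fixtures.zip/")); // false
console.log(typedFilter("repo-main/logo.png", { type: "File" })); // false
console.log(typedFilter("repo-main/src/a.ts", { type: "File" })); // true
```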

  const extractStream = deps.tarExtract({
    cwd: targetDirectory,
    strip: 1,
    filter: (entryPath: string) => entryFilter(entryPath),

medium

The filter callback can be simplified by passing the entryFilter function directly. This also ensures that all arguments provided by the tar library (like the entry object) are correctly passed to the filter function.

Suggested change
- filter: (entryPath: string) => entryFilter(entryPath),
+ filter: entryFilter,
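
To see why the wrapper matters once the filter accepts a second argument, consider this small sketch. `TarFilter`, `recordingFilter`, and `callLikeExtractor` are hypothetical names; the caller only imitates a library that passes both the path and an entry object.

```typescript
type TarFilter = (entryPath: string, entry?: { type: string }) => boolean;

const seenTypes: string[] = [];

// Records whether the entry argument actually arrived.
const recordingFilter: TarFilter = (_entryPath, entry) => {
  seenTypes.push(entry ? entry.type : "<missing>");
  return true;
};

// Hypothetical caller that passes both arguments, as an extractor might.
const callLikeExtractor = (f: TarFilter): boolean =>
  f("repo-main/a.ts", { type: "File" });

// Wrapper form: forwards only the path, so the entry is dropped.
callLikeExtractor((entryPath: string) => recordingFilter(entryPath));
// Direct form: every argument reaches the filter.
callLikeExtractor(recordingFilter);

console.log(seenTypes); // ["<missing>", "File"]
```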

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
src/core/git/gitHubArchive.ts (1)

171-176: Inline wrapper around entryFilter is unnecessary.

You can pass entryFilter directly to reduce noise and keep call contracts clearer in tests.

Small cleanup
     const entryFilter = deps.createArchiveEntryFilter();
     const extractStream = deps.tarExtract({
       cwd: targetDirectory,
       strip: 1,
-      filter: (entryPath: string) => entryFilter(entryPath),
+      filter: entryFilter,
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/core/git/gitHubArchive.ts` around lines 171 - 176, The inline wrapper
around entryFilter is unnecessary; replace the filter: (entryPath: string) =>
entryFilter(entryPath) argument passed to deps.tarExtract with filter:
entryFilter so the createArchiveEntryFilter result from
deps.createArchiveEntryFilter() is passed directly; update the call in the
function using entryFilter (the variables entryFilter and extractStream created
where deps.createArchiveEntryFilter() and deps.tarExtract(...) are invoked) and
ensure the filter type matches the tarExtract contract.
tests/core/git/gitHubArchive.test.ts (1)

99-121: Assert the factory dependency is invoked, not just that a filter exists.

Good coverage improvement on filter: expect.any(Function). Please also assert the injected factory is called so DI wiring can’t regress silently.

Proposed test tightening
     await downloadGitHubArchive(mockRepoInfo, mockTargetDirectory, mockOptions, undefined, mockDeps);

+    expect(mockCreateArchiveEntryFilter).toHaveBeenCalledTimes(1);
+
     // Verify tar extract was called with correct options including filter
     expect(mockTarExtract).toHaveBeenCalledWith({
       cwd: mockTargetDirectory,
       strip: 1,
       filter: expect.any(Function),
     });
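
The idea behind the tightened assertion can be illustrated without a test framework. In this sketch, `runExtraction`, `SketchDeps`, and the hand-rolled counters are hypothetical stand-ins for `downloadGitHubArchive` and the vitest mocks; the point is only that both the factory call count and the options handed to the extractor are checked.

```typescript
type SketchEntryFilter = (entryPath: string) => boolean;

interface SketchExtractOptions {
  cwd: string;
  strip: number;
  filter: SketchEntryFilter;
}

interface SketchDeps {
  createArchiveEntryFilter: () => SketchEntryFilter;
  tarExtract: (options: SketchExtractOptions) => void;
}

// Hypothetical function under test: builds the filter, then hands it over.
const runExtraction = (targetDirectory: string, deps: SketchDeps): void => {
  const entryFilter = deps.createArchiveEntryFilter();
  deps.tarExtract({ cwd: targetDirectory, strip: 1, filter: entryFilter });
};

// Hand-rolled spies in place of framework mocks.
const calls = { factory: 0 };
const extractCalls: SketchExtractOptions[] = [];
const producedFilter: SketchEntryFilter = () => true;

const sketchDeps: SketchDeps = {
  createArchiveEntryFilter: () => {
    calls.factory += 1;
    return producedFilter;
  },
  tarExtract: (options) => {
    extractCalls.push(options);
  },
};

runExtraction("/tmp/target", sketchDeps);

console.log(calls.factory); // 1
console.log(extractCalls.length); // 1
console.log(extractCalls[0]?.filter === producedFilter); // true
```

Checking identity of the produced filter is stricter than checking for any function, and it only works when the filter is passed directly rather than wrapped.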
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/git/gitHubArchive.test.ts` around lines 99 - 121, The test
currently only asserts that tar.extract received a filter function; also assert
that the injected factory from mockDeps was invoked so dependency injection
doesn't regress: after calling downloadGitHubArchive(mockRepoInfo,
mockTargetDirectory, mockOptions, undefined, mockDeps) add an expectation that
the factory on the deps object (e.g., mockDeps.createTarExtract or the specific
factory mock used to produce mockTarExtract) was called (and optionally called
with expected options), alongside the existing
expect(mockTarExtract).toHaveBeenCalledWith(...) and filter assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 63d8e942-bf30-4dcb-802e-b3c15bb19241

📥 Commits

Reviewing files that changed from the base of the PR and between ff7db1b and 55d3293.

📒 Files selected for processing (4)
  • src/core/git/archiveEntryFilter.ts
  • src/core/git/gitHubArchive.ts
  • tests/core/git/archiveEntryFilter.test.ts
  • tests/core/git/gitHubArchive.test.ts

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


@yamadashy yamadashy merged commit 2accd6a into main Apr 4, 2026
67 checks passed
@yamadashy yamadashy deleted the perf/skip-binary-files-during-archive-extraction branch April 4, 2026 14:42