Handle large files gracefully#302

Merged
yamadashy merged 7 commits into yamadashy:main from slavashvets:main
Jan 22, 2025

Conversation

@slavashvets
Contributor

@slavashvets slavashvets commented Jan 20, 2025

Fixes critical out-of-memory errors when processing repositories containing large files

The Problem

When a large file is accidentally added to a repository (such as a 480MB txt file), Repomix tries to load it into memory and crashes with a Node.js heap out-of-memory error:

❯ repomix

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
[1]    63614 abort      repomix

This provides a poor user experience as:

  1. The error is not user-friendly or actionable
  2. The program crashes instead of gracefully handling the situation
  3. Users don't know which file caused the problem

The Solution

Added a size limit check (50MB) before loading files into memory, with clear user guidance:

📦 Repomix v0.2.21

⚠️ Large File Warning:
────────────────────────────────────────────────
File exceeds size limit: 486.8MB > 50MB (/path/to/large/file)
Add this file to .repomixignore if you want to exclude it permanently

✔ Packing completed successfully!

Changes

  • Added 50MB file size limit to prevent memory issues
  • Added user-friendly warning message that:
    • Shows the file size and limit
    • Points to the specific file causing the issue
    • Suggests adding it to .repomixignore
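
As a rough illustration of the changes listed above, the size comparison and warning text could be sketched as follows. The helper names `toMB` and `largeFileWarning` are assumptions made for this example, not necessarily the names used in Repomix; the message strings are taken from the sample output in this description.

```typescript
// Illustrative sketch only: helper names are hypothetical, not Repomix's actual API.
const MAX_FILE_SIZE = 50 * 1024 * 1024; // 50MB expressed in bytes (52,428,800)

// Convert a byte count to megabytes with one decimal place.
const toMB = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);

// Return the warning lines for an oversized file, or null when the file is within the limit.
function largeFileWarning(
  filePath: string,
  sizeInBytes: number,
  limit: number = MAX_FILE_SIZE,
): string[] | null {
  // A file exactly at the limit is still accepted; only larger files are skipped.
  if (sizeInBytes <= limit) {
    return null;
  }
  return [
    "⚠️ Large File Warning:",
    `File exceeds size limit: ${toMB(sizeInBytes)}MB > ${limit / 1024 / 1024}MB (${filePath})`,
    "Add this file to .repomixignore if you want to exclude it permanently",
  ];
}
```

Keeping the check a pure function of the byte count means it can run on metadata alone, before any file contents are read.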

Why

Large files are often added by accident (log files, lock files, etc.). Instead of crashing, Repomix should help users identify and handle these files appropriately. The 50MB limit was chosen because:

  1. It's well above typical source code file sizes
  2. It matches the size above which GitHub warns about large files (50 MiB)
  3. It ensures stable memory usage even on systems with limited RAM
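
To put the third point in perspective, the limit can be compared with the heap ceiling of the running Node.js process. This is a rough sketch rather than a measurement: `worstCaseStringBytes` is a hypothetical helper, and it assumes the decoded text is held as a single V8 string, which uses at most two bytes per character.

```typescript
import { getHeapStatistics } from "node:v8";

const MAX_FILE_SIZE = 50 * 1024 * 1024; // the 50MB limit from this PR, in bytes

// Rough upper bound on the heap taken by a file's decoded text:
// V8 stores a string with one or two bytes per character.
const worstCaseStringBytes = (fileBytes: number): number => fileBytes * 2;

const toMB = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);

// V8's configured heap ceiling for the current process, in bytes.
const heapLimit = getHeapStatistics().heap_size_limit;

console.log(`Heap limit: ${toMB(heapLimit)}MB`);
console.log(`Worst-case string size for a file at the limit: ${toMB(worstCaseStringBytes(MAX_FILE_SIZE))}MB`);
console.log(`Share of heap: ${((worstCaseStringBytes(MAX_FILE_SIZE) / heapLimit) * 100).toFixed(1)}%`);
```

The printed values depend on the Node.js version and any `--max-old-space-size` setting, so they are not shown here.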

…mory errors

BREAKING CHANGE: files larger than 50MB will now be skipped with a warning
instead of being processed. This prevents out-of-memory crashes but may
change behavior for repositories containing large files.

Changes:
- Added 50MB file size limit to prevent memory issues
- Added user-friendly warning message

The 50MB limit matches GitHub's recommended file size limit and ensures
stable memory usage even on systems with limited RAM.
@coderabbitai
Contributor

coderabbitai bot commented Jan 20, 2025

Walkthrough

The changes in src/core/file/fileCollect.ts introduce a new constant, MAX_FILE_SIZE (50MB), to prevent memory issues when processing large files. The readRawFile function now checks the file size with fs.stat() before reading the contents. If a file exceeds the limit, the function logs a warning and returns null, which skips the file. The existing binary file detection is unchanged but now runs after the size check.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant readRawFile
    participant fs
    
    Caller->>readRawFile: Request to read file
    readRawFile->>fs: Check file size (fs.stat)
    alt File size exceeds MAX_FILE_SIZE
        readRawFile-->>Caller: Return null (log warning)
    else File size within limit
        readRawFile->>readRawFile: Check if binary file
        alt Is binary file
            readRawFile-->>Caller: Return null (log warning)
        else Not a binary file
            readRawFile->>readRawFile: Read file contents
            readRawFile-->>Caller: Return file contents
        end
    end
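
The flow in the diagram can be sketched in TypeScript roughly as below. This is a simplified illustration under stated assumptions, not the code from the PR: `FileIO`, `nodeIO`, `looksBinary`, and `readRawFileSketch` are hypothetical names, the extension-based binary check stands in for the project's real detection, and the file is read as UTF-8 instead of going through encoding detection.

```typescript
import { readFile, stat } from "node:fs/promises";
import { extname } from "node:path";

const MAX_FILE_SIZE = 50 * 1024 * 1024; // 50MB in bytes

// Minimal I/O surface (an assumption for this sketch) so the flow can be tested without real files.
interface FileIO {
  size(filePath: string): Promise<number>;
  read(filePath: string): Promise<string>;
}

// Default implementation backed by Node's fs/promises API.
const nodeIO: FileIO = {
  size: async (filePath) => (await stat(filePath)).size,
  read: (filePath) => readFile(filePath, "utf8"),
};

// Hypothetical stand-in for the project's binary detection.
const BINARY_EXTENSIONS = new Set([".png", ".jpg", ".zip", ".pdf"]);
const looksBinary = (filePath: string): boolean =>
  BINARY_EXTENSIONS.has(extname(filePath).toLowerCase());

async function readRawFileSketch(
  filePath: string,
  io: FileIO = nodeIO,
  maxSize: number = MAX_FILE_SIZE,
): Promise<string | null> {
  try {
    // 1. Size check first: only metadata is touched, no contents are loaded.
    const size = await io.size(filePath);
    if (size > maxSize) {
      console.warn(`Skipping large file: ${filePath} (${size} bytes)`);
      return null;
    }
    // 2. Binary check second.
    if (looksBinary(filePath)) {
      return null;
    }
    // 3. Only now read the contents into memory.
    return await io.read(filePath);
  } catch (error) {
    // Assumption: unreadable files are skipped rather than aborting the run.
    console.warn(`Failed to read file: ${filePath}`, error);
    return null;
  }
}
```

A caller would then drop the `null` results, so skipped files simply do not appear in the packed output.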

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/core/file/fileCollect.ts (1)

35-46: LGTM! Efficient size check with user-friendly warning.

The implementation efficiently checks file size before loading content and provides clear, actionable feedback to users.

Consider extracting formatting logic.

Consider moving the size formatting and warning message template to utility functions for better maintainability and reuse.

+const formatSizeInMB = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);
+
+const logFileSizeWarning = (filePath: string, actualSize: number, maxSize: number): void => {
+  logger.log('⚠️ Large File Warning:');
+  logger.log('────────────────────────────────────────────────');
+  logger.log(`File exceeds size limit: ${formatSizeInMB(actualSize)}MB > ${formatSizeInMB(maxSize)}MB (${filePath})`);
+  logger.note('Add this file to .repomixignore if you want to exclude it permanently');
+  logger.log('');
+};
+
 const readRawFile = async (filePath: string): Promise<string | null> => {
   try {
     const stats = await fs.stat(filePath);
     
     if (stats.size > MAX_FILE_SIZE) {
-      const sizeMB = (stats.size / 1024 / 1024).toFixed(1);
-      logger.log('⚠️ Large File Warning:');
-      logger.log('────────────────────────────────────────────────');
-      logger.log(`File exceeds size limit: ${sizeMB}MB > ${MAX_FILE_SIZE / 1024 / 1024}MB (${filePath})`);
-      logger.note('Add this file to .repomixignore if you want to exclude it permanently');
-      logger.log('');
+      logFileSizeWarning(filePath, stats.size, MAX_FILE_SIZE);
       return null;
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31bfb35 and 22a8823.

📒 Files selected for processing (1)
  • src/core/file/fileCollect.ts (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: CI
src/core/file/fileCollect.ts

[warning] 34-34: Extra whitespace found and automatically fixed by Biome formatter

🔇 Additional comments (4)
src/core/file/fileCollect.ts (4)

11-13: LGTM! Well-documented constant with appropriate size limit.

The 50MB limit is a reasonable choice that aligns with GitHub's recommendations and provides a good balance between accommodating legitimate source files while preventing memory issues.


48-52: LGTM! Optimal order of file checks.

The binary file check is correctly positioned after the size check, maintaining the efficient check order (quick checks before expensive ones).


34-34: Formatting issue already addressed.

The extra whitespace was automatically fixed by the Biome formatter.


11-13: Verify potential impact on existing large files.

Let's check if there are any existing files in the codebase that would be affected by this size limit.

✅ Verification successful

No large files found that would be affected by the size limit

The 50MB file size limit is safe to implement as there are no existing files in the codebase that approach this limit.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find files larger than 45MB (90% of limit) to identify potential issues
find_large_files() {
  find . -type f -size +45M ! -path "*/\.*" -exec du -h {} \; | \
  sort -rh | \
  while read size file; do
    echo "Large file found: $file ($size)"
  done
}

echo "Checking for large files (>45MB)..."
find_large_files

Length of output: 194
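
For readers who prefer to run the same check from Node.js, a rough TypeScript equivalent of the shell script is sketched below. `findLargeFiles` and `LargeFile` are hypothetical names introduced for this example, and the hidden-entry filter is only an approximation of the `! -path "*/\.*"` exclusion.

```typescript
import { readdir, stat } from "node:fs/promises";
import { join } from "node:path";

interface LargeFile {
  path: string;
  size: number; // bytes
}

// Recursively collect regular files larger than thresholdBytes, largest first.
async function findLargeFiles(rootDir: string, thresholdBytes: number): Promise<LargeFile[]> {
  const found: LargeFile[] = [];
  const entries = await readdir(rootDir, { withFileTypes: true });
  for (const entry of entries) {
    // Skip hidden files and directories such as .git.
    if (entry.name.startsWith(".")) {
      continue;
    }
    const fullPath = join(rootDir, entry.name);
    if (entry.isDirectory()) {
      found.push(...(await findLargeFiles(fullPath, thresholdBytes)));
    } else if (entry.isFile()) {
      const { size } = await stat(fullPath);
      if (size > thresholdBytes) {
        found.push({ path: fullPath, size });
      }
    }
  }
  // Largest first, similar to `sort -rh` in the script.
  return found.sort((a, b) => b.size - a.size);
}
```

Calling `findLargeFiles(".", 45 * 1024 * 1024)` from the repository root would mirror the 45MB threshold used in the script.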

- Add missing fs.stat mocks to handle large files check
- Change mocking approach to be more consistent with codebase
- Extract common fs.stat mock to beforeEach hook
@codecov

codecov bot commented Jan 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.17%. Comparing base (31bfb35) to head (ff14cd4).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #302      +/-   ##
==========================================
+ Coverage   92.12%   92.17%   +0.04%     
==========================================
  Files          44       44              
  Lines        2236     2248      +12     
  Branches      491      493       +2     
==========================================
+ Hits         2060     2072      +12     
  Misses        176      176              


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/core/file/fileCollect.test.ts (1)

66-88: Enhance test coverage and maintainability.

While the test correctly verifies basic large file handling, consider these improvements:

  1. Verify the complete warning message to ensure proper user guidance
  2. Add assertion that fs.readFile wasn't called for the large file
  3. Extract the size constant for better maintainability
+  const SIXTY_MB = 60 * 1024 * 1024;
   it('should skip large files', async () => {
     const mockFilePaths = ['large.txt', 'normal.txt'];
     const mockRootDir = '/root';

     vi.mocked(fs.stat)
       .mockResolvedValueOnce({  // for large.txt
-        size: 60 * 1024 * 1024,
+        size: SIXTY_MB,
         isFile: () => true,
       } as Stats)
       .mockResolvedValueOnce({  // for normal.txt
         size: 1024,
         isFile: () => true,
       } as Stats);
     vi.mocked(isBinary).mockReturnValue(false);
     vi.mocked(fs.readFile).mockResolvedValue(Buffer.from('file content'));
     vi.mocked(jschardet.detect).mockReturnValue({ encoding: 'utf-8', confidence: 0.99 });
     vi.mocked(iconv.decode).mockReturnValue('decoded content');

     const result = await collectFiles(mockFilePaths, mockRootDir);

     expect(result).toEqual([{ path: 'normal.txt', content: 'decoded content' }]);
-    expect(logger.log).toHaveBeenCalledWith('⚠️ Large File Warning:');
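+    // The warning is emitted as separate one-argument calls, so assert each call on its own
+    expect(logger.log).toHaveBeenCalledWith(expect.stringContaining('⚠️ Large File Warning:'));
+    expect(logger.log).toHaveBeenCalledWith(expect.stringContaining('large.txt'));
+    expect(logger.log).toHaveBeenCalledWith(expect.stringContaining('50MB'));
+    expect(logger.note).toHaveBeenCalledWith(expect.stringContaining('.repomixignore'));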
+    expect(fs.readFile).not.toHaveBeenCalledWith(
+      path.resolve('/root/large.txt'),
+      expect.any(Object)
+    );
   });
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22a8823 and f727a6d.

📒 Files selected for processing (1)
  • tests/core/file/fileCollect.test.ts (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Test (windows-latest, 23.x)
  • GitHub Check: Test (windows-latest, 22.x)
  • GitHub Check: Test (windows-latest, 21.x)
  • GitHub Check: Test (windows-latest, 20.x)
  • GitHub Check: Test (windows-latest, 19.x)
  • GitHub Check: Test (windows-latest, 18.x)
  • GitHub Check: Test (windows-latest, 18.0.0)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (2)
tests/core/file/fileCollect.test.ts (2)

1-1: LGTM! Well-structured mock setup for file stats.

The Stats type import and fs.stat mock setup are properly implemented to support the new file size checking functionality.

Also applies to: 20-25


53-55: LGTM! Improved mock readability.

The reformatted mock setup with inline comments makes the test's behavior more clear and maintainable.

- Remove trailing whitespace
- Fix comment alignment in mock definitions
- Move inline comments to their own lines in object literals
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/core/file/fileCollect.test.ts (1)

1-1: LGTM! Consider enhancing the stat mock.

The Stats import and basic stat mock setup look good. However, consider making the mock more robust by implementing all required Stats properties.

     vi.mocked(fs.stat).mockResolvedValue({
       size: 1024,
       isFile: () => true,
+      isDirectory: () => false,
+      isSymbolicLink: () => false,
     } as Stats);

Also applies to: 20-25

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f727a6d and 2c3ca53.

📒 Files selected for processing (1)
  • tests/core/file/fileCollect.test.ts (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Test (windows-latest, 23.x)
  • GitHub Check: Test (windows-latest, 22.x)
  • GitHub Check: Test (windows-latest, 21.x)
  • GitHub Check: Build and run (windows-latest, 22.x)
  • GitHub Check: Test (windows-latest, 20.x)
  • GitHub Check: Test (windows-latest, 19.x)
  • GitHub Check: Test (windows-latest, 18.x)
🔇 Additional comments (1)
tests/core/file/fileCollect.test.ts (1)

53-55: Clean formatting improvement!

The mock setup is now more readable with explicit comments indicating the behavior for each file type.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/core/file/fileCollect.ts (1)

35-48: LGTM! Efficient size check with user-friendly warning.

The implementation efficiently checks file size before loading content and provides clear, actionable feedback to users.

Consider adding the exact file size in bytes to the warning message for debugging purposes:

-logger.log(`File exceeds size limit: ${sizeMB}MB > ${MAX_FILE_SIZE / 1024 / 1024}MB (${filePath})`);
+logger.log(`File exceeds size limit: ${sizeMB}MB > ${MAX_FILE_SIZE / 1024 / 1024}MB (${stats.size} bytes) (${filePath})`);
tests/core/file/fileCollect.test.ts (1)

66-99: Enhance size-related assertions in the large file test.

The test implementation looks good and addresses previous review feedback. Consider adding these assertions for more precise size verification:

 expect(logger.log).toHaveBeenCalledWith(expect.stringContaining('File exceeds size limit:'));
+const expectedSizeMB = ((MAX_FILE_SIZE + 1024) / 1024 / 1024).toFixed(1);
+expect(logger.log).toHaveBeenCalledWith(
+  expect.stringContaining(`${expectedSizeMB}MB > ${MAX_FILE_SIZE / 1024 / 1024}MB`)
+);
 expect(logger.log).toHaveBeenCalledWith(expect.stringContaining(largePath));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c3ca53 and ff14cd4.

📒 Files selected for processing (2)
  • src/core/file/fileCollect.ts (2 hunks)
  • tests/core/file/fileCollect.test.ts (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: Test (windows-latest, 23.x)
  • GitHub Check: Test (windows-latest, 22.x)
  • GitHub Check: Test (windows-latest, 20.x)
  • GitHub Check: Test (windows-latest, 19.x)
  • GitHub Check: Test (windows-latest, 18.0.0)
🔇 Additional comments (4)
src/core/file/fileCollect.ts (2)

11-13: LGTM! Well-documented constant with appropriate size limit.

The 50MB limit aligns with the PR objectives and GitHub's recommended file size limit.


49-54: LGTM! Proper ordering of file checks.

The binary check is correctly positioned after the size check, with appropriate debug logging.

tests/core/file/fileCollect.test.ts (2)

8-8: LGTM! Proper test setup with necessary imports.

The import and mock setup provide a good foundation for testing the new functionality.

Also applies to: 21-25


53-55: LGTM! Improved mock readability.

The mock chain is now more readable with clear comments.

@yamadashy
Owner

Hi, @slavashvets!
Thank you again for the PR! Really appreciate your continued contributions.

You're right that we don't need to process such large files.
The code looks perfect with the minor adjustments I've made.

Going to merge this. Looking forward to your future contributions!

@yamadashy yamadashy merged commit 4302ae5 into yamadashy:main Jan 22, 2025
53 checks passed

2 participants