perf(core): Replace ZIP archive with streaming tar.gz extraction #1153
Conversation
📝 Walkthrough

This PR migrates GitHub archive extraction from ZIP-based in-memory/temporary file handling to streaming tar.gz format with gunzip decompression. Archive URLs are updated from `.zip` to `.tar.gz`, and the extraction pipeline is refactored to use streaming decompression and tar extraction to disk.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant HTTP as HTTP Fetch
    participant Gunzip as Gunzip Stream
    participant TarExtract as Tar Extractor
    participant FileSystem as File System
    Client->>HTTP: Request tar.gz archive URL
    HTTP-->>Gunzip: Streaming compressed data
    Gunzip->>Gunzip: Decompress gzip
    Gunzip-->>TarExtract: Streaming decompressed tar data
    TarExtract->>TarExtract: Parse tar entries
    TarExtract->>FileSystem: Extract files to target directory
    FileSystem-->>TarExtract: Write complete
    TarExtract-->>Client: Extraction complete
```
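The diagram above maps directly onto Node's stream primitives. Below is a minimal, self-contained sketch of that pipeline shape using only Node built-ins — the HTTP response and the tar stage are stubbed with in-memory streams, so this illustrates the streaming structure and is not repomix's actual code:

```javascript
// Sketch of the streaming pipeline: (stubbed) HTTP body -> gunzip -> (stubbed)
// tar extractor. No intermediate buffer ever holds the whole archive;
// data flows through the gunzip stage chunk by chunk.
import { Readable, Writable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import zlib from 'node:zlib';

// Stand-in for the HTTP response body: a stream of gzip-compressed bytes.
const compressed = zlib.gzipSync(Buffer.from('hello tar.gz world'));
const responseBody = Readable.from([compressed]);

// Stand-in for the tar extractor: collects decompressed bytes instead of
// parsing tar entries and writing files to disk.
const chunks = [];
const tarStage = new Writable({
  write(chunk, _enc, cb) {
    chunks.push(chunk);
    cb();
  },
});

// Stream the compressed data through gunzip into the sink.
await pipeline(responseBody, zlib.createGunzip(), tarStage);
const result = Buffer.concat(chunks).toString();
```

In the real pipeline, `responseBody` would be the HTTP response stream and the final stage would be the tar extractor writing entries to the target directory; the gunzip stage is the same.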
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks: ✅ 4 passed
Summary of Changes

Hello @yamadashy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the GitHub repository archive download and extraction mechanism, replacing the previous ZIP-based approach.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##             main    #1153   +/-   ##
=======================================
  Coverage   87.13%   87.14%
=======================================
  Files         115      115
  Lines        4377     4310   -67
  Branches     1016      998   -18
=======================================
- Hits         3814     3756   -58
+ Misses        563      554    -9
```
Deploying repomix with Cloudflare Pages

Latest commit: 2e98b5c
Status: ✅ Deploy successful!
Preview URL: https://51477703.repomix.pages.dev
Branch Preview URL: https://refactor-replace-zip-with-ta.repomix.pages.dev
Code Review
This is an excellent refactoring that replaces the in-memory ZIP extraction with a full streaming pipeline for tar.gz archives. This change effectively resolves the memory consumption issues and limitations with large repositories. The switch to the tar package not only enables efficient streaming but also simplifies the codebase significantly by leveraging its built-in features for directory stripping and path traversal protection. The code is cleaner, more robust, and the accompanying test updates are thorough and well-adapted to the new implementation. Great work!
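The directory stripping and path traversal protection credited to the `tar` package above can be illustrated with a hand-rolled sketch of what happens to each entry path. This is not the `tar` package's actual implementation, and `safeEntryPath` is a hypothetical name:

```javascript
// Illustrative sketch of tar-style `strip: 1` plus path traversal
// protection, applied per entry path.
import path from 'node:path';

function safeEntryPath(entryPath, strip, targetDir) {
  // Drop the first `strip` components (e.g. the "repo-main/" prefix
  // GitHub puts at the top of every archive).
  const parts = entryPath.split('/').filter(Boolean).slice(strip);
  if (parts.length === 0) return null; // nothing left after stripping

  // Resolve against the target directory and reject anything that
  // escapes it (e.g. entries containing "..").
  // posix semantics are used here to keep the sketch deterministic.
  const resolved = path.posix.resolve(targetDir, parts.join('/'));
  const base = path.posix.resolve(targetDir);
  if (!resolved.startsWith(base + '/')) return null;
  return resolved;
}

const ok = safeEntryPath('react-main/src/index.js', 1, '/tmp/out');
const evil = safeEntryPath('react-main/../../etc/passwd', 1, '/tmp/out');
```

A benign entry resolves inside the target directory; a traversal attempt resolves outside it and is rejected.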
Code Review — Replace ZIP archive with streaming tar.gz extraction

This is a well-motivated refactoring that replaces the in-memory ZIP extraction (via `fflate`) with a streaming tar.gz pipeline.

Overall Assessment

Approve — The changes are clean, well-structured, and the streaming approach is the right architectural choice. A few minor observations below.

Observations

1. Dead code: `getArchiveFilename` is no longer used.
| Aspect | Assessment |
|---|---|
| Code quality | Clean simplification, good use of Node.js streaming primitives |
| Security | Path traversal handled by tar package — appropriate delegation |
| Performance | Streaming pipeline eliminates memory constraints — major improvement |
| Test coverage | Adequate; removed tests are correctly obsolete |
| Dependencies | fflate removed, tar added — net improvement for this use case |
No blocking issues. Nice refactoring.
Reviewed by Claude
Force-pushed d34764b to 452e087
Code Review — Replace ZIP archive with streaming tar.gz extraction (updated)

This is a well-motivated refactoring that replaces the in-memory ZIP extraction (via `fflate`) with a streaming tar.gz pipeline.

Overall Assessment

Approve — Clean, well-structured changes. The streaming approach is the right choice. A few observations below (none blocking).
| Aspect | Assessment |
|---|---|
| Code quality | Clean simplification, good use of Node.js streaming primitives |
| Security | Path traversal handled by tar package (v7.x) — appropriate delegation |
| Performance | Streaming pipeline eliminates memory constraints — major improvement |
| Test coverage | Adequate; Codecov shows all modified lines covered |
| Dependencies | fflate removed, tar added — net improvement for this use case |
No blocking issues.
Reviewed by Claude
…tion

The previous ZIP-based archive download used fflate's in-memory extraction, which failed on large repositories (e.g. facebook/react) due to memory constraints and ZIP64 limitations. Switch to tar.gz format with Node.js built-in zlib + tar package, enabling a full streaming pipeline (HTTP response -> gunzip -> tar extract -> disk) with no temporary files and constant memory usage regardless of repo size.

Key changes:
- Replace fflate with tar package for archive extraction
- Change archive URLs from .zip to .tar.gz
- Use streaming pipeline instead of download-then-extract
- Leverage tar's built-in strip and path traversal protection
- Explicitly destroy streams after pipeline for Bun compatibility
- Use child_process runtime under Bun to avoid worker_threads hang
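The "explicitly destroy streams after pipeline" point from the commit message can be sketched as follows. The streams here are in-memory stand-ins, not repomix's actual gunzip/tar stages:

```javascript
// Sketch: after pipeline() settles, explicitly destroy each stage so no
// handle is left open (reported as needed under Bun). destroy() is
// idempotent, so this is safe even on runtimes that already clean up.
import { Readable, PassThrough, Writable } from 'node:stream';
import { pipeline } from 'node:stream/promises';

const source = Readable.from(['chunk-1', 'chunk-2']);
const gunzipLike = new PassThrough(); // stand-in for zlib.createGunzip()
const sink = new Writable({ write(_c, _e, cb) { cb(); } }); // stand-in for the tar extractor

try {
  await pipeline(source, gunzipLike, sink);
} finally {
  // pipeline() normally tears streams down on Node, but destroying
  // explicitly guards against runtimes that leave stages open.
  for (const s of [source, gunzipLike, sink]) s.destroy();
}
const allDestroyed = [source, gunzipLike, sink].every((s) => s.destroyed);
```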
Force-pushed 452e087 to ef194b8
The workaround that forced child_process runtime under Bun was added to prevent hangs caused by fileCollect worker threads. Since fileCollect has been migrated to a promise pool on the main thread, the hang no longer occurs and this workaround is unnecessary.
The streaming tar.gz extraction no longer uses temporary files, making this filename generation function unnecessary.
Code Review — Replace ZIP archive with streaming tar.gz extraction (incremental)

Note: Two prior Claude reviews exist on this PR. This review provides only incremental feedback on the current state of the code.

Overall Assessment

Approve — This is a well-executed refactoring. The streaming tar.gz approach is the right architectural decision, and the implementation is clean.

Corrections to Prior Reviews

The two earlier Claude reviews mentioned `getArchiveFilename` as dead code; it has since been removed. The earlier review also mentioned a Bun workaround in `processConcurrency.ts`, which is out of scope for this PR.

One Remaining Actionable Item

`extractStream.destroy()` is not called after the pipeline — minor.
The pipeline catch block was wrapping all errors with 'Failed to extract' prefix, causing the retry logic to treat transient network errors (ECONNRESET, ETIMEDOUT) as non-retryable extraction errors. Remove the catch wrapper so errors propagate as-is, allowing the retry loop to handle network failures correctly.
With the streaming pipeline, errors propagate as native Error objects rather than RepomixError, so the isExtractionError check was always false. Retrying extraction errors is acceptable since the retry loop is bounded to 3 attempts.
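The bounded retry behavior described above can be sketched as a small helper. `withRetry` is a hypothetical name; only the 3-attempt bound comes from the PR:

```javascript
// Sketch of a bounded retry loop: errors propagate as-is (no wrapping,
// so transient network errors stay recognizable) and the loop is capped
// at a fixed attempt count.
async function withRetry(task, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err; // keep the original error; retry until the cap
    }
  }
  throw lastError;
}

// Simulated flaky download: fails twice (like ECONNRESET), then succeeds.
let calls = 0;
const result = await withRetry(async () => {
  calls++;
  if (calls < 3) throw new Error('ECONNRESET');
  return 'extracted';
});
```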
Code Review — Replace ZIP archive with streaming tar.gz extraction (final review)

Note: Three prior Claude reviews exist on this PR. This review provides only NEW incremental feedback on the current state of the code.

Overall Assessment

Approve — The code is clean, well-tested, and the streaming approach is architecturally sound. One minor item from prior reviews remains unaddressed.

Still Open from Prior Review
| Aspect | Status |
|---|---|
| Prior feedback addressed | getArchiveFilename removed ✅, processConcurrency.ts out of scope (corrected) ✅ |
| Prior feedback still open | extractStream.destroy() missing — minor |
| New findings | Extraction error retry behavior changed — acceptable |
| Code quality | Clean, well-structured |
| Test coverage | Adequate for the new streaming approach |
No blocking issues.
Reviewed by Claude
The previous ZIP-based archive download used fflate's in-memory extraction, which failed on large repositories (e.g. facebook/react) due to memory constraints and ZIP64 limitations. This PR switches to tar.gz format with Node.js built-in `zlib` + the `tar` package, enabling a full streaming pipeline.

Key changes

- Replace `fflate` with the `tar` package — removes `fflate` from the main `package.json` (the website retains its own dependency)
- Change archive URLs from `.zip` to `.tar.gz` — GitHub supports both formats
- `tar.extract({ strip: 1 })` handles prefix removal and path traversal protection built-in
- Remove `getArchiveFilename` — streaming extraction doesn't use temp files, so filename generation is unnecessary
- Remove the `worker_threads` workaround — the hang was caused by `fileCollect` workers, which have been migrated to a promise pool in perf(core): Optimize file collection with UTF-8 fast path and promise pool #1155, making this workaround unnecessary

Before vs After
Checklist

- `npm run test`
- `npm run lint`
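For illustration, the `.zip` to `.tar.gz` URL switch described in the key changes amounts to changing the extension on GitHub's `/archive/` endpoint, which serves both formats for the same ref. The helper name below is hypothetical:

```javascript
// Sketch: build a GitHub archive URL for a given ref. GitHub serves
// both zip and tar.gz from the same /archive/ path.
function buildArchiveUrl(owner, repo, ref, format = 'tar.gz') {
  return `https://github.com/${owner}/${repo}/archive/${ref}.${format}`;
}

const url = buildArchiveUrl('facebook', 'react', 'main');
```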