@egibs (Member) commented Jan 25, 2025

We can speed up the extraction of large packages like Spark and Trino by extracting their archives concurrently, since both contain many .jar files. Both .jar and .zip archives can be extracted in parallel rather than sequentially, which helps a lot here because these packages, when fully extracted, amount to roughly 1e5 files.
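To illustrate the idea (this is not the code from this PR), here is a minimal sketch of fanning zip entries out to a fixed number of worker goroutines using only the standard library. The names `extractFile`, `extractZip`, and `extractSample`, the worker count, and the file modes are all assumptions made for the example. It relies on the documented behavior that multiple files of a `zip.Reader` may be read concurrently, and uses `filepath.IsLocal` (Go 1.20+) to reject entries that would escape the destination.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// extractFile writes one archive entry beneath destDir (hypothetical helper).
func extractFile(f *zip.File, destDir string) error {
	// Reject absolute paths and entries containing ".." that escape destDir.
	if !filepath.IsLocal(f.Name) {
		return fmt.Errorf("illegal entry path: %q", f.Name)
	}
	target := filepath.Join(destDir, f.Name)
	if f.FileInfo().IsDir() {
		return os.MkdirAll(target, 0o755)
	}
	if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
		return err
	}
	src, err := f.Open() // entries of one zip.Reader may be read concurrently
	if err != nil {
		return err
	}
	defer src.Close()
	dst, err := os.Create(target)
	if err != nil {
		return err
	}
	defer dst.Close()
	_, err = io.Copy(dst, src)
	return err
}

// extractZip distributes entries across workers and returns the first error.
func extractZip(zr *zip.Reader, destDir string, workers int) error {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan *zip.File)
	var wg sync.WaitGroup
	var once sync.Once
	var firstErr error
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range jobs {
				if err := extractFile(f, destDir); err != nil {
					once.Do(func() { firstErr = err })
				}
			}
		}()
	}
	for _, f := range zr.File {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
	return firstErr
}

// extractSample builds an in-memory zip with n files, extracts it into a
// temporary directory, and reports how many files landed on disk.
func extractSample(n, workers int) (int, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for i := 0; i < n; i++ {
		w, err := zw.Create(fmt.Sprintf("lib/file-%d.txt", i))
		if err != nil {
			return 0, err
		}
		fmt.Fprintf(w, "payload %d", i)
	}
	if err := zw.Close(); err != nil {
		return 0, err
	}
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return 0, err
	}
	dir, err := os.MkdirTemp("", "extract-*")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	if err := extractZip(zr, dir, workers); err != nil {
		return 0, err
	}
	entries, err := os.ReadDir(filepath.Join(dir, "lib"))
	return len(entries), err
}

func main() {
	fmt.Println(extractSample(100, 8))
}
```

A bounded worker pool is used here rather than one goroutine per entry so that the number of open file descriptors stays predictable with archives of this size.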

Additionally, we can improve overall memory usage by using a buffer pool and io.CopyBuffer across all extraction methods.
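As a rough sketch of that pattern (not the PR's actual implementation), a `sync.Pool` can hand out reusable byte slices that are passed to `io.CopyBuffer`. The names `bufPool`, `pooledCopy`, and `copyString`, and the 32 KiB buffer size, are assumptions for illustration. One caveat worth knowing: `io.CopyBuffer` skips the supplied buffer when the source implements `io.WriterTo` or the destination implements `io.ReaderFrom`, so the demo wraps both ends to make sure the pooled buffer is actually used.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// bufferSize is an assumed value; 32 KiB matches io.Copy's default buffer.
const bufferSize = 32 * 1024

// bufPool hands out reusable buffers. Pointers to slices are stored so that
// returning a buffer to the pool does not allocate a new interface value.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, bufferSize)
		return &b
	},
}

// pooledCopy behaves like io.Copy but borrows its buffer from bufPool.
func pooledCopy(dst io.Writer, src io.Reader) (int64, error) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp)
	return io.CopyBuffer(dst, src, *bp)
}

// copyString round-trips s through pooledCopy. The anonymous struct wrappers
// hide the WriterTo/ReaderFrom fast paths so the pooled buffer is exercised.
func copyString(s string) (string, int64, error) {
	var out bytes.Buffer
	src := struct{ io.Reader }{strings.NewReader(s)}
	dst := struct{ io.Writer }{&out}
	n, err := pooledCopy(dst, src)
	return out.String(), n, err
}

func main() {
	out, n, err := copyString("hello, pool")
	fmt.Println(out, n, err)
}
```

With many small files extracted concurrently, reusing buffers this way avoids allocating a fresh copy buffer for every entry.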

Finally, I fixed .xz extraction, which was not using a limit reader; this slipped through previous optimizations.
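For context, a limit reader caps how many decompressed bytes are read, which guards against archives that expand far beyond their compressed size. The sketch below shows the general technique with `io.LimitReader`. Since the standard library has no xz decoder, `compress/gzip` stands in for it here, and `copyLimited`, `decompressSample`, `errTooLarge`, and the limits used are all hypothetical names and values rather than anything taken from this PR.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// errTooLarge is a hypothetical sentinel for oversize decompressed data.
var errTooLarge = errors.New("decompressed data exceeds limit")

// copyLimited copies at most limit bytes from src to dst. It reads one byte
// past the limit so that oversize input is detected instead of being silently
// truncated; in that case dst has received limit+1 bytes.
func copyLimited(dst io.Writer, src io.Reader, limit int64) (int64, error) {
	n, err := io.Copy(dst, io.LimitReader(src, limit+1))
	if err != nil {
		return n, err
	}
	if n > limit {
		return n, errTooLarge
	}
	return n, nil
}

// decompressSample gzip-compresses size zero bytes in memory, then
// decompresses them through copyLimited. gzip is only a stand-in for xz.
func decompressSample(size int, limit int64) (int64, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	if _, err := zw.Write(make([]byte, size)); err != nil {
		return 0, err
	}
	if err := zw.Close(); err != nil {
		return 0, err
	}
	zr, err := gzip.NewReader(&compressed)
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	return copyLimited(io.Discard, zr, limit)
}

func main() {
	fmt.Println(decompressSample(100, 1024))  // within the limit
	fmt.Println(decompressSample(5000, 1024)) // exceeds the limit
}
```

The same wrapper applies to any decompressing `io.Reader`, so an xz reader from a third-party package could be passed in place of the gzip one.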

@egibs egibs requested a review from stevebeattie January 25, 2025 18:51
@egibs egibs changed the title from "Extract .jar and .zip files conncurrently, use buffer for all io.Copy operations" to "Extract .jar and .zip files concurrently, use buffer for all io.Copy operations" Jan 25, 2025
@stevebeattie (Member) left a comment

LGTM, and saw a slight performance increase in limited testing. Thanks!

@stevebeattie stevebeattie merged commit e7d91da into chainguard-dev:main Jan 27, 2025
9 checks passed
@egibs egibs deleted the zip-improvements branch January 28, 2025 00:50