Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream directly to disk #175

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gabriel-samfira
Copy link

The current implementation saves everything to memory and extracts the zip files to memory, before copying to the filesystem. This can consume a huge amount of memory if artifacts are large. This change streams the zip file directly to disk and extracts it without loading the entire zip in memory first.

@dawidd6
Copy link
Owner

dawidd6 commented Aug 30, 2022

There are conflicts. Nevertheless this looks great, yet I'm not an expert with nodejs and it seems a bit scary for me, all those async stuff and so on 😄. I would like someone else to look at those changes and review them, as I don't feel so confident here and don't want to break people's existing workflows if something goes wrong.

The current implementation saves everything to memory and extracts
the zip files to memory, before copying to the filesystem. This can
consume a huge amount of memory if artifacts are large. This change
streams the zip file directly to disk and extracts it without loading
the entire zip in memory first.
@gabriel-samfira
Copy link
Author

Hi @dawidd6 ,

Rebased on latest main branch.

I am no Javascript expert either. Had to go through a bunch of docs to write this PR. It seems to work, and from my understanding of how the async/await and promises work, that bit of the code should be correct. However, I would like for someone that actually writes Javascript more often than me to review this as well :).

We ended up using my fork for now so there is no urgency, but would love to not have to maintain it separately.

Cheers,
Gabriel

Copy link

@joandvgv joandvgv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is a bit odd, I don't have a lot of context on the business logic involved here, but I did the review from a JS perspective which is what was required. My only concern here is error handling when making the GET request.

Comment on lines +202 to +204
if (!fs.existsSync(path)) {
fs.mkdirSync(path, { recursive: true })
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid using sync here, I would use a try catch using the fs.promises library like this:

try {
  await fs.promises.mkdir(path);
} catch(e) {
  // if error is already existing path, do nothing, throw it if it's something else. 
}

Not a big deal though, the code was already doing this before

main.js Outdated Show resolved Hide resolved
Comment on lines +235 to +238
file.on("error", () => {
core.info(`error saving file: ${err}`);
resolve()
})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why aren't we rejecting the promise here?

}

core.startGroup(`==> Extracting: ${artifact.name}.zip`)
yauzl.open(saveTo, {lazyEntries: true}, function(err, zipfile) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we following any prettier / eslint rules here? The indentation and spacing seems odd here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is I wasn't. I learned just enough Javascript to write this PR 😄.

@gabriel-samfira
Copy link
Author

This PR is a bit odd, I don't have a lot of context on the business logic involved here, but I did the review from a JS perspective which is what was required. My only concern here is error handling when making the GET request.

Thanks for the review! I had completely forgotten about this PR. I will have a closer look as soon as I can.

The goal is to stream the files to disk instead of saving them to memory, then extract them directly to disk in chunks.

We wanted to use this in a project that required the download of large (multiple GB) files. This lead to the github runner running out of memory and the OOM killer would kick in and start killing processes. The jobs would get canceled. It was a bit of a head scratcher, because it wasn't immediately obvious why the jobs were failing. The only message in the action was "Canceled" (or something similar). For small artifacts it's not a problem. When downloading/unarchiving virtual machine disk images, it becomes an issue.

Co-authored-by: Joan Gil <[email protected]>
Copy link
Contributor

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extraction logic LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants