
Retry when downloading/extraction of Node.js tar fails #784

Closed
trivikr opened this issue Sep 20, 2023 · 15 comments

@trivikr

trivikr commented Sep 20, 2023

Is your feature request related to a problem? Please describe.

Sometimes downloading the Node.js artifact fails because of a network issue.

In our release CI, we noticed it happening twice on Sep 20th.

First occurrence

       fetch : https://nodejs.org/dist/v14.21.3/node-v14.21.3-linux-x64.tar.xz
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
xz: (stdin): Unexpected end of input
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
   Error: failed to download archive for 14.21.3

The second occurrence, on retry, was similar but for Node.js v18.18.0:

       fetch : https://nodejs.org/dist/v18.18.0/node-v18.18.0-linux-x64.tar.xz

Describe the solution you'd like

When fetching the tarball for a Node.js version fails, could n add a retry with some delay?

Describe alternatives you've considered

for i in {1..5}; do n 18 && break || sleep 15; done

Additional context

This issue was discussed in the past in #287, but was closed as the network error didn't appear again for the user.

@trivikr trivikr changed the title Retry when installing Node.js tar fails. Retry when downloading of Node.js tar fails Sep 20, 2023
@trivikr trivikr changed the title Retry when downloading of Node.js tar fails Retry when downloading/extraction of Node.js tar fails Sep 20, 2023
@trivikr
Author

trivikr commented Sep 20, 2023

The source code that attempts to download the archive:

n/bin/n

Lines 799 to 804 in 02a8fe0

log fetch "$url"
do_get "${url}" | tar "$tarflag" --strip-components=1 --no-same-owner -f -
pipe_results=( "${PIPESTATUS[@]}" )
if [[ "${pipe_results[0]}" -ne 0 ]]; then
  abort "failed to download archive for $version"
fi
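
For context on why that PIPESTATUS check is needed: in bash, the exit status of a pipeline like do_get | tar is tar's, so a failed or truncated download is only visible through PIPESTATUS. A minimal standalone sketch of the same pattern (the URL, tar flags, and destination are illustrative, not n's actual values):

  url="https://nodejs.org/dist/v18.18.0/node-v18.18.0-linux-x64.tar.xz"
  dest="$(mktemp -d)"

  # Stream the download straight into tar, as n does.
  curl --fail --location --silent "$url" \
    | tar -Jx --strip-components=1 --no-same-owner -C "$dest" -f -

  # $? here would be tar's status; PIPESTATUS[0] holds curl's status, which
  # is what actually signals a failed or truncated download.
  pipe_results=( "${PIPESTATUS[@]}" )
  if [[ "${pipe_results[0]}" -ne 0 ]]; then
    echo "failed to download archive" >&2
    exit 1
  fi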

@shadowspawn
Collaborator

Related to network problems, there have been some timeout suggestions: #544 #771

@trivikr
Author

trivikr commented Sep 20, 2023

Similar request on nvm nvm-sh/nvm#3074

@shadowspawn
Collaborator

I currently don't think a retry is appropriate in the n code. There are lots of things that can go wrong, including systematic problems that need fixing before an attempt can work. I prefer failing fast and leaving it up to the caller to decide how to proceed.

Happy to leave the issue open for a month to see if there are other comments.

@lstkz

lstkz commented Sep 26, 2023

We are using AWS CodeBuild, and we see this error often. I think ~ 5% of builds fail because of it.

[Container] 2023/09/26 06:35:38 Running command n 18.14.2
installing : node-v18.14.2
mkdir : /usr/local/n/versions/node/18.14.2
fetch : https://nodejs.org/dist/v18.14.2/node-v18.14.2-linux-x64.tar.xz
curl: (18) transfer closed with 9476756 bytes remaining to read
xz: (stdin): Unexpected end of input
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Error: failed to download archive for 18.14.2

Maybe there is something wrong with the Node.js CDN, and there are more stable mirrors?
Adding retry logic (behind an extra flag) would be useful.

@trivikr
Author

trivikr commented Sep 26, 2023

I think ~ 5% of builds fail because of it.

We also noticed the same when using n on AWS CodeBuild.

For now, we've added the following retry loop as a workaround:

for i in {1..5}; do n 18 && break || sleep 15; done

It would be helpful if n retried the download/extraction of Node.js in case of failure.
The caller is likely going to retry the installation in these cases anyway, since the issue is with download/extraction.

There are lots of things that can go wrong, including systematic problems that need fixing before an attempt can work

How about adding retries for specific failures based on reports?
The reports here mention curl errors 18 and 92.
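
To illustrate that idea (purely a sketch, not n's code; download_node and fetch_with_retry are hypothetical names): with the download running as its own command so that its exit status is visible, the retry could be limited to the curl exit codes seen in this thread, 18 (partial file) and 92 (HTTP/2 stream error).

  # Hypothetical helper: fetch the tarball to a file so curl's exit code is
  # directly observable rather than hidden behind a pipe.
  download_node() {
    curl --fail --location --silent --output "$2" "$1"
  }

  # Retry only the failure modes reported here: 18 (partial file) and
  # 92 (HTTP/2 stream error). Anything else fails fast.
  fetch_with_retry() {
    local url="$1" out="$2" attempt status
    for attempt in 1 2 3; do
      download_node "$url" "$out"
      status=$?
      [[ "$status" -eq 0 ]] && return 0
      if [[ "$status" -ne 18 && "$status" -ne 92 ]]; then
        return "$status"
      fi
      (( attempt < 3 )) && sleep 15
    done
    return "$status"
  }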

@shadowspawn
Collaborator

Interesting that there is a common factor of AWS CodeBuild in the first two reports. And n is preinstalled (1 2), nice.

I see that you can specify the node version as part of the AWS CodeBuild runtime setup, if that fits into your workflow.

@darkalor

I've started seeing this exact error on our CI jobs over the last couple of weeks. We've been using the same deployment for years and I've never seen that before. Curious if something changed on node's CDN 🤔

A quick Google search suggests this might be related to conflicting HTTP/1 vs HTTP/2 settings on the server: https://stackoverflow.com/a/68591734
It doesn't seem like there's currently a way to pass a setting like that through n. Perhaps that could be added (a sketch follows below)? I'm not opposed to the idea of a retry setting either.
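
For reference, the client-side curl option in question is --http1.1, which forces the transfer onto HTTP/1.1 and so avoids the HTTP/2 framing errors (curl exit code 92) entirely. Today it would have to be used outside of n, e.g. (URL and filename are illustrative):

  # Illustrative only: fetch the archive over HTTP/1.1 instead of HTTP/2.
  curl --fail --location --http1.1 \
    --output node-v18.18.0-linux-x64.tar.xz \
    https://nodejs.org/dist/v18.18.0/node-v18.18.0-linux-x64.tar.xz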

@shadowspawn
Collaborator

Ah, a known problem due to CDN cache being purged multiple times per day:

@elibarzilay
Contributor

Looking at the two rejected PRs above (#544, #771), rejecting them seems like the right choice, since implementing retry logic by hand in the n code is not a good idea.

But curl has a built-in --retry argument that works nicely and even does exponential backoff. That eliminates any complexity in the actual code (just adding something like --retry 5 to $CURL_OPTIONS does the trick).

Two points to support this:

  • wget has an unfair advantage in that it defaults to retrying. It would make sense to make the curl usage more similar.
  • Using a config file is an option, but there's already precedent in the code for adding curl arguments (--insecure), which could likewise have been rejected in favor of the user's own config file.

BTW, another problem with a global ~/.curlrc config file is that people are usually unaware of it and instead do some silly shell looping as above (and hopefully it's clear why that naive loop is a bad idea).

Oh, and regardless of Node's web changes, we have intermittent issues from time to time due to build hosts running inside a VPN that can be unreliable. So this problem exists either way.
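
A rough sketch of what that suggestion could look like (this is not n's actual do_get; the CURL_OPTIONS plumbing is assumed, only the curl flags themselves are real):

  # Assumed shape of the download helper with curl's built-in retry added.
  # --retry 5 retries transient errors, doubling the wait between attempts.
  CURL_OPTIONS="--retry 5"

  do_get() {
    # CURL_OPTIONS is intentionally left unquoted so it word-splits into flags.
    curl --fail --location --silent $CURL_OPTIONS "$1"
  }

  # Streams the archive to stdout, as in the current pipeline:
  #   do_get "$url" | tar "$tarflag" --strip-components=1 --no-same-owner -f -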

@shadowspawn
Collaborator

I hadn't noticed curl retry before. Interesting.

Using --retry would require a refactor to save the download to a file, as curl can't restart/retry data that has already gone down a pipe.

The --retry covers transient errors. Not sure whether the failures in this issue are transient as such.

  --retry <num>
         If a transient error is returned when curl tries to perform a transfer, it retries this number of times
         before giving up. Setting the number to 0 makes curl do no retries (which is the default). Transient error
         means either: a timeout, an FTP 4xx response code or an HTTP 408, 429, 500, 502, 503 or 504 response code.
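
For reference, a standalone file-based download with --retry would look something like this (URL and filename are illustrative):

  # With --retry, the output goes to a file instead of a pipe, so a failed
  # attempt does not leave tar reading a truncated stream; tar only runs
  # once the download has ultimately succeeded.
  curl --fail --location --silent --retry 5 \
    --output node.tar.xz \
    https://nodejs.org/dist/v18.18.0/node-v18.18.0-linux-x64.tar.xz \
    && tar -Jxf node.tar.xz --strip-components=1 --no-same-owner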

@elibarzilay
Contributor

  • Re errors: I don't remember how often, but I definitely saw transient errors. There are options like --retry-connrefused and even --retry-all-errors -- but in the average case I'd definitely not want those. (And using them can make the whole thing equivalent to the simple bash retrying loop...)

  • Re refactoring the code to use a file: I don't think it's important. Even if it re-fetches the file from the start, the point is not to save a few bytes by resuming with a ranged request, but just to make it work.

    • In interactive use there won't be any difference, since if it fails I'd obviously run it again anyway.
    • And in a CI context, refetching the whole thing is still orders of magnitude cheaper than a failing job, which would be restarted anyway (and therefore refetch the whole thing anyway).

But since it was interesting to see whether curl can do that (resume a retried transfer partway through), I searched a bit. I didn't find any conclusive answers, but:

  • An output file is not necessarily needed: I found this PR which (IIUC) resumes after the byte count that already poured out. It was abandoned, but they have a todo item for it. That's just an interesting comment.
  • Unless it does the same, wget would benefit from the same kind of refactoring to use a file. (Searching for wget is more difficult since its users tend to be less technical, so there are too many random pages that come up...)

So, keeping in mind my opinion above, using a file might be good anyway. And doing that shouldn't be too hard:

  local tmp="/some/tmp/file.tgz"
  rm -f "$tmp"
  { do_get_file "${url}" "$tmp" && tar "$tarflag" --strip-components=1 --no-same-owner -f "$tmp"; } \
    || { rm -f "$tmp"; abort "blah blah"; }
  rm -f "$tmp"

And this might simplify more code since stdout is now used as usual. (Handling output is usually what makes me switch to using a temp file with curl...)

@shadowspawn
Collaborator

The Node.js caching improvements might be fully functional now. Fingers crossed!

@shadowspawn
Collaborator

shadowspawn commented Nov 7, 2024

Cloudflare caching for /dist enabled just now: nodejs/build#3461 (comment)

@shadowspawn
Collaborator

Hopefully this is much improved by the new caching setup for Node.js.

Feel free to open a new issue if it comes up again, with new information and renewed interest.
