
Download slow, breaking CI #2263

Closed
johnpangalos opened this issue Apr 6, 2020 · 53 comments

@johnpangalos

Hello,

I noticed that the download speed of Node.js has become rather slow, which breaks our Azure Pipelines task. Basically it times out and our build doesn't continue.

Any help with this would be appreciated and let me know if you need anything from me, thanks!

Azure task

@aviatrix

aviatrix commented Apr 6, 2020

What I've seen is that nodejs.org itself has been erroring out with HTTP 500 for the past 30-40 minutes; it now seems to be up again.

@johnpangalos
Author

Downloads still seem to be slow unfortunately, even though the site is up.

@adamayoung

Also getting this in my Azure Pipeline when trying to install Node 13.x

@mihaimircea

Same error when trying to install Node 12.16.1 from ubuntu-latest on Azure Pipelines.

@johnpangalos
Author

johnpangalos commented Apr 6, 2020

I think the problem is that when the download takes too long, the Node Azure Pipelines task times out or is unable to use a half-downloaded file. Trying the workaround of just using the Node apt package (I also use Ubuntu).
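
A minimal sketch of that apt-based workaround on an Ubuntu agent (note that the distro package may ship an older Node.js than the version the pipeline normally installs):

sudo apt-get update
sudo apt-get install -y nodejs npm   # Ubuntu's packaged Node.js; nothing fetched from nodejs.org
node --version                       # check which version the distro provides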

@Moussto

Moussto commented Apr 6, 2020

It's being tracked here: nodejs/node#32683

@johnpangalos
Author

Ok, I'll close then.

@johnpangalos
Author

This is still technically a problem, and the more eyes on it the better.

@ErisDS

ErisDS commented Apr 6, 2020

This is causing a lot of problems with CI/build tools and in cases where these are charged for computation time, costing people a lot of money:

actions/setup-node#132

@joepie91

joepie91 commented Apr 6, 2020

This is causing a lot of problems with CI/build tools and in cases where these are charged for computation time, costing people a lot of money

This sort of comment is really not okay.

It's not the build team's problem nor responsibility if people haven't bothered to set up some local redundancy/caching and the resulting failures are costing them money, especially for a free service. Everyone's business/infrastructure continuity is their own responsibility.

I'm sure they have enough on their mind already, without additional pressure from "this is costing us money!" type comments.

@fishcharlie

@joepie91

This sort of comment is really not okay.

It's not the build team's problem nor responsibility if people haven't bothered to set up some local redundancy/caching and the resulting failures are costing them money, especially for a free service. Everyone's business/infrastructure continuity is their own responsibility.

I'm sure they have enough on their mind already, without additional pressure from "this is costing us money!" type comments.

This is fair. I do think it's extremely important to realize the increased frustration due to the size and impact of Node.js. The larger an open source project becomes, the more responsibility it has. I think it's fair to hold a project the size of Node.js to a higher standard than most open source projects.

It's nearly impossible in this day and age to have 0 dependencies and 100% redundancy.

The reality is that it is costing people money. It's also important to understand that yes, in an ideal world, redundancy and caching should exist. But a lot of systems don't have very good support for that. This is not an ideal world.

The Node.js team should take some responsibility for this, but it's unfair to place all the blame or responsibility on them. I think most of the frustration is due to the lack of updates from the Node.js team. And I believe that frustration is warranted.

For me personally, all I ask is more transparency and updates from the Node.js team.

@joepie91

joepie91 commented Apr 6, 2020

The larger an open source project becomes, the more responsibility it has. I think it's fair to hold a project the size of Node.js to a higher standard than most open source projects.

To a degree, yes. But hardware breaks, servers break, and especially when something isn't a service that people are actually paying for and have an availability agreement for, I don't think it's reasonable to expect service equivalent to a commercial offering, as people seem to do here (and in other threads about the issue).

The reality is that it is costing people money. It's also important to understand that yes, in an ideal world, redundancy and caching should exist. But a lot of systems don't have very good support for that. This is not an ideal world.

Sure. But it's one's own decision to actually use such imperfect systems without petitioning the vendors to fix them, and the costs for that decision/tradeoff should be shouldered by those picking the systems, not by some random third party that had nothing to do with the decision (Node.js core, in this particular case).

Actually, it's quite bizarre to me that people's build processes are apparently downloading the same thing over and over again without bothering to cache things locally in the first place, needlessly costing the nodejs.org operators money for the bandwidth.

I think most of the frustration is due to the lack of updates from the Node.js team.

An update was provided here. Presumably the status has not changed since. I'm not sure what other updates people are expecting from something that is, again, a free service.

@addaleax
Member

addaleax commented Apr 6, 2020

Yeah. I think this is the kind of thing that deserves a post-mortem with concrete steps to stop this from happening again, but for the moment, if there’s anything happening then you’ll see it here.

I think most of the frustration is due to the lack of updates from the Node.js team.

An update was provided here. Presumably the status has not changed since.

💯

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

A status page would likely help immensely with these communication issues and provide a canonical source of truth for updates about an incident: #2265

@mhdawson
Member

mhdawson commented Apr 6, 2020

From top it does not look particularly bad, although if Cloudflare were caching most of the downloads I'm not sure I'd expect a load average of 2:

top - 20:23:11 up 164 days, 17:49,  1 user,  load average: 2.22, 1.87, 1.80
Tasks: 268 total,   3 running, 265 sleeping,   0 stopped,   0 zombie
%Cpu0  : 16.6 us, 10.3 sy,  0.0 ni, 71.5 id,  1.3 wa,  0.0 hi,  0.0 si,  0.3 st
%Cpu1  : 10.3 us, 12.6 sy,  0.0 ni, 73.5 id,  1.6 wa,  0.0 hi,  1.6 si,  0.3 st
%Cpu2  : 14.0 us, 14.0 sy,  0.0 ni, 70.0 id,  1.3 wa,  0.0 hi,  0.3 si,  0.3 st
%Cpu3  : 13.6 us,  9.9 sy,  0.0 ni, 63.9 id,  1.3 wa,  0.0 hi, 10.9 si,  0.3 st
%Cpu4  :  0.0 us,  0.3 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu5  :  0.6 us,  0.6 sy,  0.0 ni, 30.6 id,  0.0 wa,  0.0 hi, 67.4 si,  0.6 st
KiB Mem : 16432060 total,  1626516 free,  1394780 used, 13410764 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14181820 avail Mem

@tholu

tholu commented Apr 6, 2020

Actually, it's quite bizarre to me that people's build processes are apparently downloading the same thing over and over again without bothering to cache things locally in the first place, needlessly costing the nodejs.org operators money for the bandwidth.

I'm sure lots of people in this thread would take the opportunity to set up a local cache/proxy for this (similar to Nexus for Maven packages). Ideally there would be something like a local Squid proxy cache instance, as in most build pipelines the URLs to the Node.js binaries are often not easily changeable.

Major problem: proxying and caching requests to e.g. https://nodejs.org/dist/v13.6.0/node-v13.6.0-linux-x64.tar.gz is not trivial, as it's served via HTTPS. @joepie91 Do you have some pointers on how to set this up properly? Thank you!

@mhdawson
Member

mhdawson commented Apr 6, 2020

I just downloaded 4-5 binaries without problem. Is there something more specific than downloading from nodejs.org/en/download to trigger this?

@targos
Member

targos commented Apr 6, 2020

I can trigger this by downloading from that address.

@joepie91

joepie91 commented Apr 6, 2020

@joepie91 Do you have some pointers on how to set this up properly?

I suspect that in most cases, people will be using proprietary hosted CI services, in which case the first place to look would be that specific service's capabilities for caching sources. I can't advise here, unfortunately.

For those who control their own infrastructure, I think the easiest approach would be to reverse-proxy to nodejs.org, exposing that on one's own subdomain (think nodejs.org.sources.example.com or such), and then just using your own subdomain in the source file URLs rather than nodejs.org. That sidesteps all the TLS issues.

Optionally basic auth could be used, if the proxying server is on a public network, to avoid having half the world proxy through it.
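
A minimal nginx sketch of that reverse-proxy-with-cache approach (the subdomain, cache path, and TLS details are placeholders, not anything from this thread):

# Hypothetical caching reverse proxy in front of nodejs.org, served on your own subdomain.
proxy_cache_path /var/cache/nginx/nodejs levels=1:2 keys_zone=nodejs_dist:10m
                 max_size=20g inactive=30d use_temp_path=off;

server {
    listen 443 ssl;
    server_name nodejs.org.sources.example.com;   # placeholder subdomain
    # ssl_certificate / ssl_certificate_key for your own domain go here.

    # Optional: basic auth so the proxy isn't open to the whole world.
    # auth_basic "restricted";
    # auth_basic_user_file /etc/nginx/.htpasswd;

    location /dist/ {
        proxy_pass https://nodejs.org;
        proxy_set_header Host nodejs.org;
        proxy_ssl_server_name on;                   # send SNI to the upstream

        proxy_cache nodejs_dist;
        proxy_ignore_headers Cache-Control Expires; # use the local cache policy below
        proxy_cache_valid 200 30d;                  # released tarballs never change
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    }
}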

For caching npm, which is another likely failure point, there's off-the-shelf software such as Verdaccio: https://verdaccio.org/

@Trott
Member

Trott commented Apr 6, 2020

I just downloaded 4-5 binaries without problem. Is there something more specific than downloading from nodejs.org/en/download to trigger this?

Maybe look at any Cloudflare logs you might have access to?

@mhdawson
Member

mhdawson commented Apr 6, 2020

The rate of new entries in /var/log/nginx/access.log does not seem too high, maybe 10-15 per minute, so most of the traffic must still be being served by Cloudflare.

@Trott
Member

Trott commented Apr 6, 2020

Or look for 500 in the server logs? I'm being a bit simplistic here, but I imagine we shouldn't get 500s. It's a big ol' static site, as far as I know.

@mmarchini
Contributor

Aren't 500s stored in /var/log/nginx/error.log?

@mkonikov

mkonikov commented Apr 6, 2020

In our CI we get this often:

curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
xz: (stdin): Unexpected end of input
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
cp: cannot stat ‘/usr/local/n/versions/node/10.16.0/lib’: No such file or directory
cp: cannot stat ‘/usr/local/n/versions/node/10.16.0/include’: No such file or directory

or

curl: (22) The requested URL returned error: 500 

  Error: download preflight failed for '10.16.0' (https://nodejs.org/dist/v10.16.0/node-v10.16.0-linux-x64.tar.xz)
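
Not a fix for the outage itself, but as a CI-side stopgap, failing fast on HTTP errors and retrying can avoid the truncated-archive failure mode above (the URL is just the example from this comment):

curl -fSL --retry 5 --retry-delay 10 --retry-max-time 300 \
  -o node.tar.xz https://nodejs.org/dist/v10.16.0/node-v10.16.0-linux-x64.tar.xz \
  && tar -xJf node.tar.xz   # extract only if the download completed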

@targos
Member

targos commented Apr 6, 2020

@mhdawson have a look at /var/log/nginx/nodejs/nodejs.org-access.log and /var/log/nginx/nodejs/nodejs.org-error.log
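
A quick way to check those on the box (the awk field position assumes the standard combined log format, where the status code is the ninth field):

# count 500 responses among the most recent access-log entries
tail -n 100000 /var/log/nginx/nodejs/nodejs.org-access.log | awk '$9 == 500' | wc -l

# error-level events (upstream or filesystem problems) land in the error log instead
tail -n 50 /var/log/nginx/nodejs/nodejs.org-error.log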

@mhdawson
Member

mhdawson commented Apr 6, 2020

Going to see if there is anything in the cloudflare logs

@sam-github
Contributor

HTTP/2 stream 0 was not closed cleanly

^--- that's interesting, do we always serve over HTTP/2? Is it possible something changed wrt. that?

@sam-github pinned this issue Apr 6, 2020
@mhdawson
Member

mhdawson commented Apr 6, 2020

One thing on Cloudflare that stands out so far is that for the last month only 12% of the traffic was uncached, and the stat for the last week is similar. Uncached for the last 6 hours is 48%.

@jgehrcke

jgehrcke commented Apr 6, 2020

I don't think that Cloudflare in front of a single box is the right solution for serving these dist files when robustness is the goal.

Even small hiccups in the backend are exposed to end users.

The single machine that we talk about here is in the hot path more often than it should be. For Cloudflare to fall back to serving from cache while the backend is not available, very specific conditions need to be met: https://support.cloudflare.com/hc/en-us/articles/200168436-Understanding-Cloudflare-Always-Online -- specifically, transport-layer problems do not trigger "Always Online", and HTTP responses with status code 500 also don't. These problems are seen by "end users" (well, those HTTP clients trying to download).

I myself have Cloudflare in front of a single box and was often surprised how this setup does not magically heal transient problems in the backend.

Let's strive towards a solution where the backend is never in the hot path.

Typically, dist files like the ones we talk about here should be served by a real CDN like fastly (which sponsors the hosting for Python), cloudfront, Google Cloud CDN, etc.

@mhdawson
Member

mhdawson commented Apr 6, 2020

Are people still seeing the issue? I have yet to have a download be slow or fail.

@mkonikov

mkonikov commented Apr 6, 2020

In our CI, our build at 3:52:30 PM ET failed. Running another one now and will advise if we run into issues.

@targos
Member

targos commented Apr 6, 2020

@mhdawson It seems back to normal. The load on the server is much lower than before, SSH stopped lagging, and kswapd0 / ksoftirqd/0 no longer use the CPU.

@tholu

tholu commented Apr 6, 2020

@mhdawson Seems fast again for me now, just as I finished setting up simple caching of the needed artifacts. How did you fix it?

@mhdawson
Member

mhdawson commented Apr 6, 2020

Interestingly, it also seems like the cached bandwidth on Cloudflare, as a ratio of total traffic, is going back up. That started at around 4:30 PM EST today based on the graph. As of 5:15 PM EST the uncached share was down to 14%, versus 50% when we saw the bad performance. The uncached percentage started to go up around 7 PM EST on April 5th.

As @targos mentioned, the load on our machines also seems lower, which would make sense if more of the load is being served from the cache.

@tholu I wish I could say I fixed something but I've not changed anything. Was just ramping up looking at the logs and trying to figure out what was going on :).

There were some outages reported for cloudflare today, but nothing that seems related or with times that match the period of time we saw the higher uncached bandwidth.

Apr 6, 2020
Issues enabling DNSSEC
Resolved - Cloudflare has resolved the issue and services have resumed normal operation.
Apr 6, 13:44 UTC
Monitoring - Cloudflare has implemented a fix for this issue and is currently monitoring the results.

We will update the status once the issue is resolved.
Apr 6, 13:35 UTC
Investigating - Cloudflare is investigating issues with enabling DNSSEC.

This does not affect existing DNSSEC configurations, which are working as expected.

We are working to understand the full impact and mitigate this problem. More updates to follow shortly.
Apr 6, 12:53 UTC
Performance Issues in Frankfurt, Germany
Resolved - This incident has been resolved.
Apr 6, 12:12 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Apr 6, 12:05 UTC
Investigating - We are investigating possible performance issues in Frankfurt, Germany (FRA)
Apr 6, 11:29 UTC

Our Cloudflare logs are being pushed to Google storage and I've not yet found out how to get those. If things are working better now, then looking at those can probably wait until @rvagg is online and can take a look or point me to instructions/docs on how they are stored.

Looking further back in the access logs on our machine I do see lots of 500 errors for downloads. Tailing the access log file now I don't see any.

My main guess is that for some reason less of the traffic was served from the cache between 7 PM EST on April 5th and today around 4:30 PM EST, when it started to go back to normal.

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

A suggestion for improving caching here would be to increase the age on the cache control header for /dist/*.

As I noted in Slack, it seems these are currently served with a max-age of four hours, which seems incredibly low considering these files will never change. Why not set something much higher, like a year?

I have no insight into the Cloudflare config, but it's also worth checking that caching for these files is enabled there. Maybe set up a page rule to ensure they have a long cache configured via Cloudflare directly as well?

@mhdawson
Member

mhdawson commented Apr 6, 2020

Submitted ticket 1864572 with Cloudflare to see if they are aware of anything that might have caused this, or alternatively to ask for suggestions as to what we should look for in our configs that might have caused it.

@mhdawson
Member

mhdawson commented Apr 6, 2020

I don't see a lot of config options for the caching; the one that seems to match 4 hours is the browser cache TTL. My initial thought is that a different value for that wouldn't necessarily have helped. In a lot of cases, if downloads are from CIs, those probably start with a fresh environment that won't have the cache anyway.

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

Updating that will change caching for the entire nodejs.org site, which might not be desirable. And yes, that's controlling the browser cache, which wouldn't have much impact on CI, I agree.

I believe configuring a page rule should be possible though, which can be set to only target /dist/ and allow for a custom cache configuration within Cloudflare, so that Cloudflare will cache and serve assets for longer before going back to the origin server on DO to get a new copy.

@jgehrcke

jgehrcke commented Apr 6, 2020

I don't see a lot of config options for the caching,

@mhdawson

  • In the "origin" server's nginx config, tweak the Cache-Control headers emitted for the dist/ files -- CF will honor that, and can be told to cache the dist/ files for time scales much longer than 4 hours (docs).
  • Be sure to add a stale-if-error header, because the world does not want a single box in the hot path for Node.js downloads (see https://support.cloudflare.com/hc/en-us/articles/200168436, at the bottom: "Allowing your origin web server to determine which content to cache for display if your origin web server is offline"). A minimal config sketch follows this list.
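
A sketch of what that could look like in the origin's nginx config (the values are illustrative assumptions, not the project's current settings):

location /dist/ {
    # Released artifacts never change: let Cloudflare and browsers cache them for a long
    # time, and keep serving stale copies if the origin errors out (stale-if-error, RFC 5861).
    add_header Cache-Control "public, max-age=31536000, immutable, stale-if-error=604800";
}

# Mutable metadata such as index.json / index.tab needs a much shorter policy
# (see the caveat about those files raised later in this thread).
location ~ ^/dist/index\.(json|tab)$ {
    add_header Cache-Control "public, max-age=300";
}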

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

For example, the page rule might be configured with:

Target: nodejs.org/dist/*
Cache level: Cache everything
Edge cache ttl: a month
Browser cache ttl: a year

@targos
Member

targos commented Apr 6, 2020

Be careful with these settings. There are files under dist that should not be cached too long.
At least index.json and index.tab

@rvagg
Member

rvagg commented Apr 6, 2020

Just diving into this now; here are the variables so far that I can't reconcile:

VM metrics for our main web server over the past 7 days to right now:

20200407_do_www_metrics

The most interesting part of that is the spike in bandwidth that takes us up toward what I think is the same threshold we were hitting with DO before we went to full caching with CF; in theory we shouldn't be seeing that anymore, since CF is supposed to take that load from us. Zooming out to 30 days shows this as an anomaly.

But here's CF in the past 24 hours:

20200407_cf_24h_www_metrics

Zoomed out to 7 days:

20200407_cf_7d_www_metrics

Looks entirely normal.

Plus we do have load balancing with CF and the CF load balancing logs (available through the dashboard @mhdawson) show zero events! So at no point does CF admit to having to switch between our primary+secondary.

So two puzzles:

  • Why is DO bandwidth spiking like this?
  • Why isn't CF reacting to unresponsiveness if there is in fact unresponsiveness on the backend?

I want to blame both of them, but that's kind of weird. For now at least it seems to have subsided.

One scenario where this might be explainable:

  • Someone has our backend address and is hitting it directly (DoS?)
  • It's not being hit quite enough to trigger CF LB thresholds

I just don't understand that second bit: if users are experiencing such pain, why wouldn't CF's LB algorithm be kicking in to deal with it?

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

If modifying the origin is an option, then setting a long max-age for dist, as well as including immutable & stale-if-error, should encourage Cloudflare to do much more aggressive caching of these assets.

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

Someone has our backend address and is hitting it directly (DoS?)

Cloudflare publishes their full IP ranges; if this hasn't been done already, why not configure the origin with a firewall that only lets those in?

@nschonni
Member

nschonni commented Apr 6, 2020

@rvagg
Member

rvagg commented Apr 7, 2020

Interestingly, it also seems like the cached bandwidth on Cloudflare, as a ratio of total traffic, is going back up. That started at around 4:30 PM EST today based on the graph. As of 5:15 PM EST the uncached share was down to 14%, versus 50% when we saw the bad performance. The uncached percentage started to go up around 7 PM EST on April 5th.

I couldn't see this in the graphs by the time I got to the Cloudflare dashboard -- probably because of the way it aggregates, so the 24-hour period I was looking at lagged @mhdawson's observation.

So, I've used the Cloudflare API to pull out as much data as I could. Unfortunately we have a limit on the number of days we can go back, so I can't see a weekly pattern (which is usually important for our download stats).

1-minute intervals of cachedBytes / bytes, plotted to show the spread, give a very interesting picture. Times are UTC:

20200407_cf_cached_percent_1m_samples

We have regular cache purges of nodejs.org content; it's a pretty blunt instrument that we use when we deploy new assets to the website. It's something we need to improve, and it explains some variation in cached %. You can see some dips in the graph for previous days that probably (we could check) line up with the deployment of nightly and v8-canary builds that happen daily and cause a purge. But the recovery is quick. The anomaly on this graph shows massive cache invalidation, huge variation across time, and no linear recovery pattern; it's just on, then off and back to normal on the right side of the graph.

... Therefore, I'm inclined to believe there was some hiccup in Cloudflare's caching process that caused this anomalous block. That, in turn, hammered our server with a pattern very similar to what we had last year, which finally forced us to fully front our servers with Cloudflare. But we didn't get any load balancing failover attempts -- perhaps our health checks aren't aggressive enough? Perhaps the difference is that a health check on /en/ is trivial because that content is in memory for nginx, while everyone who was trying to download a binary had to reach to disk, so the health check wasn't able to exercise the full I/O on the server and measure its real stress levels.

@rvagg
Member

rvagg commented Apr 7, 2020

Since this is no longer "active" I'll close this and we can move diagnosis discussion to #2264

@johnpangalos
Author

Thanks again for fixing this, really appreciate it!

@Trott unpinned this issue Apr 7, 2020