
Download slow, breaking CI #2263

Closed
johnpangalos opened this issue Apr 6, 2020 · 53 comments

@johnpangalos

Hello,

I noticed that the download speed of Node.js has become rather slow, which breaks our Azure Pipelines task. Basically it times out and our build doesn't continue.

Any help with this would be appreciated and let me know if you need anything from me, thanks!

Azure task

@aviatrix

aviatrix commented Apr 6, 2020

What I've seen is that nodejs.org itself has been erroring out with HTTP 500 for the past 30-40 minutes; it now seems to be up again.

@johnpangalos
Author

Downloads still seem to be slow unfortunately, even though the site is up.

@adamayoung

Also getting this in my Azure Pipeline when trying to install Node 13.x

@mihaimircea

Same error when trying to install Node 12.16.1 from ubuntu-latest on Azure Pipelines.

@johnpangalos
Author

johnpangalos commented Apr 6, 2020

I think the problem is that when the download takes too long, the Node Azure Pipelines task times out or is unable to use a half-downloaded file. Trying the workaround of just using the Node apt package (I also use Ubuntu).
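
A minimal sketch of that apt-based workaround on an Ubuntu agent (note that the distro package may ship an older Node.js than the version the pipeline normally installs):

sudo apt-get update
sudo apt-get install -y nodejs npm   # Ubuntu's packaged Node.js; nothing fetched from nodejs.org
node --version                       # check which version the distro provides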

@Moussto

Moussto commented Apr 6, 2020

It's being tracked here: nodejs/node#32683

@johnpangalos
Author

Ok, I'll close then.

@johnpangalos
Author

This is still technically a problem, and the more eyes on it the better.

@ErisDS

ErisDS commented Apr 6, 2020

This is causing a lot of problems with CI/build tools and in cases where these are charged for computation time, costing people a lot of money:

actions/setup-node#132

@joepie91

joepie91 commented Apr 6, 2020

This is causing a lot of problems with CI/build tools and in cases where these are charged for computation time, costing people a lot of money

This sort of comment is really not okay.

It's not the build team's problem nor responsibility if people haven't bothered to set up some local redundancy/caching and the resulting failures are costing them money, especially for a free service. Everyone's business/infrastructure continuity is their own responsibility.

I'm sure they have enough on their mind already, without additional pressure from "this is costing us money!" type comments.

@fishcharlie

@joepie91

This sort of comment is really not okay.

It's not the build team's problem nor responsibility if people haven't bothered to set up some local redundancy/caching and the resulting failures are costing them money, especially for a free service. Everyone's business/infrastructure continuity is their own responsibility.

I'm sure they have enough on their mind already, without additional pressure from "this is costing us money!" type comments.

This is fair. I do think it's extremely important to realize the increased frustration due to the size and impact of Node.js. The larger an open source project becomes, the more responsibility it has. I think it's fair to hold a project the size of Node.js to a higher standard than most open source projects.

It's nearly impossible in this day and age to have 0 dependencies and 100% redundancy.

The reality is that it is costing people money. It's also important to understand that yes, in an ideal world, redundancy and caching should exist. But a lot of systems don't have very good support for that. This is not an ideal world.

The Node.js team should take some responsibility for this, but it's unfair to place all the blame or responsibility on them. I think most of the frustration is due to the lack of updates from the Node.js team. And I believe that frustration is warranted.

For me personally, all I ask is more transparency and updates from the Node.js team.

@joepie91

joepie91 commented Apr 6, 2020

The larger an open source project becomes, the more responsibility it has. I think it's fair to hold a project the size of Node.js to a higher standard than most open source projects.

To a degree, yes. But hardware breaks, servers break, and especially when something isn't a service that people are actually paying for and have an availability agreement for, I don't think it's reasonable to expect service equivalent to a commercial offering, as people seem to do here (and in other threads about the issue).

The reality is that it is costing people money. It's also important to understand that yes, in an ideal world, redundancy and caching should exist. But a lot of systems don't have very good support for that. This is not an ideal world.

Sure. But it's one's own decision to actually use such imperfect systems without petitioning the vendors to fix them, and the costs for that decision/tradeoff should be shouldered by those picking the systems, not by some random third party that had nothing to do with the decision (Node.js core, in this particular case).

Actually, it's quite bizarre to me that people's build processes are apparently downloading the same thing over and over again without bothering to cache things locally in the first place, needlessly costing the nodejs.org operators money for the bandwidth.

I think most of the frustration is due to the lack of updates from the Node.js team.

An update was provided here. Presumably the status has not changed since. I'm not sure what other updates people are expecting from something that is, again, a free service.

@addaleax
Member

addaleax commented Apr 6, 2020

Yeah. I think this is the kind of thing that deserves a post-mortem with concrete steps to stop this from happening again, but for the moment, if there’s anything happening then you’ll see it here.

I think most of the frustration is due to the lack of updates from the Node.js team.

An update was provided here. Presumably the status has not changed since.

💯

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

A status page would likely help immensely with these communication issues and provide a canonical source of truth for updates about an incident: #2265

@mhdawson
Member

mhdawson commented Apr 6, 2020

From top it does not look particularly bad, although if Cloudflare were caching most of the downloads I'm not sure I'd expect a load average of 2:

top - 20:23:11 up 164 days, 17:49,  1 user,  load average: 2.22, 1.87, 1.80
Tasks: 268 total,   3 running, 265 sleeping,   0 stopped,   0 zombie
%Cpu0  : 16.6 us, 10.3 sy,  0.0 ni, 71.5 id,  1.3 wa,  0.0 hi,  0.0 si,  0.3 st
%Cpu1  : 10.3 us, 12.6 sy,  0.0 ni, 73.5 id,  1.6 wa,  0.0 hi,  1.6 si,  0.3 st
%Cpu2  : 14.0 us, 14.0 sy,  0.0 ni, 70.0 id,  1.3 wa,  0.0 hi,  0.3 si,  0.3 st
%Cpu3  : 13.6 us,  9.9 sy,  0.0 ni, 63.9 id,  1.3 wa,  0.0 hi, 10.9 si,  0.3 st
%Cpu4  :  0.0 us,  0.3 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu5  :  0.6 us,  0.6 sy,  0.0 ni, 30.6 id,  0.0 wa,  0.0 hi, 67.4 si,  0.6 st
KiB Mem : 16432060 total,  1626516 free,  1394780 used, 13410764 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14181820 avail Mem

@tholu

tholu commented Apr 6, 2020

Actually, it's quite bizarre to me that people's build processes are apparently downloading the same thing over and over again without bothering to cache things locally in the first place, needlessly costing the nodejs.org operators money for the bandwidth.

I'm sure lots of people in this thread would take the opportunity to set up a local cache/proxy for this (similar to Nexus for Maven packages). Ideally there would be something like a local Squid proxy cache instance, as in most build pipelines the URLs to the Node.js binaries are often not easily changeable.

Major problem: proxying and caching requests to e.g. https://nodejs.org/dist/v13.6.0/node-v13.6.0-linux-x64.tar.gz is not trivial, as it's served via HTTPS. @joepie91 Do you have some pointers on how to set this up properly? Thank you!

@mhdawson
Member

mhdawson commented Apr 6, 2020

I just downloaded 4-5 binaries without problem. Is there something more specific than downloading from nodejs.org/en/download to trigger this?

@targos
Member

targos commented Apr 6, 2020

I can trigger this by downloading from that address.

@joepie91

joepie91 commented Apr 6, 2020

@joepie91 Do you have some pointers on how to set this up properly?

I suspect that in most cases, people will be using proprietary hosted CI services, in which case the first place to look would be that specific service's capabilities for caching sources. I can't advise here, unfortunately.

For those who control their own infrastructure, I think the easiest approach would be to reverse-proxy to nodejs.org, exposing that on one's own subdomain (think nodejs.org.sources.example.com or such), and then just using your own subdomain in the source file URLs rather than nodejs.org. That sidesteps all the TLS issues.

Optionally basic auth could be used, if the proxying server is on a public network, to avoid having half the world proxy through it.
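
A minimal nginx sketch of that reverse-proxy-with-cache approach (the subdomain, cache path, and TLS details are placeholders, not anything from this thread):

# Hypothetical caching reverse proxy in front of nodejs.org, served on your own subdomain.
proxy_cache_path /var/cache/nginx/nodejs levels=1:2 keys_zone=nodejs_dist:10m
                 max_size=20g inactive=30d use_temp_path=off;

server {
    listen 443 ssl;
    server_name nodejs.org.sources.example.com;   # placeholder subdomain
    # ssl_certificate / ssl_certificate_key for your own domain go here.

    # Optional: basic auth so the proxy isn't open to the whole world.
    # auth_basic "restricted";
    # auth_basic_user_file /etc/nginx/.htpasswd;

    location /dist/ {
        proxy_pass https://nodejs.org;
        proxy_set_header Host nodejs.org;
        proxy_ssl_server_name on;                   # send SNI to the upstream

        proxy_cache nodejs_dist;
        proxy_ignore_headers Cache-Control Expires; # use the local cache policy below
        proxy_cache_valid 200 30d;                  # released tarballs never change
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    }
}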

For caching npm, which is another likely failure point, there's off-the-shelf software such as Verdaccio: https://verdaccio.org/

@Trott
Member

Trott commented Apr 6, 2020

I just downloaded 4-5 binaries without problem. Is there something more specific than downloading from nodejs.org/en/download to trigger this?

Maybe look at any Cloudflare logs you might have access to?

@mhdawson
Member

mhdawson commented Apr 6, 2020

The rate of new entries in /var/log/nginx/access.log does not seem too high, maybe 10-15 per minute, so most of the traffic must still be being served by Cloudflare.

@Trott
Member

Trott commented Apr 6, 2020

Or look for 500 in the server logs? I'm being a bit simplistic here, but I imagine we shouldn't get 500s. It's a big ol' static site, as far as I know.

@mmarchini
Contributor

Aren't 500s stored in /var/log/nginx/error.log?

@mkonikov

mkonikov commented Apr 6, 2020

In our CI we get this often:

curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
xz: (stdin): Unexpected end of input
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
cp: cannot stat ‘/usr/local/n/versions/node/10.16.0/lib’: No such file or directory
cp: cannot stat ‘/usr/local/n/versions/node/10.16.0/include’: No such file or directory

or

curl: (22) The requested URL returned error: 500 

  Error: download preflight failed for '10.16.0' (https://nodejs.org/dist/v10.16.0/node-v10.16.0-linux-x64.tar.xz)
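
Not a fix for the outage itself, but as a CI-side stopgap, failing fast on HTTP errors and retrying can avoid the truncated-archive failure mode above (the URL is just the example from this comment):

curl -fSL --retry 5 --retry-delay 10 --retry-max-time 300 \
  -o node.tar.xz https://nodejs.org/dist/v10.16.0/node-v10.16.0-linux-x64.tar.xz \
  && tar -xJf node.tar.xz   # extract only if the download completed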

@targos
Member

targos commented Apr 6, 2020

@mhdawson have a look at /var/log/nginx/nodejs/nodejs.org-access.log and /var/log/nginx/nodejs/nodejs.org-error.log
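
A quick way to check those on the box (the awk field position assumes the standard combined log format, where the status code is the ninth field):

# count 500 responses among the most recent access-log entries
tail -n 100000 /var/log/nginx/nodejs/nodejs.org-access.log | awk '$9 == 500' | wc -l

# error-level events (upstream or filesystem problems) land in the error log instead
tail -n 50 /var/log/nginx/nodejs/nodejs.org-error.log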

@mhdawson
Member

mhdawson commented Apr 6, 2020

Going to see if there is anything in the cloudflare logs

@sam-github
Contributor

HTTP/2 stream 0 was not closed cleanly

^--- that's interesting, do we always serve over HTTP/2? Is it possible something changed wrt. that?

@sam-github pinned this issue Apr 6, 2020
@mhdawson
Member

mhdawson commented Apr 6, 2020

One thing on Cloudflare that stands out so far is that for the last month only 12% of the traffic was uncached, and the stat for the last week is similar. Uncached for the last 6 hours is 48%.

@jgehrcke

jgehrcke commented Apr 6, 2020

I don't think that Cloudflare in front of a single box is the right solution for serving these dist files when robustness is the goal.

Even small hiccups in the backend are exposed to end users.

The single machine that we talk about here is in the hot path more often than it should be. For Cloudflare to fall back to serving from cache while the backend is not available, very specific conditions need to be met: https://support.cloudflare.com/hc/en-us/articles/200168436-Understanding-Cloudflare-Always-Online -- specifically, transport-layer problems do not trigger "Always Online", and HTTP responses with status code 500 also don't. These problems are seen by "end users" (well, those HTTP clients trying to download).

I myself have Cloudflare in front of a single box and was often surprised how this setup does not magically heal transient problems in the backend.

Let's strive towards a solution where the backend is never in the hot path.

Typically, dist files like the ones we talk about here should be served by a real CDN like fastly (which sponsors the hosting for Python), cloudfront, Google Cloud CDN, etc.

@mhdawson
Member

mhdawson commented Apr 6, 2020

Are people still seeing the issue? I have yet to have a download be slow or fail.

@mkonikov

mkonikov commented Apr 6, 2020

In our CI, our build at 3:52:30 PM ET failed. Running another one now and will advise if we run into issues.

@targos
Member

targos commented Apr 6, 2020

@mhdawson It seems back to normal. The load on the server is much lower than before, SSH stopped lagging, and kswapd0 / ksoftirqd/0 no longer use the CPU.

@tholu

tholu commented Apr 6, 2020

@mhdawson Seems fast again for me now, just as I finished setting up simple caching of the needed artifacts. How did you fix it?

@mhdawson
Member

mhdawson commented Apr 6, 2020

Interestingly, it also seems like the cached bandwidth on Cloudflare, as a ratio of total traffic, is going back up. That started at around 4:30 PM EST today based on the graph. As of 5:15 PM EST the uncached share was down to 14%, versus 50% when we saw the bad performance. The uncached percentage started to go up around 7 PM EST on April 5th.

As @targos mentioned, the load on our machines also seems lower, which would make sense if more of the load is being served from the cache.

@tholu I wish I could say I fixed something but I've not changed anything. Was just ramping up looking at the logs and trying to figure out what was going on :).

There were some outages reported for cloudflare today, but nothing that seems related or with times that match the period of time we saw the higher uncached bandwidth.

Apr 6, 2020
Issues enabling DNSSEC
Resolved - Cloudflare has resolved the issue and services have resumed normal operation.
Apr 6, 13:44 UTC
Monitoring - Cloudflare has implemented a fix for this issue and is currently monitoring the results.

We will update the status once the issue is resolved.
Apr 6, 13:35 UTC
Investigating - Cloudflare is investigating issues with enabling DNSSEC.

This does not affect existing DNSSEC configurations, which are working as expected.

We are working to understand the full impact and mitigate this problem. More updates to follow shortly.
Apr 6, 12:53 UTC
Performance Issues in Frankfurt, Germany
Resolved - This incident has been resolved.
Apr 6, 12:12 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Apr 6, 12:05 UTC
Investigating - We are investigating possible performance issues in Frankfurt, Germany (FRA)
Apr 6, 11:29 UTC

Our Cloudflare logs are being pushed to Google storage and I've not yet found out how to get those. If things are working better now, then looking at those can probably wait until @rvagg is online and can take a look or point me to instructions/docs on how they are stored.

Looking further back in the access logs on our machine I do see lots of 500 errors for downloads. Tailing the access log file now I don't see any.

My main guess is that for some reason less of the traffic was served from the cache between 7 PM EST on April 5th and today around 4:30 PM EST, when it started to go back to normal.

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

A suggestion for improving caching here would be to increase the age on the cache control header for /dist/*.

As I noted in Slack, it seems these are currently served with a max-age of four hours, which seems incredibly low considering these files will never change. Why not set something much higher, like a year?

I have no insight into the Cloudflare config, but it's also worth checking that caching for these files is enabled there. Maybe set up a page rule to ensure they have a long cache configured via Cloudflare directly as well?

@mhdawson
Member

mhdawson commented Apr 6, 2020

Submitted ticket 1864572 with Cloudflare to see if they are aware of anything that might have caused this, or alternatively to ask for suggestions as to what we should look for in our configs that might have caused it.

@mhdawson
Member

mhdawson commented Apr 6, 2020

I don't see a lot of config options for the caching; the one that seems to match 4 hours is the browser cache TTL. My initial thought is that a different value for that wouldn't necessarily have helped. In a lot of cases, if downloads are from CIs, those probably start with a fresh environment that won't have the cache anyway.

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

Updating that will change caching for the entire nodejs.org site, which might not be desirable. And yes, that's controlling the browser cache, which wouldn't have much impact on CI, I agree.

I believe configuring a page rule should be possible though, which can be set to only target /dist/ and allow for a custom cache configuration within Cloudflare, so that Cloudflare will cache and serve assets for longer before going back to the origin server on DO to get a new copy.

@jgehrcke

jgehrcke commented Apr 6, 2020

I don't see a lot of config options for the caching,

@mhdawson

  • In the "origin" server's nginx config, tweak the Cache-Control headers emitted for the dist/ files -- CF will honor that, and can be told to cache the dist/ files for time scales much longer than 4 hours (docs).
  • Be sure to add a stale-if-error header, because the world does not want a single box in the hot path for Node.js downloads (see https://support.cloudflare.com/hc/en-us/articles/200168436, at the bottom: "Allowing your origin web server to determine which content to cache for display if your origin web server is offline"). A minimal config sketch follows this list.
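
A sketch of what that could look like in the origin's nginx config (the values are illustrative assumptions, not the project's current settings):

location /dist/ {
    # Released artifacts never change: let Cloudflare and browsers cache them for a long
    # time, and keep serving stale copies if the origin errors out (stale-if-error, RFC 5861).
    add_header Cache-Control "public, max-age=31536000, immutable, stale-if-error=604800";
}

# Mutable metadata such as index.json / index.tab needs a much shorter policy
# (see the caveat about those files raised later in this thread).
location ~ ^/dist/index\.(json|tab)$ {
    add_header Cache-Control "public, max-age=300";
}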

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

For example, the page rule might be configured with:

Target: nodejs.org/dist/*
Cache level: Cache everything
Edge cache ttl: a month
Browser cache ttl: a year

@targos
Member

targos commented Apr 6, 2020

Be careful with these settings. There are files under dist that should not be cached too long.
At least index.json and index.tab

@rvagg
Member

rvagg commented Apr 6, 2020

Just diving into this now; here are the variables so far that I can't reconcile:

VM metrics for our main web server over the past 7 days to right now:

20200407_do_www_metrics

The most interesting part of that is the spike in bandwidth that takes us up toward what I think is the same threshold we were hitting with DO before we went to full caching with CF; in theory we shouldn't be seeing that anymore, since CF is supposed to take that load from us. Zooming out to 30 days shows this as an anomaly.

But here's CF in the past 24 hours:

20200407_cf_24h_www_metrics

Zoomed out to 7 days:

20200407_cf_7d_www_metrics

Looks entirely normal.

Plus we do have load balancing with CF and the CF load balancing logs (available through the dashboard @mhdawson) show zero events! So at no point does CF admit to having to switch between our primary+secondary.

So two puzzles:

  • Why is DO bandwidth spiking like this?
  • Why isn't CF reacting to unresponsiveness if there is in fact unresponsiveness on the backend?

I want to blame both of them, but that's kind of weird. For now at least it seems to have subsided.

One scenario where this might be explainable:

  • Someone has our backend address and is hitting it directly (DoS?)
  • It's not being hit quite enough to trigger CF LB thresholds

I just don't understand that second bit: if users are experiencing such pain, why wouldn't CF's LB algorithm be kicking in to deal with it?

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

If modifying the origin is an option, then setting a long max-age for dist, as well as including immutable & stale-if-error, should encourage Cloudflare to do much more aggressive caching of these assets.

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

Someone has our backend address and is hitting it directly (DoS?)

Cloudflare publishes their full IP ranges; if this hasn't been done already, why not configure the origin with a firewall that only lets those in?

@nschonni
Member

nschonni commented Apr 6, 2020

@rvagg
Member

rvagg commented Apr 7, 2020

Interestingly, it also seems like the cached bandwidth on Cloudflare, as a ratio of total traffic, is going back up. That started at around 4:30 PM EST today based on the graph. As of 5:15 PM EST the uncached share was down to 14%, versus 50% when we saw the bad performance. The uncached percentage started to go up around 7 PM EST on April 5th.

I couldn't see this in the graphs by the time I got to the Cloudflare dashboard -- probably because of the way it aggregates, so the 24-hour period I was looking at lagged @mhdawson's observation.

So, I've used the Cloudflare API to pull out as much data as I could. Unfortunately we have a limit on the number of days we can go back, so I can't see a weekly pattern (which is usually important for our download stats).

1-minute intervals of cachedBytes / bytes, plotted to show the spread, give a very interesting picture. Times are UTC:

20200407_cf_cached_percent_1m_samples

We have regular cache purges of nodejs.org content; it's a pretty blunt instrument that we use when we deploy new assets to the website. It's something we need to improve, and it explains some variation in cached %. You can see some dips in the graph for previous days that probably (we could check) line up with the deployment of nightly and v8-canary builds that happen daily and cause a purge. But the recovery is quick. The anomaly on this graph shows massive cache invalidation, huge variation across time, and no linear recovery pattern; it's just on, then off and back to normal on the right side of the graph.

... Therefore, I'm inclined to believe there was some hiccup in Cloudflare's caching process that caused this anomalous block. That, in turn, hammered our server with a pattern very similar to what we had last year, which finally forced us to fully front our servers with Cloudflare. But we didn't get any load balancing failover attempts -- perhaps our health checks aren't aggressive enough? Perhaps the difference is that a health check on /en/ is trivial because that content is in memory for nginx, while everyone who was trying to download a binary had to reach to disk, so the health check wasn't able to exercise the full I/O on the server and measure its real stress levels.

@rvagg
Member

rvagg commented Apr 7, 2020

Since this is no longer "active" I'll close this and we can move diagnosis discussion to #2264

@johnpangalos
Author

Thanks again for fixing this, really appreciate it!

@Trott unpinned this issue Apr 7, 2020