Download slow, breaking CI #2263
What I've seen is that nodejs.org itself has been erroring out with error 500 for the past 30-40 minutes; now it seems to be up again.
Downloads still seem to be slow unfortunately, even though the site is up.
Also getting this in my Azure Pipeline when trying to install Node 13.x
Same error when trying to install Node 12.16.1 from ubuntu-latest on Azure Pipelines.
I think the problem is that when the download takes too long, the Node Azure Pipelines task times out or is unable to use a half-downloaded file. Trying the workaround of just using the Node apt package (I also use Ubuntu).
It's being tracked here: nodejs/node#32683
OK, I'll close this then.
This is still technically a problem, and the more eyes on it the better.
This is causing a lot of problems with CI/build tools, and in cases where these are charged for computation time, it is costing people a lot of money.
This sort of comment is really not okay. It's not the build team's problem nor responsibility if people haven't bothered to set up some local redundancy/caching and the resulting failures are costing them money, especially for a free service. Everyone's business/infrastructure continuity is their own responsibility. I'm sure they have enough on their mind already, without additional pressure from "this is costing us money!" type comments.
This is fair. I do think it's important to recognize the increased frustration that comes with the size and impact of Node.js. The larger an open source project becomes, the more responsibility it carries, and I think it's fair to hold a project the size of Node.js to a higher standard than most open source projects. It's nearly impossible in this day and age to have zero dependencies and 100% redundancy, and the reality is that this is costing people money. Yes, in an ideal world, redundancy and caching would exist on the consumer side, but a lot of systems don't have very good support for that, and this is not an ideal world. The Node.js team should take some responsibility for this, though it's also unfair to place all the blame on them. I think most of the frustration is due to the lack of updates from the Node.js team, and I believe that frustration is warranted. For me personally, all I ask is more transparency and updates from the Node.js team.
To a degree, yes. But hardware breaks, servers break, and especially when something isn't a service that people are actually paying for and have an availability agreement for, I don't think it's reasonable to expect equivalent service to a commercial service, like people seem to do here (and in other threads about the issue).
Sure. But it's one's own decision to actually use such imperfect systems without petitioning the vendors to fix them, and the costs for that decision/tradeoff should be shouldered by those picking the systems, not by some random third party that had nothing to do with the decision (Node.js core, in this particular case). Actually, it's quite bizarre to me that people's build processes are apparently downloading the same thing over and over again without bothering to cache things locally in the first place, needlessly costing the nodejs.org operators money for the bandwidth.
An update was provided here. Presumably the status has not changed since. I'm not sure what other updates people are expecting from something that is, again, a free service.
Yeah. I think this is the kind of thing that deserves a post-mortem with concrete steps to stop this from happening again, but for the moment, if there’s anything happening then you’ll see it here.
💯
A status page would likely help immensely with these communication issues and provide a canonical source of truth for updates about an incident: #2265
From top it does not look particularly bad, although if Cloudflare was caching most of the downloads I'm not sure I'd expect a load average of 2:
top - 20:23:11 up 164 days, 17:49, 1 user, load average: 2.22, 1.87, 1.80
Tasks: 268 total, 3 running, 265 sleeping, 0 stopped, 0 zombie
%Cpu0 : 16.6 us, 10.3 sy, 0.0 ni, 71.5 id, 1.3 wa, 0.0 hi, 0.0 si, 0.3 st
%Cpu1 : 10.3 us, 12.6 sy, 0.0 ni, 73.5 id, 1.6 wa, 0.0 hi, 1.6 si, 0.3 st
%Cpu2 : 14.0 us, 14.0 sy, 0.0 ni, 70.0 id, 1.3 wa, 0.0 hi, 0.3 si, 0.3 st
%Cpu3 : 13.6 us, 9.9 sy, 0.0 ni, 63.9 id, 1.3 wa, 0.0 hi, 10.9 si, 0.3 st
%Cpu4 : 0.0 us, 0.3 sy, 0.0 ni, 97.7 id, 0.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu5 : 0.6 us, 0.6 sy, 0.0 ni, 30.6 id, 0.0 wa, 0.0 hi, 67.4 si, 0.6 st
KiB Mem : 16432060 total, 1626516 free, 1394780 used, 13410764 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14181820 avail Mem
I'm sure lots of people in this thread would take the opportunity to set up a local cache/proxy for this (similar to Nexus for Maven packages). Ideally there would be something like a local Squid proxy cache instance, as in most build pipelines the URLs to the Node.js binaries are often not easily changeable. Major problem: proxying and caching requests to e.g. https://nodejs.org/dist/v13.6.0/node-v13.6.0-linux-x64.tar.gz is not trivial, as it's served via HTTPS. @joepie91 Do you have some pointers on how to set this up properly? Thank you!
I just downloaded 4-5 binaries without a problem. Is there something more specific than downloading from nodejs.org/en/download to trigger this?
I can trigger this by downloading from that address.
I suspect that in most cases people will be using proprietary hosted CI services, in which case the first place to look would be that specific service's capabilities for caching sources; I can't advise there, unfortunately. For those who control their own infrastructure, I think the easiest approach would be to reverse-proxy to nodejs.org, exposing that on one's own subdomain. Optionally, basic auth could be used, if the proxying server is on a public network, to avoid having half the world proxy through it. For caching npm, which is another likely failure point, there's off-the-shelf software such as Verdaccio: https://verdaccio.org/
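As a rough illustration of that reverse-proxy-with-cache idea, here is a minimal sketch in Python (nginx, Squid or any real proxy would be the more robust choice). The port, cache directory and upstream host are arbitrary choices for the example, not anything the Node.js project provides:

```python
# Sketch of a tiny caching pass-through for nodejs.org downloads.
# The port and cache directory are illustrative; there is no eviction
# or error handling, so treat this as a starting point only.
import hashlib
import http.server
import pathlib
import urllib.request

UPSTREAM = "https://nodejs.org"
CACHE_DIR = pathlib.Path("./node-dist-cache")
CACHE_DIR.mkdir(exist_ok=True)

class CachingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Derive a cache key from the request path (e.g. /dist/v13.6.0/...).
        key = hashlib.sha256(self.path.encode()).hexdigest()
        cached = CACHE_DIR / key
        if not cached.exists():
            # First request for this path: fetch from nodejs.org and store on disk.
            with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                cached.write_bytes(resp.read())
        body = cached.read_bytes()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # CI jobs would then download from http://<this host>:8080/dist/... instead.
    http.server.HTTPServer(("0.0.0.0", 8080), CachingHandler).serve_forever()
```

Because the proxy terminates its own connection, the HTTPS problem mentioned above goes away: clients talk to the mirror host, and only the mirror talks to nodejs.org.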
Maybe look at any Cloudflare logs you might have access to?
The rate of new entries in /var/log/nginx/access.log does not seem to be too high, maybe 10-15 per minute, so most of the traffic must still be being served by Cloudflare.
Or look for 500s in the server logs? I'm being a bit simplistic here, but I imagine we shouldn't get 500s. It's a big ol' static site, as far as I know.
Aren't 500s stored on
In our CI we get this often:
or
|
@mhdawson have a look at
Going to see if there is anything in the Cloudflare logs.
^--- that's interesting, do we always serve over HTTP/2? Is it possible something changed wrt. that?
One thing on Cloudflare that stands out so far is that for the last month only 12% of the traffic was uncached, and the stat for the last week is similar. Uncached traffic for the last 6 hours is 48%.
I don't think that Cloudflare in front of a single box is the right solution for serving these dist files when robustness is the goal. Even small hiccups in the backend are exposed to end users, and the single machine we are talking about here is in the hot path more often than it should be. For Cloudflare to fall back to serving from cache while the backend is not available, very specific conditions need to be met: https://support.cloudflare.com/hc/en-us/articles/200168436-Understanding-Cloudflare-Always-Online -- specifically, transport-layer problems do not trigger "Always Online", and neither do HTTP responses with status code 500. These problems are seen by "end users" (well, the HTTP clients trying to download). I myself have Cloudflare in front of a single box and have often been surprised at how this setup does not magically heal transient problems in the backend. Let's strive towards a solution where the backend is never in the hot path. Typically, dist files like the ones we talk about here should be served by a real CDN like Fastly (which sponsors the hosting for Python), CloudFront, Google Cloud CDN, etc.
Are people still seeing the issue? I have yet to have a download be slow or fail.
In our CI our build at
@mhdawson It seems back to normal. The load on the server is much lower than before, SSH stopped lagging and
@mhdawson Seems fast again now for me, just in time, as I had set up simple caching of the needed artifacts. How did you fix it?
Interestingly, it also seems like the cached bandwidth on Cloudflare as a ratio of total traffic is going back up. That started at around 4:30 PM EST today based on the graph; as of 5:15 PM EST the uncached share was down to 14%, versus 50% when we saw the bad performance. The uncached percentage started to go up around 7 PM EST on April 5th. As @targos mentioned, the load on our machines also seems lower, which would make sense if more of the load is being cached. @tholu I wish I could say I fixed something, but I've not changed anything; I was just ramping up looking at the logs and trying to figure out what was going on :). There were some outages reported for Cloudflare today, but nothing that seems related or with times that match the period when we saw the higher uncached bandwidth.
Our Cloudflare logs are being pushed to Google storage and I've not yet found how to get those; if things are working better now, then looking at those can probably wait until @rvagg is online and can take a look or point me to instructions/docs on how they are stored. Looking further back in the access logs on our machine, I do see lots of
My main guess is that for some reason less of the traffic was served from the cache between 7 PM EST on April 5th until today around 4:30 PM EST, when it started to go back to normal.
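For reference, once the bucket details are known, pulling Cloudflare Logpush files out of Google Cloud Storage can be done with the official client library; the bucket name and prefix below are placeholders, not the project's real ones:

```python
# Sketch: list and download Cloudflare Logpush objects from Google Cloud Storage.
# BUCKET and PREFIX are hypothetical placeholders; the real values would come
# from whoever configured the Logpush job.
from google.cloud import storage  # pip install google-cloud-storage

BUCKET = "example-cloudflare-logs"   # placeholder bucket name
PREFIX = "http_requests/20200406/"   # placeholder date-based prefix

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    print(blob.name, blob.size)
    # Each object is gzip-compressed NDJSON, one HTTP request record per line.
    blob.download_to_filename(blob.name.replace("/", "_"))
```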
A suggestion for improving caching here would be to increase the age on the Cache-Control header for /dist/*. As I noted in Slack, it seems these are currently served with a max-age of four hours, which seems incredibly low considering these files will never change. Why not set something much higher, like a year? I have no insight into the Cloudflare config, but it's also worth checking that caching for these files is enabled there. Maybe set up a page rule to ensure they have a long cache configured via Cloudflare directly as well?
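Checking the current behaviour is easy enough; a quick diagnostic sketch like the one below (using the dist URL already quoted in this thread) prints the relevant response headers, including Cloudflare's cf-cache-status hit/miss indicator:

```python
# Sketch: inspect the caching headers returned for one of the dist files.
# cf-cache-status is the Cloudflare response header indicating HIT/MISS/EXPIRED.
import urllib.request

url = "https://nodejs.org/dist/v13.6.0/node-v13.6.0-linux-x64.tar.gz"
req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    for header in ("cache-control", "cf-cache-status", "age", "expires"):
        print(f"{header}: {resp.headers.get(header)}")
```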
Submitted ticket 1864572 with Cloudflare to see if they are aware of anything that might have caused this, or alternatively to ask for suggestions as to what we should look for in our configs that might have caused it.
I don't see a lot of config options for the caching; the one that seems to match 4 hours is the browser cache TTL. My initial thought is that a different value for that would not necessarily have helped: in a lot of cases, if downloads are from CIs, those probably start with a fresh environment that won't have the cache anyway.
Updating that will change caching for the entire nodejs.org site, which might not be desirable. And yes, that's controlling the browser cache, which wouldn't have much impact on CI, I agree. I believe configuring a page rule should be possible though, which can be set to only target /dist/ and allow for a custom cache configuration within Cloudflare, so that Cloudflare will cache and serve assets for longer before going back to the origin server on DO to get a new copy.
For example, the page rule might be configured with: Target: nodejs.org/dist/*
Be careful with these settings. There are files under
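If such a rule were created via the API rather than the dashboard, it might look roughly like the sketch below, assuming the Cloudflare v4 page-rules endpoint and its targets/actions schema. The zone ID and token are placeholders, the 30-day edge TTL is only an example, and per the caveat just above, mutable files under /dist (like index files) would need an exclusion or a separate, shorter-lived rule:

```python
# Sketch: create a long-lived caching page rule for /dist/* via the
# Cloudflare v4 API. ZONE_ID and API_TOKEN are placeholders, and the TTL
# value deliberately ignores the mutable files mentioned above.
import json
import urllib.request

ZONE_ID = "your-zone-id"      # placeholder
API_TOKEN = "your-api-token"  # placeholder

rule = {
    "targets": [{"target": "url",
                 "constraint": {"operator": "matches", "value": "nodejs.org/dist/*"}}],
    "actions": [{"id": "cache_level", "value": "cache_everything"},
                {"id": "edge_cache_ttl", "value": 30 * 24 * 3600}],
    "status": "active",
}

req = urllib.request.Request(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/pagerules",
    data=json.dumps(rule).encode(),
    headers={"Authorization": f"Bearer {API_TOKEN}",
             "Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```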
Just diving into this now; here are the variables so far that I can't reconcile. VM metrics for our main web server over the past 7 days to right now: the most interesting part of that is the spike in bandwidth that takes us up toward what I think is the same threshold we were hitting with DO before we went full caching with CF; in theory we shouldn't be seeing that anymore, since CF is supposed to take that load from us. Zooming out to 30 days shows this as an anomaly. But here's CF in the past 24 hours, and zoomed out to 7 days: it looks entirely normal. Plus we do have load balancing with CF, and the CF load balancing logs (available through the dashboard, @mhdawson) show zero events! So at no point does CF admit to having to switch between our primary + secondary. So, two puzzles:
I want to blame both of them, but that's kind of weird. For now at least it seems to have subsided. One scenario where this might be explainable:
I just don't understand that second bit: if users are experiencing such pain, why wouldn't CF's LB algorithm be kicking in to deal with it?
If modifying the origin is an option, then setting a long max-age for dist, as well as including immutable and stale-if-error, should encourage Cloudflare to do much more aggressive caching of these assets.
Cloudflare has their full IP ranges public; if this hasn't been done already, why not configure the origin with a firewall that only lets those in?
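For that firewall idea, Cloudflare publishes its ranges as plain-text lists at www.cloudflare.com/ips-v4 and /ips-v6. A small script can turn those into allow rules; printing ufw commands here is an arbitrary choice, and a real setup would also need a default-deny policy and whatever other ports (SSH, etc.) the origin requires:

```python
# Sketch: fetch Cloudflare's published IP ranges and print firewall rules
# that would restrict HTTPS on the origin to those ranges. The commands are
# printed rather than executed; adapt to whatever firewall the origin uses.
import urllib.request

RANGE_URLS = [
    "https://www.cloudflare.com/ips-v4",
    "https://www.cloudflare.com/ips-v6",
]

for url in RANGE_URLS:
    with urllib.request.urlopen(url) as resp:
        ranges = resp.read().decode().split()
    for cidr in ranges:
        # Allow Cloudflare edges to reach the origin on 443; everything else
        # would be dropped by the (not shown) default-deny policy.
        print(f"ufw allow from {cidr} to any port 443 proto tcp")
```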
The nginx config is over here: https://raw.githubusercontent.com/nodejs/build/master/ansible/www-standalone/resources/config/nodejs.org
I couldn't see this in the graphs by the time I got to the Cloudflare dashboard, probably because of the way it aggregates, so the 24-hour period I was looking at was offset from @mhdawson's observation. So I've used the Cloudflare API to pull out as much data as I could. We have a limit on the number of days we can go back, unfortunately, so I can't see a weekly pattern (which is usually important for our download stats). 1-minute intervals of
We have regular cache purges of nodejs.org content, and it's a pretty blunt instrument that we use when we deploy new assets to the website. It's something we need to improve, and it explains some variation in cached %. You can see some dips in the graph for previous days that probably (we could check) line up with the deployment of nightly and v8-canary builds that happen daily and cause a purge. But the recovery is quick. The anomaly on this graph shows massive cache invalidation, huge variation across time, and no linear recovery pattern; it's just on, then off and back to normal on the right side of the graph. ... Therefore, I'm inclined to believe there's some hiccup in Cloudflare's caching process that caused an anomalous block. That, in turn, hammered our server with a pattern very similar to what we had last year that finally forced us to fully front our servers with Cloudflare. But we didn't get any load balancing failover attempts--perhaps our health checks aren't aggressive enough? Perhaps the difference between a health check on
Since this is no longer "active" I'll close this and we can move diagnosis discussion to #2264 |
Thanks again for fixing this, really appreciate it!
Hello,
I noticed that the download speed of Node.js has become rather slow, which breaks our Azure pipeline task. Basically, it times out and our build doesn't continue.
Any help with this would be appreciated and let me know if you need anything from me, thanks!
Azure task