Status page for NodeJS #2265

Closed
MattIPv4 opened this issue Apr 6, 2020 · 52 comments

@MattIPv4
Member

MattIPv4 commented Apr 6, 2020

Latest status for the new status page

The experimental and very W.I.P status page is now available at https://status.nodejs.org/

We are currently working on establishing the policy for what should be posted on the status page and who should have access to it.

Open tasks & questions

Any discussion not related to these specific topics should be kept in this issue (unless a dedicated issue is then created for said topic).

Draft PRs

These PRs should not be landed until the status page is ready to go completely.


Original issue

Hey folks,

The current crisis (nodejs.org intermittently returning 500s) has prompted a thought that perhaps having a proper status page for NodeJS would be a great canonical location to be able to communicate with the community that the NodeJS folks are aware of issues.

This would help with reducing the number of issues being created for the incident and would reassure the wider NodeJS community that folks here are looking into any issues.

I'm not proposing any automation here (unless someone wishes to add metrics) -- automation of status pages is often frowned upon, as it then isn't that helpful, only showing what folks already know, instead of acting as an acknowledgement that humans on the project know something is wrong.

As NodeJS is an established open source project, it should be possible to get an open-source license from Atlassian to use StatusPage: https://www.atlassian.com/software/views/open-source-license-request -- I have previously done this for cdnjs (status.cdnjs.com) and it works perfectly.


As a POC, assuming blessing from NodeJS folks, I would be happy to reach out to Atlassian and start the licensing process. Once acquired, I'm quite happy to stand up a POC status page on status.node.js.org (js.org, not the node-owned domain) and build out the page (styling, components, etc.).

Once that's done, the page can then be reviewed and if happy with it, it can be moved over to status.nodejs.org (all that's required is a CNAME to the statuspage site), and folks that need access can be invited.

I think it'd be best to invite as many NodeJS folks as possible, so that during an incident hopefully someone will be around with access to post updates to the status page.

@daniel-white

It would be great if there were a Twitter account that one could follow for the status updates, akin to githubstatus.

@MattIPv4
Member Author

MattIPv4 commented Apr 6, 2020

Yeah, totally. StatusPage comes with Twitter support (@cdnjsStatus for example). I think that'd likely be a step after getting the initial status page stood up, but shouldn't be hard.

@joepie91

joepie91 commented Apr 6, 2020

> As NodeJS is an established open source project, it should be possible to get an open-source license from Atlassian to use StatusPage: https://www.atlassian.com/software/views/open-source-license-request -- I have previously done this for cdnjs (status.cdnjs.com) and it works perfectly.

I don't think it's necessary (nor desirable) to add proprietary infrastructure like this, to what's supposed to be an open-source project.

There are plenty of open-source off-the-shelf status page systems readily available, such as Cachet.

@MattIPv4
Member Author

MattIPv4 commented Apr 6, 2020

Sure, using an open-source alternative would be an option. This though then requires more infra on the NodeJS side to run it and could end up in the same basket as the incident itself if there is a major outage.

StatusPage is the industry standard for this, has a free licensing option dedicated to open source projects such as NodeJS, and is off-site so that it will very likely remain up even if NodeJS has a major outage.

@joepie91

joepie91 commented Apr 6, 2020

> This though then requires more infra on the NodeJS side to run it and could end up in the same basket as the incident itself if there is a major outage.

It's commonplace to host the status page on different infrastructure than the thing that the status page is for. That can be something as simple as a $3/month VPS at a random provider. I don't think this is an unreasonable ask.

Sure, the status page can still go down, but the same is true for a hosted service. In fact, a hosted service would give you less control over infrastructure separation, since the provider may choose to internally move their systems to the same infrastructure that your things are running on, and you cannot control that.

> has a free licensing option dedicated to open source projects such as NodeJS

Sure, but it is still proprietary infrastructure, plus such a license could be revoked at any time with no recourse. IMO there's a pretty high bar to consider making an open-source project depend on proprietary infrastructure, considering that it negatively affects, e.g., the forkability of a project (as the infrastructure of a project can't necessarily be freely replicated anymore).

Considering that there are polished off-the-shelf open-source options available here, and hosting them is low-complexity, I don't think that bar is met here. And it's a bit strange to me to immediately suggest using a proprietary service without investigating open-source options first.

Edit: Would those liberally sprinkling downvotes around care to make an actual comment? Just downvoting things is not helpful at all.

@MattIPv4
Member Author

MattIPv4 commented Apr 6, 2020

I don't disagree that using an OSS option has its benefits, as you've outlined.

However, as I mentioned before, StatusPage is the industry standard for this and has a rather proven track record for being a good option. There's a reason they're used by a large body of reputable companies (e.g. https://www.cloudflarestatus.com/, https://status.digitalocean.com/, https://www.githubstatus.com/).

I've gone through a very similar process to this when setting cdnjs up with a status page, and StatusPage became the chosen option then, so it has my absolute recommendation here.

@joepie91

joepie91 commented Apr 6, 2020

> I've gone through a very similar process to this when setting cdnjs up with a status page, and StatusPage became the chosen option then, so it has my absolute recommendation here.

Do you have any documentation of the process and the considered options and their pros and cons anywhere?

Technology decisions don't generally carry over between projects because different projects have different requirements, so it's important to look at the actual underlying considerations here.

@MattIPv4
Member Author

MattIPv4 commented Apr 6, 2020

> Do you have any documentation of the process and the considered options and their pros and cons anywhere?

Nothing documented, as this decision was made internally. Essentially though the considerations were costs involved, reliability & confidence that the option would work out of the box. StatusPage won out there as it was free, is incredibly reliable as demonstrated through its mass of customers and I trusted it to just work, again, based on the incredible number of reputable customers.

@sam-github
Contributor

This will come up in the retrospective, #2264, but from my point of view we have issues pinned at the top of https://github.com/nodejs/node/issues and (since I just did it) at the top of https://github.com/nodejs/build/issues. I appreciate that people would like notifications from their tool of choice (status page, twitter, etc.), but that seems a big ask for a project of this size (size in terms of people volunteering to help maintain the infrastructure).

@MattIPv4
Member Author

MattIPv4 commented Apr 6, 2020

I personally don't think a pinned issue on one or two repositories should be regarded as sufficient communication/notification for such incidents.

A status page would appear to the average user as a far more official and canonical source of information for incidents (imo) and would allow folks to quickly find official updates, which is rather hard in the messy issues full of "also seeing this" etc.

> but that seems a big ask for a project of this size (size in terms of people volunteering to help maintain the infrastructure).

I honestly don't see this as a big ask at all. Assuming the use of StatusPage, there isn't really any requirement for folks to maintain any infra, StatusPage look after all that. The only other ask would then be for someone to ensure an incident is posted on there when something goes wrong and to update it as appropriate until the incident is resolved.

@fishcharlie

Wouldn't StatusPage also help alert the Node.js team members in a quicker fashion? Maybe that could help jump start on fixing things instead of having to wait for user reports of a problem?

I do agree tho. For outages, GitHub issues doesn't feel like an obvious place I'd check for that information.

@joepie91

joepie91 commented Apr 6, 2020

> A status page would appear to the average user as a far more official and canonical source of information for incidents (imo) and would allow folks to quickly find official updates, which is rather hard in the messy issues full of "also seeing this" etc.

Can this not just be addressed by pinning and locking the thread?

@MattIPv4
Member Author

MattIPv4 commented Apr 6, 2020

> Wouldn't StatusPage also help alert the Node.js team members in a quicker fashion? Maybe that could help jump start on fixing things instead of having to wait for user reports of a problem?

Twofold, depending on whether automation is used. If automation is used and detects the outage, it could be set up to automatically mark a component as having an issue. Anyone should then be able to subscribe to that to get alerts through their service of choice.

Equally, StatusPage has really good support for firing off alerts whenever an incident is posted or updated, which again anyone could subscribe to so that they can be notified when something happens.

> Can this not just be addressed by pinning and locking the thread?

Sure. That helps a significant amount (though it hasn't been done here and resulted in a very messy set of threads). However, as I said above I personally don't feel that GitHub issues are the correct medium to be used for declaring & tracking incidents of this scale for the greater community.

@sam-github
Contributor

> The only other ask would then be for someone to ensure an incident is posted on there when something goes wrong and to update it as appropriate until the incident is resolved.

@MattIPv4 Not to put too fine a point on it, but you've seen the team size, are you volunteering to do this? If not, which of the half-dozen or less active build-wg members are you proposing to be responsible for this, in addition to everything they already do?

https://github.com/nodejs/build/issues <--- 158 open issues, including this one. There are lots of things we'd like to do, but without people, progress will remain slow. That doesn't make us any happier than it makes anyone else, but that's where we are for the moment.

@MattIPv4
Member Author

MattIPv4 commented Apr 6, 2020

Absolutely would be more than happy to take that task on (as well as setting it all up, just need approval to get started on that) :)

As I mentioned in the opening post though, I think access should be given to as many folks as possible, so that there is a much higher chance that a first responder to any incident can post to the status page if needed.

@MylesBorins
Contributor

Hey @brianwarner would you be able to fill out https://www.atlassian.com/software/views/open-source-license-request to get the Node.js project an open source license so we can start experimenting with status-page to see if we want to adopt it as a project?

Feel free to ping me privately if you have any questions

@MylesBorins
Contributor

@MattIPv4 would you want to present on how you have used status page in the past? The build working group would love to hear from you. We can find a time to meet that is flexible to your schedule.

@MattIPv4
Member Author

Not the greatest at presenting, I must admit. Would be happy to have more of a chat though to answer Qs folks might have about the use of status page (both how I've used it in the past, but also how it'd fit in with nodejs), if that works for y'all?

Maybe it'd work best as a general meeting for folks to discuss this issue and #2274?

@MylesBorins
Contributor

@MattIPv4 maybe you and I (and others interested) could just have a short call to ask / answer questions. Would that be a better ask?

@MattIPv4
Member Author

Sure thing, @MylesBorins. Would feel far more comfortable with that over presenting :)

@brianwarner

It looks like we need a trial instance before we can request a full license. This appears to be free, and would help Atlassian evaluate the full open source license.

Can I suggest setting this up first, and then I can run with it? https://manage.statuspage.io/signup

@MattIPv4
Member Author

MattIPv4 commented Apr 14, 2020

I already have a trial instance created (https://nodejs1.statuspage.io/), but am not overly familiar with Node infra so unsure what exactly will want to be added to it.

For now I just have two components: "Website" & "Downloads" (whilst technically the same thing, I think there might be times where differentiating would be helpful?).

@MylesBorins
Contributor

@MattIPv4 was nodejs.statuspage.io not available? We likely need to register with a foundation email address, probably a mailing list that has the build team subscribed to it?

Thoughts @nodejs/build and @brianwarner

@MattIPv4
Member Author

MattIPv4 commented Apr 14, 2020

It would appear nodejs.statuspage.io was already taken. Once we start the licensing process we can likely ask the status page folks to release that and move us over to it, though it shouldn't matter assuming we use a CNAME for status.nodejs.org.

Happy to invite a foundation email and set it as the owner of the page if needed!

Edit: Worth noting that the free trial version of StatusPage is rather limited; for example, I can't apply custom styling to the site and can only invite one other email address.

@sam-github
Contributor

@MattIPv4 how are the downloads known to be "up" on that page? Occasional polls, or somehow asking Cloudflare?

From the last outage, my understanding is that randomly hitting the site was likely to work, because it usually served files, and only failed occasionally, though often enough to break CIs doing continuous downloads. Or perhaps I misunderstand the symptoms?
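
To illustrate the concern, here is a minimal sketch (not anything StatusPage provides) of classifying a window of repeated probes by failure rate instead of trusting a single check. The function names and thresholds are hypothetical, and the status strings are only intended to mirror the component states a status page typically exposes.

```python
# Hypothetical status labels, intended to mirror typical component states.
OPERATIONAL = "operational"
DEGRADED = "degraded_performance"
PARTIAL_OUTAGE = "partial_outage"
MAJOR_OUTAGE = "major_outage"


def failure_rate(status_codes):
    """Return the fraction of probes that came back with an HTTP 5xx status."""
    if not status_codes:
        raise ValueError("at least one probe result is required")
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def classify(status_codes, degraded_at=0.01, partial_at=0.10, major_at=0.50):
    """Map a window of probe results to a status label.

    The thresholds are illustrative assumptions, not recommended values.
    """
    rate = failure_rate(status_codes)
    if rate >= major_at:
        return MAJOR_OUTAGE
    if rate >= partial_at:
        return PARTIAL_OUTAGE
    if rate >= degraded_at:
        return DEGRADED
    return OPERATIONAL


# A single successful probe looks healthy...
single_check = classify([200])

# ...while 200 probes with 6 intermittent 500s (a 3% failure rate) do not.
window_check = classify([200] * 194 + [500] * 6)
```

Under these assumed thresholds a lone probe would report the downloads as healthy, whereas the sampled window flags them as degraded, which is closer to what CI systems doing continuous downloads would experience.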

@MattIPv4
Member Author

The status page is now live on the Node.js subdomain: https://status.nodejs.org/

Does anyone have suggestions for what components should be listed here, or is website & downloads enough?

@AshCripps
Member

@MattIPv4 Does status page have like a list of integrations etc available?

Think it would be good if we could include maybe some CI machine statuses (like machine X is unpingable, unavailable in Jenkins etc.) or even the status of our CI (@sam-github has been looking for some way to monitor our dailies better for some time) or whether the last nightly build failed.

I think it would be a good idea to have as much available in a single place as possible to help cut down investigation or things not being noticed until it really hits the fan. But I don't know if there is some of this information that we don't want publicly available for any reason.

thoughts @nodejs/build

@MattIPv4
Member Author

For automation of components, these help docs cover the services they recommend for it (though anything that can send an email can be used to automate a component): https://help.statuspage.io/help/automating-components

I think it'll be important on the status page though that we distinguish between components that have automation and those that don't, so it's obvious to external folks whether we've marked something as having an outage (and thus are aware it's broken) or whether automation has done it.

@bnb

bnb commented Apr 20, 2020

@AshCripps I investigated statuspage when this happened and noticed that you can also have your dependencies' statuspage sites in there if they have one - may be worth it to add Cloudflare and any other services we use that have one?

@MattIPv4
Member Author

There is a long list of third-party components that we can add, however, it essentially just exposes the components that their status page has. So, we can add any of the components from their status page to ours.

Eg. Cloudflare's components we can add: https://i.cdnjs.dev/toazr3jKmu.png

@bnb

bnb commented Apr 20, 2020

> There is a long list of third-party components that we can add, however, it essentially just exposes the components that their status page has. So, we can add any of the components from their status page to ours.

Given that there have been multiple instances where at least Cloudflare has been the cause of the issue for us, it's probably worth at the very least adding relevant/all components to the list - especially if we can section them off.

@MattIPv4
Member Author

MattIPv4 commented Apr 20, 2020

Yeah, we can add them to a group on our page. What are the relevant parts of Cf for Node?

  • CDN/Cache
  • DNS Root Servers
  • Cloudflare Authoritative DNS
  • Load Balancing and Monitoring
  • Cloudflare Logs

I guess there isn't any point in including specific Cf regions.

Do we also want to add some DigitalOcean components? Maybe the region(s) where the nodejs.org origins live, overall networking?

@targos
Member

targos commented Apr 20, 2020

Is it easy to allow multiple people to create and update incidents?
I think at least @nodejs/tsc @nodejs/community-committee and @nodejs/build should have access.

@MattIPv4
Member Author

MattIPv4 commented Apr 20, 2020

We have a limit of 25 email addresses that we can add (2 already used for myself & foundation ops). I'm happy to add 23 folks across teams that we think should have access.

(It should be noted that access is simply "access", they can do anything on the page.)

An alternative, to be able to grant more folks access, would be to look at setting up a shared email login for each team or similar?

@sam-github
Contributor

Following on from #2299 (comment)

Ideally, a GH team could be used as an auth source, or even better, an issue in a specific repo would be enough to drive status page changes (with repo access controlled by a team).

Whether that is possible or not, before considering "who", the policy of "when the page should be updated, and with what information" should be described. Then people can request access knowing what the access is to be used for.

@MattIPv4
Member Author

StatusPage have quite a powerful API so we could certainly look at integrating GitHub with it to some degree: https://developer.statuspage.io/

I think it'd be more likely that we write an implementation to control access based on a GitHub team. Completely creating our own solution for creating & updating incidents via issues would likely be rather complicated (especially controlling what components get marked as affected, as well as what notifications to send), and StatusPage already has a good UI for this.

However, by relying on the StatusPage site, we do limit access to a set of 25 email addresses.
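
As a rough sketch of what controlling access based on a GitHub team could involve, the planning step below works out who to invite and who to remove while respecting the 25-address cap. It deliberately makes no API calls: fetching the team's emails and applying the plan are left out, and the function name, the `reserved_emails` idea, and the example addresses are all assumptions made for illustration.

```python
SEAT_LIMIT = 25  # the cap on email addresses mentioned in this thread


def plan_access_sync(team_emails, current_emails, reserved_emails,
                     seat_limit=SEAT_LIMIT):
    """Plan status page access changes from a team roster.

    team_emails: addresses of people who should have access (hypothetical input).
    current_emails: addresses that currently have access.
    reserved_emails: addresses never removed automatically (e.g. page owner).
    Returns (to_invite, to_remove, skipped) as sorted lists.
    """
    team = set(team_emails)
    current = set(current_emails)
    reserved = set(reserved_emails)

    # Remove anyone with access who is neither on the team nor reserved.
    to_remove = sorted(current - team - reserved)

    # Seats left once removals are applied.
    kept = current - set(to_remove)
    free_seats = max(seat_limit - len(kept), 0)

    # Invite team members without access, in a stable order, until seats run out.
    candidates = sorted(team - current)
    to_invite = candidates[:free_seats]
    skipped = candidates[free_seats:]
    return to_invite, to_remove, skipped


# Example: two reserved seats, one stale account, and a 30-person team.
reserved = ["owner@example.org", "ops@example.org"]
current = reserved + ["former-member@example.org"]
team = [f"member{i:02d}@example.org" for i in range(30)]

to_invite, to_remove, skipped = plan_access_sync(team, current, reserved)
```

With two seats reserved, only 23 of the 30 team members fit, so the remaining 7 end up in `skipped`, which makes the limit visible instead of silently dropping people.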

@MattIPv4
Member Author

I think there are two categories of reason for having access (though StatusPage access just gives you access to everything):

  • Maintaining the status page itself (styling, components). Currently me.
  • Managing incidents. Lots of people?

@sam-github
Contributor

It's the "what is an incident" that needs description. Would a missing download file (because that platform failed CI and was decided to be released later) be an incident? What about nodejs/node#32914 ?

@MattIPv4
Member Author

Yeah, absolutely. I think a lot of that comes down to human judgement of each situation, but I guess rather vaguely, I'd probably define an incident as a public-facing issue (optionally, that people are going to notice or is going to affect a lot of people).

So, for example, a CI machine crashing wouldn't really be an incident that requires posting on the status page and sending notifications out to the community, but downloads failing (which started this thread) would be something that should have an incident made for and notifications sent out.
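
Purely as a sketch of that rough definition, and not a proposed policy, the rule could be written down like this. The `Issue` fields and the `require_wide_impact` switch are hypothetical names; human judgement would still apply in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    summary: str
    public_facing: bool           # visible to users outside the project?
    widely_noticed: bool = False  # likely to be noticed by / affect many people?


def needs_status_page_incident(issue, require_wide_impact=False):
    """Apply the rough definition: public-facing, optionally with wide impact."""
    if not issue.public_facing:
        return False
    if require_wide_impact:
        return issue.widely_noticed
    return True


# The two examples from this comment.
ci_crash = Issue("CI machine crashed", public_facing=False)
downloads = Issue("Downloads intermittently failing",
                  public_facing=True, widely_noticed=True)

ci_needs_incident = needs_status_page_incident(ci_crash)
downloads_need_incident = needs_status_page_incident(downloads)
```

The edge cases raised earlier, such as a download file that is missing because a platform's release was deferred, are exactly where a simple rule like this stops being enough and a written policy is needed.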

@nschonni
Member

Should an issue be opened on https://github.com/nodejs/nodejs.org to include this in the footer somewhere, or is there more to work out first?

@MattIPv4
Member Author

I'll raise a draft PR there now to add links & embed, though I don't expect it to be merged until we're far more confident with using the page (access sorted, policy etc.).

@BernsteinA

I haven't read any of the other comments but @bnb said I should post this here:
Please consider registering a new domain name for the status page. c.f. status.github.com -> githubstatus.com, in case there is an outage that prevents people from accessing nodejs.org and its subdomains

@MattIPv4
Member Author

I've created two dedicated issues to better house discussion on these topics as I imagine folks will have a lot to say and discuss:

@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@bnb

bnb commented Apr 23, 2021

worth reopening @nodejs/build? I would love to see this if it's possible. Also happy to help set it up if that would be helpful.
