Umbrella issue for HA and HA upgrades #1733
Comments
Actually I should just have added everyone to the party @kubernetes/sig-cluster-lifecycle
This is generic across many database systems: a new version may write a data format that the old version cannot understand as new features are added.
A common pattern is to back up the data periodically, not only around upgrades; in case of a disaster you can restore the etcd cluster from a previous backup. Another common pattern is that open source database systems get upgraded fairly often upstream while most production environments keep running the older version for a while, which lets the broader community test the new version and shake out bugs before it finally rolls out to production.
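For example, a minimal backup/restore sketch assuming the etcd v3 API and etcdctl on a member host (the endpoint and file paths are illustrative):

```sh
# Take a snapshot of the keyspace, e.g. before an upgrade.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  snapshot save /var/backups/etcd-backup.db

# Verify the snapshot.
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-backup.db --write-out=table

# In a disaster, restore into a fresh data directory and restart the
# member (or a new cluster) pointing at it.
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-backup.db \
  --data-dir=/var/lib/etcd-restored
```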
@chrislovecnm Awesome summary, and thanks for taking the lead on getting the issues raised, documented, and vetted. I definitely want to try the upgrade process out on some clusters. I just noticed @hongchaodeng's slides, and something interesting on Slide 3 about mixed clusters that I want to learn more about:
There is probably some etcd documentation that talks about how mixed clusters that jump more than a minor version are bad, but this is good to know.
@kubernetes/docs can someone assign this to me?
@gmarek you cannot cc people and not comment.... SGTM
Thanks @dims
@roberthbailey @gmarek @thockin ~ I still have a bunch of open questions about HA specifics within k8s. Who can assist?
The PR with the user doc for the HA master can be found here: #1810
Thanks! I am mostly looking for details on how specific k8s components run in HA, as I mentioned above.
Bump - I know everyone is swamped with the release
@smarterclayton / @bprashanth ~ @jbeda mentioned that you may be able to assist with some of these questions, specifically basic HA and the open questions that I have. I know that a couple of folks on the HA team are on leave, and I wanted to reach out to you. Thanks, Chris
@chrislovecnm Can you please say which information in
Also, there's a design doc which covers this with details: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/ha_master.md |
So those docs are awesome, but that is how you deploy HA on GCE, not really how HA works. Meaning, given https://raw.githubusercontent.com/kubernetes/kubernetes.github.io/3b572b567f655b771e4ff8a03a4f8baf9e094511/docs/admin/ha-master-gce.png, how are those components actually HA? Here are my open questions from above:
How do all of those components implement HA? What happens in a failure? What does recovery look like? Based on that, how do I upgrade? Do I drain the master? Think of it from a user's perspective: do I need two controllers or do I need three? Well, I need 3 etcd for sure... Thanks, Chris
@chrislovecnm Answers to all of those questions ARE in the docs I've sent you. Only the implementation is GCE specific, but the design doc doesn't talk about GCE (except for a few places where we say which option we will choose for GCE, e.g. load balancing).
"Think of it from a user's perspective":
Does that answer your questions? I still don't understand what else is missing. Should the doc be restructured somehow to make it easier to find the answers?
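For what it's worth, a minimal sketch of the usual pattern for the non-etcd control plane components, assuming the standard upstream flags (exact defaults vary by release): the API servers are stateless and sit behind a load balancer, while the controller manager and scheduler run on every master with leader election so only one instance is active at a time.

```sh
# Run on every master; the replicas compete for a lock and only the
# elected leader does work, so a master failure triggers a re-election.
kube-controller-manager --leader-elect=true   # ...other flags omitted
kube-scheduler --leader-elect=true            # ...other flags omitted

# kube-apiserver has no leader election; run one per master behind a
# load balancer (or a shared DNS name / virtual IP) for clients and kubelets.
```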
@fgrzadkowski let me review again. I am reading this from a different perspective, since I am thinking about an upgrade as well.
@fgrzadkowski thanks btw... I appreciate your patience and help with this.
A note to myself: add a section about how the active component is found by looking at the endpoints in kube-system via kubectl.
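A sketch of what that section could show, assuming the endpoints-based leader election in use at the time (the annotation name may differ between releases):

```sh
# The active controller manager / scheduler records itself in an
# annotation on an Endpoints object in kube-system; look for the
# control-plane.alpha.kubernetes.io/leader annotation in the output.
kubectl get endpoints kube-controller-manager -n kube-system -o yaml
kubectl get endpoints kube-scheduler -n kube-system -o yaml
```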
Chris, is this an issue you're tracking on the site? What are the next steps for you? The SIG Docs team is going through all of the docs issues and I want to make sure this is actually being worked on; otherwise I'm going to close it. Thanks!
I've rewritten a doc about setting up HA clusters in general: #2941
A couple of other docs that have been recently discussed in SIG Cluster Lifecycle that are relevant here:
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
ping @jszczepkowski @fgrzadkowski @chrislovecnm @kubernetes/sig-cluster-lifecycle-pr-reviews for a status on this
/lifecycle frozen
The HA and HA etcd documentation received an overhaul in 1.11; all the new edits were the result of combined SIG efforts and were based on existing proposals for improvements. Closing until further notice.
Making one home to review:
#1531 and #1530
cc: @roberthbailey @gmarek @thockin @kris-nova @justinsb @brandoncole @hongchaodeng
Great meeting you guys face to face in Seattle. The conversations about HA were really enjoyable, and I realized that I am missing some key points of exactly how a master runs in HA.
I would like to update the k8s website documentation for HA. After that, I will be writing base documentation for HA upgrades. Do we have documentation on how each component in the master handles leader election, and how each component is HA? I am guessing not in detail, so here are the components that I am documenting; I have open questions and could really use a second set of eyes.
HA
These components are well documented:
Here are the open questions:
Also found https://github.com/kubernetes/kubernetes/blob/master/docs/design/control-plane-resilience.md
Upgrades
Nodes are pretty simple: kubectl drain, create a new node, wash, rinse, repeat. Do we upgrade the masters and then the nodes? That is the pattern that GKE follows. In regards to each component, "where be dragons" with the masters in regards to upgrades? I am thinking that kubectl drain is going to handle all of the components except for etcd. What about federation? Etcd upgrades are documented here: https://coreos.com/etcd/docs/2.3.7/upgrade_2_3.html
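A rough sketch of that node drain-and-replace flow (the node name is illustrative; --ignore-daemonsets is typically needed when daemonset-managed pods are running on the node):

```sh
# Evict workloads and mark the node unschedulable.
kubectl drain node-1 --ignore-daemonsets

# Upgrade the node (or bring up a replacement), then let it back in...
kubectl uncordon node-1

# ...or, if the node was replaced, remove the old Node object instead.
# kubectl delete node node-1
```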
Any ideas about the fact that a 2.3 -> 3.0 etcd upgrade cannot be rolled back? Any patterns we could use? Backups, but lions and tigers and bears, oh my. I am wondering about bumping the masters up from 3 to 6, upgrading only the three new ones, verifying that they are running correctly, and then removing the old nodes. From the etcd docs:
“If all members have been upgraded to v3.0, the cluster will be upgraded to v3.0, and downgrade from this completed state is not possible. If any single member is still v2.3, however, the cluster and its operations remains “v2.3”, and it is possible from this mixed cluster state to return to using a v2.3 etcd binary on all members”. Does this same pattern work with the controller manager, api, and scheduler?
Or we could just upgrade two masters, let it sit, and then upgrade the third? I am wondering if this is another reason to move etcd to its own server. We do immutable servers, so upgrading a single component does not work.
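A sketch of how one might check the mixed-cluster state before removing the old members (assumes etcdctl is run against a cluster member; the v2-era commands are shown, since the cluster reports itself as v2.3 until every member is upgraded):

```sh
# List all members and confirm both old and new members are present.
etcdctl member list

# Check that every member is healthy before removing anything.
etcdctl cluster-health

# Once the upgraded members look good, remove an old member by its ID.
etcdctl member remove <member-id>
```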
Thanks in advance!