Kops fails to join new master when 1 master lost etcd events data volume #9264
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Related issue on the
@ivnilv this was a failure scenario we tested out as well, with the same resolution of manual intervention.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
I ran into this issue, and after trying different approaches I can say the safest way, in case you lose the etcd main or etcd events data, is to restore it via etcd-manager-ctl. In my case I restored the most recent backup to fix the issue. You can run these commands from anywhere as long as you have access to the S3 bucket used by kops.
You can also read more in the kops docs.
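A minimal sketch of that restore flow, based on the kops disaster-recovery docs. The S3 bucket path and backup timestamp below are placeholders, and the exact path layout (`<state-store>/<cluster-name>/backups/etcd/<main|events>`) may differ in your setup:

```shell
# Assumed placeholder: the backup store for the "events" etcd cluster.
BACKUP_STORE=s3://my-kops-state-store/cluster.example.com/backups/etcd/events

# 1. List the available backups and pick the most recent one.
etcd-manager-ctl --backup-store="${BACKUP_STORE}" list-backups

# 2. Schedule a restore of the chosen backup (name is a placeholder).
#    etcd-manager picks the restore command up on its next reconcile loop.
etcd-manager-ctl --backup-store="${BACKUP_STORE}" restore-backup \
  2020-06-10T12:00:00Z-000001

# 3. Restart etcd on the masters (e.g. roll the master instance groups)
#    so the restore takes effect, then verify the cluster is healthy.
kubectl get nodes
```

Note that `restore-backup` only records a restore request in the backup store; the actual restore happens when etcd-manager on the masters processes it.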
1. What kops version are you running? The command kops version will display this information.
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Deleted the etcd-events data volume on 1 of 3 masters.
5. What happened after the commands executed?
New masters in the instanceGroup that had the etcd-events data removed no longer join the cluster.
6. What did you expect to happen?
etcd-manager to remove the unhealthy member and add the new member without any manual intervention required.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
In order to fix the issue on our end, we had to exec into one of the other two working etcd-events members and run the etcdctl member remove command to remove the broken node from the list of members. Next, we increased the number of instances for the instanceGroup from 0 to 1 to create a new node, and the rest was taken care of automatically by kops and etcd-manager: the new node joined the cluster properly.
The question is: shouldn't this be done automatically by kops or etcd-manager in some step?