etcd-manager can get into a state where it doesn't start etcd on some nodes #288
Logs from the 2c node that isn't coming up:
(The last two lines continue printing every few seconds.)
cc @justinsb
It seems that the etcd-manager leader is supposed to tell 2c to start, but it doesn't, since 2c thinks that etcd has already started and is just unhealthy.
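Roughly, the suspicion is that the leader only issues a start/join command to peers that report no etcd process at all, so a peer whose etcd is running but unhealthy never receives one. A caricature of that suspected decision logic, with entirely hypothetical types (this is not etcd-manager's actual code):

```go
package main

import "fmt"

// peerState is a hypothetical summary of what each peer reports to the
// etcd-manager leader; these types do not exist in etcd-manager itself.
type peerState struct {
	name        string
	etcdRunning bool // the peer reports an etcd process
	etcdHealthy bool // that etcd responds to health checks
}

// shouldSendStart mirrors the suspected (buggy) rule: only tell a peer to
// start etcd if it reports no process at all. A peer stuck with a
// running-but-unhealthy etcd therefore never gets told to (re)start.
func shouldSendStart(p peerState) bool {
	return !p.etcdRunning
}

func main() {
	peers := []peerState{
		{name: "etcd-a", etcdRunning: true, etcdHealthy: true},
		{name: "etcd-b", etcdRunning: true, etcdHealthy: true},
		// The 2c node: etcd "has already started" but is unhealthy.
		{name: "etcd-c", etcdRunning: true, etcdHealthy: false},
	}
	for _, p := range peers {
		fmt.Printf("%s: healthy=%v sendStart=%v\n", p.name, p.etcdHealthy, shouldSendStart(p))
	}
}
```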
I faced the same issue. Is there any workaround?
I have the same issue with etcd-events...
Plus one here, just hit this... any workaround would be appreciated.
@pkutishch we have found a fix.
I think this is because etcd sees the member as an already existing cluster member and just tries to join it, instead of checking the data and rejoining the node as a new one to resync the data.
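If that diagnosis is right, the generic etcd recovery for a member whose data is gone is to remove the stale member from the cluster and add it back so it resyncs from scratch. A hedged sketch with the etcd v3 Go client follows; the endpoints, member name, and peer URL are placeholders, and this is the standard etcd member workflow, not necessarily the fix referred to above:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the healthy members only; endpoints are placeholders
	// (add TLS config if the cluster requires client certificates).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-a.internal:4001", "https://etcd-b.internal:4001"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Find the stale member (the node whose data directory was lost).
	members, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	var staleID uint64
	for _, m := range members.Members {
		if m.Name == "etcd-c" { // placeholder name of the broken member
			staleID = m.ID
		}
	}
	if staleID == 0 {
		panic("stale member not found")
	}

	// Remove it, then add it back so it rejoins as a brand-new member
	// and resyncs its data from the remaining members.
	if _, err := cli.MemberRemove(ctx, staleID); err != nil {
		panic(err)
	}
	resp, err := cli.MemberAdd(ctx, []string{"https://etcd-c.internal:2380"}) // placeholder peer URL
	if err != nil {
		panic(err)
	}
	fmt.Printf("re-added member with new ID %x; start etcd on that node with --initial-cluster-state=existing\n", resp.Member.ID)
}
```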
@Dimitar-Boychev Here is the thing.
As a solution I started etcd manually in the etcd-manager container. Honestly, I expected the etcd process to throw an error during startup, but it started fine and brought the node into the cluster. Restarting etcd-manager didn't work, though, even with attempts to upgrade the version. Interestingly, this happened on only one node out of three.
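For reference, a minimal sketch of what "starting etcd manually" might look like. The flags below are standard etcd flags, but every value (member name, URLs, data directory, binary path) is a placeholder that has to come from your own cluster spec, and TLS certificate flags are omitted; the Go wrapper is only for illustration:

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// All values are placeholders: take the member name, peer/client URLs,
	// data directory, and certificate flags from your own cluster.
	args := []string{
		"--name=etcd-c",
		"--data-dir=/rootfs/mnt/etcd-data", // placeholder path
		"--initial-cluster=etcd-a=https://etcd-a.internal:2380,etcd-b=https://etcd-b.internal:2380,etcd-c=https://etcd-c.internal:2380",
		// "existing" because the member is rejoining a cluster that is
		// already running, rather than bootstrapping a new one.
		"--initial-cluster-state=existing",
		"--initial-advertise-peer-urls=https://etcd-c.internal:2380",
		"--listen-peer-urls=https://0.0.0.0:2380",
		"--advertise-client-urls=https://etcd-c.internal:4001",
		"--listen-client-urls=https://0.0.0.0:4001",
	}

	cmd := exec.Command("/opt/etcd/etcd", args...) // placeholder binary path
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```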
We hit this bug a few days ago. Due to the incident we had to delete the VM and lost all data stored on the attached volumes (for both the events and main etcd instances). Since then we cannot get the one newly provisioned master node to join the cluster. Terminating the failed master doesn't work for us. And to start etcd manually you need to have all the pieces together, but in our case we don't have all the necessary keys and certs. Does anyone know how we can recreate them?
@pkutishch I think that by restarting the etcd-manager you put yourself in the same situation as before.
If I am right and the above steps are what you think you did...
All of this is just based on our easy way to recreate this problem: going onto one etcd node and deleting all the data in the etcd data directory :)
Currently seeing our cluster has gotten into a state where its cluster state knows about all three members, but marks one as unhealthy because it's not responding to etcd checks. However, the reason it's not responding is that the gRPC command to join the cluster hasn't been initiated, because it already knows the member exists.
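A quick way to see this state from outside etcd-manager is to ask etcd itself which members it knows about and which endpoints actually answer. A minimal sketch with the etcd v3 Go client (endpoints are placeholders; add TLS configuration if your cluster requires client certificates):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder client URLs for the three members.
	endpoints := []string{
		"https://etcd-a.internal:4001",
		"https://etcd-b.internal:4001",
		"https://etcd-c.internal:4001",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Which members does the cluster's own state know about?
	members, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	for _, m := range members.Members {
		fmt.Printf("member %x name=%q peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
	}

	// Which endpoints actually respond? The stuck member still appears in
	// the member list above but fails (or times out) here.
	for _, ep := range endpoints {
		sctx, scancel := context.WithTimeout(context.Background(), 3*time.Second)
		st, err := cli.Status(sctx, ep)
		scancel()
		if err != nil {
			fmt.Printf("%s: not responding: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: responding, leader=%x, raftIndex=%d\n", ep, st.Leader, st.RaftIndex)
	}
}
```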
Of note is that this host runs two instances of etcd-manager, one for events and one for main Kubernetes objects. Only one of the instances is "broken".
Log excerpt from etcd-manager leader: