OpenStack: etcd-manager does not always mount volumes resulting in invalid cluster creation #11323
Comments
Can you post etcd-manager logs from a host in which the issue occurs? |
Of course, here you go. This is etcd.log. The same thing can happen with etcd-events.log; it's either etcd or etcd-events that gets into trouble.
I think 'executable not found in path' points to udevadm (called from probeVolume in openstack/volumes.go), which works fine when I log in to the server with SSH (it lives in /usr/bin). By the way, I see the 'executable file not found' error in every single etcd log, including the successful ones, but on one node it repeats only once and on another 20 times before things start working. On a failing node it never stops repeating. The nodes run Debian 10, which should be the same image used in the kops OpenStack tests afaik. |
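For anyone trying to reproduce this by hand, a quick way to compare what the binary lookup sees inside the etcd-manager container versus on the host is a check like the one below. This is only a sketch; the pod name is a placeholder and may differ per cluster.

```bash
# On the affected master node, confirm udevadm exists on the host.
command -v udevadm          # typically /usr/bin/udevadm

# From a machine with cluster access, check the same inside the etcd-manager
# container (the pod name below is a placeholder; adjust it to your cluster).
kubectl -n kube-system exec etcd-manager-main-<master-node-name> -- \
  sh -c 'echo "PATH=$PATH"; command -v udevadm || echo "udevadm not on PATH"'
```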
@kciredor I can create an etcd-manager image that may fix the issue if you want to try. |
@kciredor If you don't mind, could you test using this image?
|
Thanks @hakman! My first test run is fine, I'll try again a couple of times today. |
Great, thanks :). |
I'm afraid the second attempt resulted in another scanning busses issue:
Manually running sudo udevadm trigger on the 2 out of 7 master nodes that run into this problem gives me exit 0. Not sure how to debug this. |
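One way to see whether the trigger actually produces the expected device nodes, rather than just exiting 0, is to watch the uevents and then check /dev/disk/by-id. A debugging sketch to run on the affected master node:

```bash
# Watch kernel/udev events in a second terminal while re-triggering:
#   sudo udevadm monitor --kernel --udev

# Re-trigger block device events and wait for the udev queue to drain.
sudo udevadm trigger --subsystem-match=block
sudo udevadm settle

# The Cinder volume should now show up both as /dev/vdX and as a by-id symlink.
ls -l /dev/disk/by-id/ | grep virtio
```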
If you exec from the pod |
Good one ;-)
Still interesting to me how one etcd-manager pod becomes healthy, and the other etcd-manager pod is missing a binary or library. |
Same here, but no idea how to reproduce and check that. One more: |
From the pod:
/var/log/etcd.log now runs into a loop:
|
etcd-manager runs this with
|
Using that command gives me exit 0 (the loop is still active though; it keeps logging "Waiting for volume"). |
The first one is etcd1-main which is having trouble getting mounted on the master node. The second one is etcd1-events, which is mounted on another node. Is that 'normal'? |
I don't think so, but I am not that familiar with the openstack implementation. Maybe @olemarkus has some thoughts. |
Sorry, I am not that familiar. My guess is that there is a mismatch with tags. I think perhaps @zetaab is the one with the most knowledge of how this works now. |
If you have only one availability zone for computes + volumes, this is possible and normal. With multiple availability zones for computes + volumes it is not normal and should not happen. Which case matches your setup? I don't remember right now how these tags work and what kind of values they can have, but I am pretty sure this is (one of) your problems. If you have only one availability zone, I suspect all 3 masters are waiting for the "same volume", because the tags are the same on all of them? Can you check |
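In case it helps with the check asked for above, something along these lines should show the relevant metadata, assuming the standard OpenStack CLI and credentials for the project (the volume name filter is an assumption):

```bash
# List the volumes kops created for the cluster (the name filter is an assumption).
openstack volume list --long | grep etcd

# Show the tags/properties, availability zone and attachment of each etcd volume.
openstack volume show <volume-id-or-name> \
  -c name -c availability_zone -c properties -c attachments -f yaml
```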
This is a single availability zone indeed. Here's the output of all the volume show commands; one out of three masters does not become healthy:
I just double-checked, and the master node that can't mount the disk actually has the volume properly attached to it. All 6 volumes are attached to the 3 masters. Trouble starts right after
I can actually manually (The issue started with kops 1.19 coming from 1.18) |
I am sorry, I have no idea what is wrong. Anyway, etcd-manager is not using volumes from /dev/vd*; it is using those under /dev/disk/by-id. Can you check the code at https://github.com/kopeio/etcd-manager/blob/e6cc5c083a951d119c1aee44d379322e2f9ce08e/pkg/volumes/openstack/volumes.go#L346-L373 and verify whether you can find the volumes under that path? This code is similar to what exists in https://github.com/kubernetes/cloud-provider-openstack/blob/ef0206889cb7c39f04d215e8cbb6c69616e59ea4/pkg/util/mount/mount.go#L147-L179, so that should not work for you either in that case. |
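A quick way to verify that path on the affected node: for virtio disks the by-id symlink is normally derived from the device serial, which for Cinder volumes is, as far as I know, the volume UUID truncated to 20 characters, so it can be cross-checked against the openstack volume show output. A sketch; the truncation detail is an assumption worth verifying.

```bash
# On the master node (or via /rootfs from the container): list the virtio by-id links.
ls -l /dev/disk/by-id/ | grep virtio-

# Cross-check against the Cinder volume ID; the serial is typically the UUID
# truncated to 20 characters.
VOLUME_ID=<cinder-volume-uuid>            # placeholder
ls -l "/dev/disk/by-id/virtio-$(echo "$VOLUME_ID" | cut -c1-20)"
```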
Example run: etcd-main is properly mounted on the node in /mnt, but etcd-events is unable to mount. /var/log/etcd-events.log shows:
From the etcd-events container, the output @zetaab requested:
Now a manual mkfs.ext4 from the node with /dev/vdd or from the container with /rootfs/dev/vdd results in a clean format of an unformatted volume:
Looks like "safe-format-and-mount" then continues immediately, and /var/log/etcd-events.log now shows:
Extra, listing of /dev/vdX on the etcd-main container:
Extra, listing of /dev/vdX on the etcd-events container:
Even though I've been looking at it for a while now and am able to make my own changes to etcd-manager, this is hard to debug. All I know for sure is that a manual mkfs.ext4 immediately results in a happy flow, but I cannot find any log output regarding a failed mkfs.ext4. |
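To narrow down why the automatic format never happens, it may help to look at what the device reports before any manual mkfs: safe-format-and-mount style logic generally probes for an existing filesystem first and only formats when nothing is found. A hedged check to run on a stuck node before touching the volume by hand:

```bash
# If these report no filesystem, the device should be a candidate for formatting.
sudo blkid /dev/vdd || echo "no filesystem signature found"
sudo lsblk -f /dev/vdd
sudo file -s /dev/vdd        # "data" usually means unformatted

# The manual workaround observed above (only for testing, it wipes the volume):
#   sudo mkfs.ext4 /dev/vdd
```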
Interesting. I'm using the new image now. But this time, on the master node, there was no /dev/vdc, only /dev/vdd (and the corresponding /dev/disk/by-id/virtio-xxx symlinks). Running udevadm trigger on the master node itself made the missing device node and by-id symlink appear, resulting in the etcd-events volume being mounted immediately next to the existing etcd-main. The logs reflect this as well, after a long loop of scanning busses:
In the etcd-events container udevadm is available as well. It kind of looks like udevadm trigger from the container, which is what actually happens in the code, should have done the trick but did not. So with the 'old' image I have a does-not-mkfs.ext4 problem, and with the new image a udevadm-does-not-work problem, it appears ;-) |
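One possible explanation worth checking (an assumption on my side, not a confirmed diagnosis): udevadm trigger works by writing to the uevent files under /sys, so if the container's /sys is read-only or is not the host's /sys, the trigger can silently do nothing even though the binary is present and exits 0. A comparison sketch:

```bash
# Inside the etcd-manager container: is /sys writable, and does a dry run
# enumerate any block devices at all?
mount | grep ' /sys '
udevadm trigger --dry-run --verbose --subsystem-match=block | head

# On the host, the same dry run should list the /sys/devices/... block devices.
sudo udevadm trigger --dry-run --verbose --subsystem-match=block | head
```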
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
Just a heads up now that lifecycle/stale was applied: this is still an issue for us when upgrading to kOps 1.20.2 and etcd-manager 1.20.2. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen this is still a thing with latest kOps on OpenStack. |
/reopen |
@hakman: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1. What kops version are you running? The command kops version will display this information.
1.19.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
1.19.9
3. What cloud provider are you using?
OpenStack
4. What commands did you run? What is the simplest way to reproduce this issue?
kops create cluster, kops update cluster --yes
5. What happened after the commands executed?
Roughly one out of two times, one (and only one) of the masters does not become healthy. It appears one of the two etcd volumes is not being formatted and mounted on the master node by etcd-manager.
6. What did you expect to happen?
Every cluster creation should result in every master having two etcd volumes resulting in a valid cluster.
9. Anything else we need to know?
This issue started with kops 1.19; with kops 1.18 it was never there. By hardcoding other etcd-manager versions into my kops binary and hosting the etcd-manager Docker images myself, I was able to pinpoint the start of the issue, between etcd-manager versions 3.0.20200531 (kops 1.18) and 3.0.20210122 (kops 1.19), to the exact version 3.0.20201117, which starts creating failed clusters.
I'm having a hard time finding out which exact commit in etcd-manager 3.0.20201117 introduced the problem, though. I've reverted some commits and did some testing; these are the ones I suspected but that ended up not being the troublemakers:
It's quite a big list of changes between 3.0.20200531 and 3.0.20201117 ;-) Anyone got an idea what's going on here? (A bisect sketch follows below.)
Cheers,
kciredor
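If anyone wants to narrow this down further, a git bisect between the two releases may be quicker than reverting individual commits. A rough sketch, assuming the etcd-manager releases are tagged with their version strings and that a test image can be built and pushed for each step:

```bash
git clone https://github.com/kopeio/etcd-manager && cd etcd-manager

# Tag names are assumptions; adjust to whatever the repository actually uses.
git bisect start
git bisect bad  3.0.20201117      # first version known to create failing clusters
git bisect good 3.0.20200531      # last version known to be fine

# At each step: build an image, point a test cluster at it, run kops create/update,
# and record the result:
#   git bisect good   # cluster came up healthy
#   git bisect bad    # a master failed to mount one of its etcd volumes
```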