
OpenStack: etcd-manager does not always mount volumes resulting in invalid cluster creation #11323

Closed
kciredor opened this issue Apr 26, 2021 · 35 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@kciredor

1. What kops version are you running? The command kops version will display
this information.

1.19.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.19.9

3. What cloud provider are you using?
OpenStack

4. What commands did you run? What is the simplest way to reproduce this issue?
kops create cluster, kops update cluster --yes

5. What happened after the commands executed?
About one out of two times, one (and only one) of the masters does not become healthy. It appears one of its two etcd volumes is not being formatted and mounted on the master node by etcd-manager.

6. What did you expect to happen?
Every cluster creation should result in every master having both etcd volumes mounted, resulting in a valid cluster.

9. Anything else we need to know?
This issue started with kops 1.19; with kops 1.18 it was never there. By hardcoding other etcd-manager versions into my kops binary and hosting the etcd-manager Docker images myself, I was able to narrow the start of the issue between etcd-manager versions 3.0.20200531 (kops 1.18) and 3.0.20210122 (kops 1.19) down to exactly 3.0.20201117, which is the first version that creates failed clusters.

I'm having a hard time finding out which exact commit in etcd-manager 3.0.20201117 introduced the problem, though. I've reverted some commits I suspected and did some testing, but none of them turned out to be the troublemakers.

It's quite a big list of changes between 3.0.20200531 and 3.0.20201117 ;-) Anyone got an idea what's going on here?
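In case it helps anyone bisecting this further, here is a minimal git bisect sketch, assuming you can build and host an etcd-manager image from an arbitrary commit and that the release tags below exist under those names in the repo:

cd etcd-manager
git bisect start
git bisect bad 3.0.20201117    # first release observed to create failed clusters
git bisect good 3.0.20200531   # last release known to work for me
# At each step: build and push an image from the checked-out commit, point kops at it,
# create a cluster, then record the outcome with `git bisect good` or `git bisect bad`.

Since the failure only happens about every other run, each step probably needs a few cluster creations before a commit can be called good.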

Cheers,
kciredor

@rifelpet
Member

Can you post etcd-manager logs from a host in which the issue occurs?

@kciredor
Author

kciredor commented Apr 26, 2021

Of course, here you go. This is etcd.log; the same thing can happen with etcd-events.log. It's either etcd or etcd-events that gets into trouble.

etcd-manager log:
I0426 13:18:09.656532    6229 volumes.go:206] Found project="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
I0426 13:18:09.656658    6229 volumes.go:215] Found instanceName="master-amsterdam1-7-1-dv-dev-k8s-local"
I0426 13:18:09.955072    6229 volumes.go:235] Found internalIP="10.0.32.64" and zone="amsterdam1"
I0426 13:18:09.955126    6229 main.go:291] Mounting available etcd volumes matching tags [KubernetesCluster=dv-dev.k8s.local k8s.io/etcd/main k8s.io/role/master=1]; nameTag=k8s.io/etcd/main
I0426 13:18:09.955162    6229 volumes.go:309] Listing Openstack disks in xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/amsterdam1
I0426 13:18:10.381018    6229 mounter.go:294] Trying to mount master volume: "d024da28-4963-40d6-9b69-1ed7cdb6a669"
W0426 13:18:10.902020    6229 mounter.go:299] Error attaching volume "d024da28-4963-40d6-9b69-1ed7cdb6a669": error attaching volume d024da28-4963-40d6-9b69-1ed7cdb6a669 to server ea1c063d-3512-4f0f-b584-1da32ec33505: Bad request with: [POST https://xxxxx:8774/v2.1/servers/ea1c063d-3512-4f0f-b584-1da32ec33505/os-volume_attachments], error message: {"badRequest": {"code": 400, "message": "Invalid input received: Invalid volume: Volume d024da28-4963-40d6-9b69-1ed7cdb6a669 status must be available or downloading to reserve, but the current status is in-use. (HTTP 400) (Request-ID: req-f5bd664a-a8b3-4fdb-a7aa-f74aa6c19658)"}}
I0426 13:18:10.902165    6229 mounter.go:294] Trying to mount master volume: "36f07f1c-f3e7-461a-89eb-d45bf39240c3"
I0426 13:18:11.379120    6229 mounter.go:308] Currently attached volumes: [0xc0006d2480]
I0426 13:18:11.379507    6229 mounter.go:72] Master volume "36f07f1c-f3e7-461a-89eb-d45bf39240c3" is attached at "/dev/vdd"
I0426 13:18:11.379617    6229 mounter.go:86] Doing safe-format-and-mount of /dev/vdd to /mnt/master-etcd-6.etcd-main.dv-dev.k8s.local
I0426 13:18:11.379893    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:11.698782    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:18:17.699193    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:18.026724    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:18:24.027130    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:24.354981    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:18:30.355463    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:30.682683    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:18:36.683407    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:37.018652    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:18:43.019156    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:43.346802    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:18:49.347489    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:49.670824    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:18:55.671283    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:18:55.998525    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:01.999139    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:02.336105    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:08.336694    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:08.672517    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:14.673101    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:14.994942    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:20.995327    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:21.343629    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:27.344021    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:27.662865    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:33.663280    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:33.982831    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:39.983366    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:40.316346    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:46.316810    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:46.634795    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:52.635202    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:52.963100    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:19:58.963559    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:19:59.310110    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:05.310743    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:05.638669    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:11.640356    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:11.973164    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:17.973595    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:18.323111    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:24.323493    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:24.646817    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:30.647255    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:30.972878    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:36.973414    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:37.298783    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:43.299199    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:43.631661    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:43.631747    6229 mounter.go:121] Waiting for volume "36f07f1c-f3e7-461a-89eb-d45bf39240c3" to be mounted
I0426 13:20:44.632066    6229 volumes.go:408] Could not find device path for volume; scanning buses
I0426 13:20:44.958943    6229 volumes.go:410] Error scanning buses: executable file not found in $PATH
I0426 13:20:50.959496    6229 volumes.go:408] Could not find device path for volume; scanning buses

<cut: from there the scanning buses / waiting to be mounted keeps repeating itself>

The 'executable file not found in $PATH' error points, I think, to udevadm (called from probeVolume in openstack/volumes.go), which works fine when logging in to the server with SSH (it lives in /usr/bin).

I do see the 'executable file not found' error in every single etcd log btw, including the ones that succeeded, but on one node it repeats only once and on another 20 times before things start working. On a failing node it never stops repeating.

Nodes are running Debian 10, which should be the same as used in the kops OpenStack tests, afaik.
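To pin down where the lookup fails, I can compare the host and the container, something like this (just a sketch; the container name is a placeholder and I'm assuming a Docker runtime):

# On the master node itself:
command -v udevadm && udevadm --version

# Inside the etcd-manager container (find the name with docker ps):
docker exec <etcd-manager-container> sh -c 'echo $PATH; command -v udevadm || ls -l /sbin/udevadm'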

@hakman
Member

hakman commented Apr 26, 2021

@kciredor I can create an etcd-manager image that may fix the issue, if you want to try it.
Probably udev needs to be added to this list:
https://github.com/kubernetes-sigs/etcdadm/blob/c09b9e2d7cca67372328772f483c1f3249f25247/etcd-manager/images/base/BUILD#L11-L24.

@hakman
Member

hakman commented Apr 28, 2021

@kciredor If you don't mind, could you test using this image?

gcr.io/k8s-staging-etcdadm/etcd-manager:v20210428-v0.1.3-729-g7530bcea

@kciredor
Author

Thanks @hakman! My first test run is fine, I'll try again a couple of times today.

@hakman
Member

hakman commented Apr 28, 2021

Great, thanks :).

@kciredor
Author

I'm afraid the second attempt resulted in another scanning buses issue:

I0428 09:14:39.928221    5379 volumes.go:419] Error scanning buses: exit status 127
I0428 09:14:45.929438    5379 volumes.go:417] Could not find device path for volume; scanning buses

Manually running sudo udevadm trigger on the 2 out of 7 master nodes that run into this problem gives me exit 0. Not sure how to debug this.

@hakman
Member

hakman commented Apr 28, 2021

If you exec /sbin/udevadm from the pod, does it do anything?

@kciredor
Author

kciredor commented Apr 29, 2021

Good one ;-)

/sbin/udevadm: error while loading shared libraries: libkmod.so.2: cannot open shared object file: No such file or directory

Still interesting to me how one etcd-manager pod becomes healthy, and the other etcd-manager pod is missing a binary or library.
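To see everything that's missing (not just libkmod), I can list the unresolved shared libraries from inside the pod; a sketch, assuming ldd is available in the image:

# Inside the etcd-manager container: list the shared libraries udevadm needs but the image cannot resolve
ldd /sbin/udevadm | grep 'not found'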

@hakman
Member

hakman commented Apr 29, 2021

Still interesting to me how one etcd-manager pod becomes healthy, and the other etcd-manager pod is missing a binary or library.

Same here, but no idea how to reproduce and check that.

One more:
gcr.io/k8s-staging-etcdadm/etcd-manager:v20210429-v0.1.3-733-g4703d502

@kciredor
Author

From the pod:

# /sbin/udevadm trigger
Running in chroot, ignoring request.

/var/log/etcd.log now runs into a loop:

I0429 11:12:04.081904    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:10.411634    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:16.748487    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:23.084319    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:29.417736    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:29.756037    5665 mounter.go:121] Waiting for volume "b5162d3c-e1f6-45c9-b5bd-14ea4ba485d2" to be mounted
I0429 11:12:30.756964    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:37.096833    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:43.445428    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:49.789934    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:12:56.118686    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:13:02.464056    5665 volumes.go:417] Could not find device path for volume; scanning buses
I0429 11:13:08.796976    5665 volumes.go:417] Could not find device path for volume; scanning buses

@hakman
Member

hakman commented Apr 29, 2021

etcd-manager runs this with nsenter, so you would not see "Running in chroot, ignoring request.":

nsenter --mount=/rootfs/proc/1/ns/mnt -- /sbin/udevadm trigger

@kciredor
Author

Using that command gives me exit 0 (the loop is still active though, it keeps logging "Waiting for volume").

@kciredor
Author

openstack volume list

| b5162d3c-e1f6-45c9-b5bd-14ea4ba485d2 | etcd-1.etcd-main.dv-dev.k8s.local   | in-use |   20 | Attached to master-amsterdam1-6-1-dv-dev-k8s-local on /dev/vdd  |

<cut>

| 055c15f8-45ec-4b48-aa79-c37b81e2156b | etcd-1.etcd-events.dv-dev.k8s.local | in-use |   20 | Attached to master-amsterdam1-3-1-dv-dev-k8s-local on /dev/vdd  |

The first one is etcd-1's main volume, which is having trouble getting mounted on the master node. The second one is etcd-1's events volume, which is mounted on another node. Is that 'normal'?

@hakman
Member

hakman commented Apr 29, 2021

I don't think so, but I am not that familiar with the openstack implementation. Maybe @olemarkus has some thoughts.

@olemarkus
Member

Sorry, I am not that familiar. My guess is that there is a mismatch with tags. I think perhaps @zetaab is the one with the most knowledge of how this works now.

@zetaab
Member

zetaab commented May 3, 2021

https://github.com/kopeio/etcd-manager/blob/e6cc5c083a951d119c1aee44d379322e2f9ce08e/pkg/volumes/openstack/volumes.go#L285

If you have only one availability zone for computes + volumes, this is possible and normal. With multiple availability zones for computes + volumes it is not normal and should not happen. Which case matches your setup?

I do not remember right now how these tag things work and what kind of values they can have. Anyway, I am pretty sure this is (one of) your problems. If you have only one availability zone, I am thinking all 3 masters might be waiting for the "same volume", because the tags are the same on all of them? Can you check openstack volume show <id> and see what kind of properties those have? Are they the same or not? You should have 6 volumes.
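To dump the properties of all of them in one go, something like this should work with the standard openstack CLI (just a sketch):

# Print the properties of every volume in the project
for id in $(openstack volume list -f value -c ID); do
  echo "== $id"
  openstack volume show "$id" -c properties -f value
done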

@kciredor
Author

kciredor commented May 4, 2021

This is indeed a single availability zone.

Here's the output of volume show for all of them, in a run where one out of three masters does not become healthy:

$ openstack volume show cb2bef34-e09c-42fa-9822-7b9f87477860 | grep properties
| properties                   | KubernetesCluster='dv-dev.k8s.local', k8s.io/etcd/main='etcd-2/etcd-1,etcd-2,etcd-3', k8s.io/role/master='1'                                                                                                                                                                                                  

$ openstack volume show 5f6b611a-6688-4357-9adb-176f15f131ef | grep properties
| properties                   | KubernetesCluster='dv-dev.k8s.local', k8s.io/etcd/main='etcd-3/etcd-1,etcd-2,etcd-3', k8s.io/role/master='1'                                                                                                                                                                                                 

$ openstack volume show 5c44493d-525e-4b07-b078-6e258e3c858d | grep properties
| properties                   | KubernetesCluster='dv-dev.k8s.local', k8s.io/etcd/main='etcd-1/etcd-1,etcd-2,etcd-3', k8s.io/role/master='1'                                                                                                                                                                                                  

$ openstack volume show bd8d6578-82e6-4680-9109-064cbc7c5bdf | grep properties
| properties                   | KubernetesCluster='dv-dev.k8s.local', k8s.io/etcd/events='etcd-3/etcd-1,etcd-2,etcd-3', k8s.io/role/master='1'                                                                                                                                                                                                

$ openstack volume show afac5a39-f068-407a-bcdf-eedbd5e0d108 | grep properties
| properties                   | KubernetesCluster='dv-dev.k8s.local', k8s.io/etcd/events='etcd-2/etcd-1,etcd-2,etcd-3', k8s.io/role/master='1'                                                                                                                                                                                               

$ openstack volume show abf662ac-bdd3-4d61-af0a-c8bccbe0f2a0 | grep properties
| properties                   | KubernetesCluster='dv-dev.k8s.local', k8s.io/etcd/events='etcd-1/etcd-1,etcd-2,etcd-3', k8s.io/role/master='1'

I just double-checked, and the master node that can't mount the disk actually has the volume properly attached to it. All 6 volumes are attached to the 3 masters. Trouble starts right after:

I0504 10:18:46.212143    5148 mounter.go:72] Master volume "5c44493d-525e-4b07-b078-6e258e3c858d" is attached at "/dev/vdc"
I0504 10:18:46.213743    5148 mounter.go:86] Doing safe-format-and-mount of /dev/vdc to /mnt/master-etcd-1.etcd-main.dv-dev.k8s.local

I can actually manually mkfs.ext4 /dev/vdc and mount /dev/vdc /mnt/test at that point in time. Straight after doing so, etcd-manager manages to mount it to the proper path itself (in this case: /mnt/master-etcd-1.etcd-main.dv-dev.k8s.local).

(The issue started with kops 1.19 coming from 1.18)

@zetaab
Member

zetaab commented May 5, 2021

I am sorry, I have no idea what is wrong. Anyway, etcd-manager is not using the volumes from /dev/vd*, it is using the ones under /dev/disk/by-id. Can you check the code at https://github.com/kopeio/etcd-manager/blob/e6cc5c083a951d119c1aee44d379322e2f9ce08e/pkg/volumes/openstack/volumes.go#L346-L373 and see whether you can find the volumes at that path? Run ls -l /dev/disk/by-id/ (inside the etcd-manager container) and then check whether any of the candidateDeviceNodes match your case. You can also copy-paste the result of the ls -l command here so we can help with debugging.

This code is similar to what exists in https://github.com/kubernetes/cloud-provider-openstack/blob/ef0206889cb7c39f04d215e8cbb6c69616e59ea4/pkg/util/mount/mount.go#L147-L179, so in that case that should not work for you either.
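One thing you can cross-check there: I think the by-id entry is the virtio serial, which should be the volume UUID truncated to 20 characters (that part is my assumption from the candidateDeviceNodes code), so you can compute the expected symlink name from the volume ID, something like:

# Example with the volume from your earlier log
VOLUME_ID="b5162d3c-e1f6-45c9-b5bd-14ea4ba485d2"
EXPECTED="virtio-${VOLUME_ID:0:20}"
echo "expecting ${EXPECTED}"
ls -l "/rootfs/dev/disk/by-id/${EXPECTED}"   # from inside the etcd-manager container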

@kciredor
Author

Example run:

etcd-main is properly mounted on the node in /mnt, but etcd-events is unable to mount.

/var/log/etcd-events.log shows:

I0510 08:19:11.766807    6159 mounter.go:294] Trying to mount master volume: "54d11e67-9f9c-4ffe-9d72-0f4bc39668b1"
I0510 08:19:12.257280    6159 mounter.go:308] Currently attached volumes: [0xc000679400]
I0510 08:19:12.257318    6159 mounter.go:72] Master volume "54d11e67-9f9c-4ffe-9d72-0f4bc39668b1" is attached at "/dev/vdd"
I0510 08:19:12.257348    6159 mounter.go:86] Doing safe-format-and-mount of /dev/vdd to /mnt/master-etcd-1.etcd-events.dv-dev.k8s.local
...
I0510 08:21:44.494267    6159 mounter.go:121] Waiting for volume "54d11e67-9f9c-4ffe-9d72-0f4bc39668b1" to be mounted

From the etcd-events container, the output @zetaab requested:

# ls /dev/disk
ls: cannot access '/dev/disk': No such file or directory
# ls -l /rootfs/dev/disk/by-id
total 0
lrwxrwxrwx 1 root root  9 May 10 08:18 virtio-0fb0894a-e8b0-4a7f-a -> ../../vdc
lrwxrwxrwx 1 root root  9 May 10 12:03 virtio-54d11e67-9f9c-4ffe-9 -> ../../vdd
lrwxrwxrwx 1 root root  9 May 10 08:15 virtio-d49e746c-7ff4-4464-9 -> ../../vda
lrwxrwxrwx 1 root root 10 May 10 08:15 virtio-d49e746c-7ff4-4464-9-part1 -> ../../vda1

Now a manual mkfs.ext4 from the node with /dev/vdd or from the container with /rootfs/dev/vdd results in a clean format of an unformatted volume:

# mkfs.ext4 /rootfs/dev/vdd
mke2fs 1.44.5 (15-Dec-2018)
Creating filesystem with 5242880 4k blocks and 1310720 inodes
Filesystem UUID: c7790ef8-37b2-4dd1-99f9-2f7bda291b97
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Looks like "safe-format-and-mount" now continues progress immediately and /var/log/etcd-events.log now shows:

I0510 12:03:17.966526    6159 volumes.go:357] Found disk attached as "virtio-54d11e67-9f9c-4ffe-9"; full devicepath: /rootfs/dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9
I0510 12:03:17.966590    6159 mounter.go:125] Found volume "54d11e67-9f9c-4ffe-9d72-0f4bc39668b1" mounted at device "/dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9"
I0510 12:03:17.968493    6159 mounter.go:167] Creating mount directory "/rootfs/mnt/master-etcd-1.etcd-events.dv-dev.k8s.local"
I0510 12:03:17.969010    6159 mounter.go:172] Mounting device "/dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9" on "/mnt/master-etcd-1.etcd-events.dv-dev.k8s.local"
I0510 12:03:17.969080    6159 mount_linux.go:405] Attempting to determine if disk "/dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9])
I0510 12:03:17.969248    6159 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- blkid -p -s TYPE -s PTTYPE -o export /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9]
I0510 12:03:18.002710    6159 mount_linux.go:408] Output: "DEVNAME=/dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9\nTYPE=ext4\n", err: <nil>
I0510 12:03:18.002768    6159 mount_linux.go:298] Checking for issues with fsck on disk: /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9
I0510 12:03:18.002793    6159 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- fsck -a /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9]
I0510 12:03:18.052118    6159 mount_linux.go:394] Attempting to mount disk /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9 in ext4 format at /mnt/master-etcd-1.etcd-events.dv-dev.k8s.local
I0510 12:03:18.052189    6159 nsenter.go:80] nsenter mount /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9 /mnt/master-etcd-1.etcd-events.dv-dev.k8s.local ext4 [defaults]
I0510 12:03:18.052486    6159 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/systemd-run --description=Kubernetes transient mount for /mnt/master-etcd-1.etcd-events.dv-dev.k8s.local --scope -- /bin/mount -t ext4 -o defaults /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9 /mnt/master-etcd-1.etcd-events.dv-dev.k8s.local]
I0510 12:03:18.119726    6159 nsenter.go:84] Output of mounting /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9 to /mnt/master-etcd-1.etcd-events.dv-dev.k8s.local: Running scope as unit: run-rc22bec436db047b7b0e91da2b9f99c37.scope
I0510 12:03:18.119881    6159 mount_linux.go:163] Detected OS without systemd
I0510 12:03:18.121797    6159 mounter.go:252] device "/dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9" did not evaluate as a symlink: lstat /dev/disk: no such file or directory
I0510 12:03:18.121904    6159 mounter.go:252] device "/dev/vdd" did not evaluate as a symlink: lstat /dev/vdd: no such file or directory
I0510 12:03:18.121941    6159 mounter.go:232] matched device "/dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9" and "/dev/vdd" via '\x00'
I0510 12:03:18.122005    6159 mounter.go:94] mounted master volume "54d11e67-9f9c-4ffe-9d72-0f4bc39668b1" on /mnt/master-etcd-1.etcd-events.dv-dev.k8s.local

Extra, listing of /dev/vdX on the etcd-main container:

# ls -l /dev/vd*
brw-rw---- 1 root 6 254,  0 May 10 08:18 /dev/vda
brw-rw---- 1 root 6 254,  1 May 10 08:18 /dev/vda1
brw-rw---- 1 root 6 254, 16 May 10 08:18 /dev/vdb

Extra, listing of /dev/vdX on the etcd-events container:

root@master-amsterdam1-5-1-dv-dev-k8s-local:~# ls -l /dev/vd*
brw-rw---- 1 root disk 254,  0 May 10 08:15 /dev/vda
brw-rw---- 1 root disk 254,  1 May 10 08:15 /dev/vda1
brw-rw---- 1 root disk 254, 16 May 10 08:17 /dev/vdb
brw-rw---- 1 root disk 254, 32 May 10 08:18 /dev/vdc
brw-rw---- 1 root disk 254, 48 May 10 12:03 /dev/vdd

Even though I've been looking at this for a while now and am able to make my own changes to etcd-manager, it is hard to debug. All I know for sure is that a manual mkfs.ext4 immediately results in the happy flow, but I cannot find any log output about a failed mkfs.ext4.
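For the next failing node, my plan is to replay by hand what etcd-manager does, using the same nsenter invocations that show up in the logs above (paths taken from those logs; just a sketch):

# 1. Ask the host to (re)create device nodes and by-id symlinks
nsenter --mount=/rootfs/proc/1/ns/mnt -- /sbin/udevadm trigger

# 2. Check whether the by-id symlink for the volume exists now
ls -l /rootfs/dev/disk/by-id/

# 3. If it does, run the same probe safe-format-and-mount starts with
nsenter --mount=/rootfs/proc/1/ns/mnt -- blkid -p -s TYPE -s PTTYPE -o export /dev/disk/by-id/virtio-54d11e67-9f9c-4ffe-9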

@hakman
Member

hakman commented May 10, 2021

@kciredor There's a new etcd-manager version that was released last week. Could you use that for further debugging, as it has various updates that may help?
#11098

@kciredor
Author

Interesting. Using the new image now. But this time, on the master node, there was no /dev/vdd, only /dev/vdc (and the corresponding /dev/disk/by-id/virtio-xxx symlinks). Running udevadm trigger on the master node itself made /dev/vdd and its by-id symlink appear, resulting in an immediately mounted etcd-events volume next to the existing etcd-main.

Logs reflect this as well, after a long loop of scanning buses:

I0510 13:53:10.175364    6041 mounter.go:121] Waiting for volume "3def81ce-becc-430f-a3bf-a65ce95b823f" to be mounted
I0510 13:53:11.176163    6041 volumes.go:417] Could not find device path for volume; scanning buses
I0510 13:53:17.507287    6041 volumes.go:417] Could not find device path for volume; scanning buses
I0510 13:53:23.845494    6041 volumes.go:417] Could not find device path for volume; scanning buses
I0510 13:53:30.188972    6041 volumes.go:417] Could not find device path for volume; scanning buses
I0510 13:53:36.523191    6041 volumes.go:366] Found disk attached as "virtio-3def81ce-becc-430f-a"; full devicepath: /rootfs/dev/disk/by-id/virtio-3def81ce-becc-430f-a

In the etcd-events container udevadm is available as well. It kind of looks like udevadm trigger from the container, which is what the code actually runs, should have done the trick but did not.

So with the 'old' image I'm having a does-not-mkfs.ext4 problem and with the new image I'm having a udevadm-does-not-work problem, it appears ;-)
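One more thing worth trying by hand on the next stuck node: run the trigger in the host mount namespace and then wait for the udev event queue to drain before re-checking the symlinks. Just a sketch; whether a missing settle after the trigger is the actual problem is only a guess:

nsenter --mount=/rootfs/proc/1/ns/mnt -- /sbin/udevadm trigger --subsystem-match=block
nsenter --mount=/rootfs/proc/1/ns/mnt -- /sbin/udevadm settle --timeout=30
ls -l /rootfs/dev/disk/by-id/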

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2021
@kciredor
Author

kciredor commented Aug 9, 2021

Just a heads-up now that lifecycle/stale was applied: this is still an issue for us when upgrading to kOps 1.20.2 and etcd-manager 1.20.2.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kciredor
Author

kciredor commented Oct 8, 2021

/reopen this is still a thing with latest kOps on OpenStack.

@hakman
Member

hakman commented Oct 8, 2021

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@hakman: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Oct 8, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 8, 2021
@hakman
Member

hakman commented Oct 8, 2021

@kciredor Maybe you will have more luck with the latest etcd-manager image: 78f1367

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
