
CSI driver fails to clean up deleted PVs after intree migration #4242

Open
phoerious opened this issue Nov 7, 2023 · 14 comments
Labels
wontfix This will not be worked on

Comments

@phoerious

phoerious commented Nov 7, 2023

Describe the bug

I recently migrated from the in-tree Ceph storage driver to the CSI driver and wanted to enable the migration plugin for existing kubernetes.io/rbd volumes.

I used these two documents for reference:

I noticed that both are rather incomplete and grammatically confusing. I think I did everything required for the migration, but I don't really know whether the legacy plugin is actually redirected to the CSI driver or not. I believe it is, since I tried what was written in the first document above:

Kubernetes storage admin supposed to create a clusterID based on the monitors hash ( ex: #md5sum <<< "monaddress:port") in the CSI config map

and I got errors in the provisioner log about it not finding the correct cluster ID. The errors go away when I generate the hash without a trailing \n using echo -n "<monaddress[es]:port>" | md5sum instead (I think this is a bug in the docs!).
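To illustrate (the monitor address below is just a placeholder): the here-string form from the docs feeds a trailing newline into md5sum, whereas echo -n does not, so the two commands produce different hashes:

# Here-string appends a trailing newline, which changes the hash:
md5sum <<< "192.168.1.1:6789"
# echo -n omits the newline; this is the hash that worked as clusterID for me:
echo -n "192.168.1.1:6789" | md5sum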

My main issue, however, is that when I create a new PVC using the legacy storage class, an RBD image gets provisioned (and cleaned up again on deletion), but the PV gets stuck in a Terminating state with the following error:

Warning  VolumeFailedDelete  4s (x6 over 14s)  rbd.csi.ceph.com_ceph-csi-rbd-provisioner-789d77444b-7nlsm_0308f221-8899-470e-8098-b35d78cdb3dc  rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

The provisioner logs this:

I1107 16:04:06.831249       1 controller.go:1502] delete "pvc-b798f870-0157-4882-a0de-eee85c93ff4b": started
E1107 16:04:06.853166       1 controller.go:1512] delete "pvc-b798f870-0157-4882-a0de-eee85c93ff4b": volume deletion failed: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
W1107 16:04:06.853214       1 controller.go:989] Retrying syncing volume "pvc-b798f870-0157-4882-a0de-eee85c93ff4b", failure 10
E1107 16:04:06.853245       1 controller.go:1007] error syncing volume "pvc-b798f870-0157-4882-a0de-eee85c93ff4b": rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
I1107 16:04:06.853299       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-b798f870-0157-4882-a0de-eee85c93ff4b", UID:"59cede5c-1403-465f-8cd3-f9bfa8b2b94e", APIVersion:"v1", ResourceVersion:"3420567564", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

The existence of this error at least indicates that the CSI plugin does handle the kubernetes.io/rbd requests, albeit unsuccessfully.

I did verify with rbd ls rbd.k8s-pvs | grep VOLUME_NAME that the RBD volume gets created and deleted correctly, so this is a bogus "Permission denied" error. It is annoying nonetheless, since the only way to get rid of the PV is to edit the spec and remove the finalizer.
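For the record, the finalizer can be removed with kubectl edit or, equivalently, with a patch along these lines (PV name taken from the log above; only safe here because the RBD image is already gone):

# Clear the finalizers so Kubernetes releases the already-deleted volume:
kubectl patch pv pvc-b798f870-0157-4882-a0de-eee85c93ff4b \
  --type merge -p '{"metadata":{"finalizers":null}}'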

Environment details

  • Image/version of Ceph CSI driver: 3.9.0
  • Helm chart version: 3.9.0
  • Kernel version: 5.4.0-153-generic
  • Kubernetes cluster version: 1.28
  • Ceph cluster version: Quincy

Steps to reproduce

Steps to reproduce the behavior (a sketch of the corresponding commands follows the list):

  1. Enable RBD CSI migration feature gates
  2. Create and bind PVC with legacy storage class
  3. Delete PVC/PV.
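
Roughly the sequence I follow, with a hypothetical legacy storage class name ("rbd-legacy") and the feature gate name as used in the upstream in-tree migration documentation:

# 1. Enable the RBD CSI migration feature gates on kube-apiserver, kube-controller-manager
#    and the kubelets (gate name per the upstream docs; adjust to your Kubernetes version):
#      --feature-gates=CSIMigrationRBD=true

# 2. Create a PVC against the legacy in-tree storage class ("rbd-legacy" is a placeholder):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: migration-test
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: rbd-legacy
EOF

# 3. Delete the PVC and watch the PV hang in Terminating:
kubectl delete pvc migration-test
kubectl get pv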

Actual results

RBD volume gets created and deleted, PVC is deleted as well, but PV gets stuck in Terminating state with a bogus Permission denied error.


github-actions bot commented Dec 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the wontfix This will not be worked on label Dec 7, 2023
@phoerious
Author

No, thank you!

@github-actions github-actions bot removed the wontfix This will not be worked on label Dec 8, 2023

github-actions bot commented Jan 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the wontfix This will not be worked on label Jan 8, 2024
@phoerious
Author

Jeez....

@github-actions github-actions bot removed the wontfix This will not be worked on label Jan 9, 2024

github-actions bot commented Feb 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the wontfix This will not be worked on label Feb 8, 2024
@phoerious
Author

😞

@Madhu-1
Collaborator

Madhu-1 commented Feb 9, 2024

connecting failed: rados: ret=-13, Permission denied

This mostly happens due to a permissions issue. Can you please check and update the Ceph user caps as per https://github.com/ceph/ceph-csi/blob/devel/docs/capabilities.md?
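For example, the current caps can be dumped with something like the following (the client name is a placeholder for your CSI user) and compared against the capabilities doc:

# Show the key and caps of the CSI provisioner user (client name is a placeholder):
ceph auth get client.k8s-csi-rbd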

@phoerious we really don't have solid E2E coverage for the migration. If you have logs, we can try to debug and see what is happening.

@phoerious
Author

phoerious commented Feb 9, 2024

These are the permissions of both the new CSI user and the old legacy user:

caps mgr = "allow rw"
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd.k8s-pvs, profile rbd pool=rbd.k8s-pvs-ssd"

I create a PVC with the old storage class name, which gets rerouted to the new CSI driver. When I try to delete that PVC, the associated PV gets stuck "Terminating" with this:

  Warning  VolumeFailedDelete  4s (x6 over 14s)  rbd.csi.ceph.com_ceph-csi-rbd-provisioner-789d77444b-fjrmg_673c5c4f-7ce8-424f-836e-22e2d06cc1ad  rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

The provisioner log is littered with this:

I0209 10:38:22.378293       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6", UID:"6b536ef6-9ceb-4879-a2e2-c10c3f9fe20a", APIVersion:"v1", ResourceVersion:"3698428630", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
I0209 10:39:26.379253       1 controller.go:1502] delete "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6": started
E0209 10:39:26.407627       1 controller.go:1512] delete "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6": volume deletion failed: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
W0209 10:39:26.407731       1 controller.go:989] Retrying syncing volume "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6", failure 8
E0209 10:39:26.407806       1 controller.go:1007] error syncing volume "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6": rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
I0209 10:39:26.407882       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6", UID:"6b536ef6-9ceb-4879-a2e2-c10c3f9fe20a", APIVersion:"v1", ResourceVersion:"3698428630", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

The associated RBD in the pool has long been deleted; the following check comes back empty:

rbd -p rbd.k8s-pvs ls | grep kubernetes-dynamic-pvc-e7c7501f-c1c4-42bb-bef1-32b57d418def

That's all I have.

@github-actions github-actions bot removed the wontfix This will not be worked on label Feb 9, 2024
@Madhu-1
Collaborator

Madhu-1 commented Feb 12, 2024

caps mgr = "allow rw"
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd.k8s-pvs, profile rbd pool=rbd.k8s-pvs-ssd"

Can you please remove the extra profile from the osd caps and see if that is what is causing the issue? Can you make it as below:

caps mgr = "allow rw"
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd.k8s-pvs"
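
If it helps, that change could be applied with a command along these lines (the client name is a placeholder; note that ceph auth caps replaces all caps, so all three must be restated):

# Re-set the caps with only the single pool profile in the osd section
# (client name is a placeholder for the CSI user):
ceph auth caps client.k8s-csi-rbd \
  mgr 'allow rw' \
  mon 'profile rbd' \
  osd 'profile rbd pool=rbd.k8s-pvs'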

@phoerious
Author

Same thing.


github-actions bot commented Mar 16, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the wontfix This will not be worked on label Mar 16, 2024
@phoerious
Author

Nope, still there.

@github-actions github-actions bot removed the wontfix This will not be worked on label Mar 17, 2024

github-actions bot commented Apr 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the wontfix This will not be worked on label Apr 17, 2024
@phoerious
Author

🎺
