Delete snapshots before deleting RBD image #3416

Closed
vriabyk opened this issue Oct 6, 2022 · 10 comments
Labels
component/rbd (Issues related to RBD), question (Further information is requested), wontfix (This will not be worked on)

@vriabyk

vriabyk commented Oct 6, 2022

Describe the feature you'd like to have

I want image snapshots to be deleted on an image delete request. So basically something like this:

# rbd snap purge IMAGE_ID
# rbd rm IMAGE_ID

What is the value to the end user? (why is it a priority?)

This matters because images that have snapshots cannot be deleted from Ceph and get stuck in the trash. If you delete a k8s PVC (PV) whose image has RBD snapshots in Ceph, the image is not actually deleted and stays in the trash. The ceph mgr logs then fill up with messages like this:

debug 2022-09-30T10:07:30.214+0000 7fb618bba700  0 [rbd_support INFO root] execute_trash_remove: task={"sequence": 47220, "id": "701e47ae-fb41-4838-ae32-a26e802d0097", "message": "Removing image csi/45e917746337a from trash", "refs": {"action": "trash remove", "pool_name": "csi", "pool_namespace": "", "image_id": "45e917746337a"}, "retry_message": "[errno 39] RBD image has snapshots (error deleting image from trash)", "retry_attempts": 9, "retry_time": "2022-09-30T10:07:28.411597", "in_progress": true, "progress": 0.0}
debug 2022-09-30T10:07:30.230+0000 7fb618bba700  0 [rbd_support ERROR root] execute_task: [errno 39] RBD image has snapshots (error deleting image from trash)
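
To find out which trashed images are affected, the `task=` payload in these mgr log lines can be parsed as JSON. The following is a minimal sketch that assumes the line layout shown above; the function names `parse_trash_task` and `blocked_by_snapshots` are hypothetical helpers, not part of Ceph or ceph-csi.

```python
import json
import re

# Matches the JSON payload that follows "task=" at the end of a log line.
TASK_RE = re.compile(r"task=(\{.*\})\s*$")


def parse_trash_task(line):
    """Return a dict describing a 'trash remove' task, or None if absent."""
    match = TASK_RE.search(line)
    if match is None:
        return None
    task = json.loads(match.group(1))
    refs = task.get("refs", {})
    if refs.get("action") != "trash remove":
        return None
    return {
        "pool": refs.get("pool_name"),
        "image_id": refs.get("image_id"),
        "retry_message": task.get("retry_message"),
        "retry_attempts": task.get("retry_attempts", 0),
    }


def blocked_by_snapshots(lines):
    """Collect (pool, image_id) pairs whose removal retries due to snapshots."""
    blocked = set()
    for line in lines:
        info = parse_trash_task(line)
        if info and "has snapshots" in (info["retry_message"] or ""):
            blocked.add((info["pool"], info["image_id"]))
    return sorted(blocked)
```

Feeding the mgr log through `blocked_by_snapshots` gives a deduplicated list of image IDs to inspect, regardless of how many retries were logged for each.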

How will we know we have a good solution? (acceptance criteria)

The image isn't getting stuck in trash.

Additional context

As far as I can see, this problem was solved in the Ceph dashboard some time ago:

https://tracker.ceph.com/issues/36404

So they delete the snapshots and then delete the image.

Ceph CSI stops provisioning new volumes once the Ceph trash is full. k8s PVCs just sit waiting to be provisioned and the logs are full of messages like:

I0930 10:06:28.594040       1 rbd_util.go:610] ID: 105522 Req-ID: 0001-0024-3bbd557f-6326-493f-a98d-3491181374a8-0000000000000003-bab0957c-3a77-11ed-a1f6-9234d9da000a rbd: delete production-island10bab0957c-3a77-11ed-a1f6-9234d9da000a-temp using mon 10.25.37.17:6789,10.25.37.18:6789,10.25.37.12:6789,10.25.37.13:6789,10.25.37.11:6789, pool csi
E0930 10:06:28.619336       1 rbd_util.go:591] ID: 105522 Req-ID: 0001-0024-3bbd557f-6326-493f-a98d-3491181374a8-0000000000000003-bab0957c-3a77-11ed-a1f6-9234d9da000a failed to list images in trash: rbd: ret=-34, Numerical result out of range
E0930 10:06:28.619402       1 utils.go:200] ID: 105522 Req-ID: 0001-0024-3bbd557f-6326-493f-a98d-3491181374a8-0000000000000003-bab0957c-3a77-11ed-a1f6-9234d9da000a GRPC error: rpc error: code = Internal desc = rbd: ret=-34, Numerical result out of range
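
The numeric codes in these logs follow the errno convention, with the ceph-csi side reporting them as negative values. As a small sketch for decoding them (the helper name `describe_ret` is hypothetical, and errno numbering is platform-specific, so the values below assume Linux, where 34 is ERANGE and 39 is ENOTEMPTY):

```python
import errno
import os


def describe_ret(ret):
    """Translate an errno-style return code (positive or negative) to (name, message)."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)
```

For example, `describe_ret(-34)` identifies the `ret=-34` above as ERANGE, which matches the "Numerical result out of range" text in the log.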
@Rakshith-R
Contributor

> Describe the feature you'd like to have
> I want image snapshots to be deleted on an image delete request. So basically something like this:

@vriabyk
Were the RBD image snapshots you are talking about created manually through the Ceph CLI or dashboard?
If so, it is the user's responsibility to delete those snapshots.

@vriabyk
Author

vriabyk commented Oct 7, 2022

@Rakshith-R, yes, the snapshot was created manually via the rbd CLI. Is there any problem with adding this logic on your side? We can create a simple PR for that. You could at least implement an option that is disabled by default but, if enabled, deletes snapshots before deleting the image.

Otherwise, ceph-csi shouldn't delete the PVC/PV from k8s if the image has snapshots in Ceph, and should throw an error message like: "Can't delete pvc because it has snapshots in ceph".
At some point ceph-csi stops provisioning new volumes once the RBD trash is full.

@Rakshith-R
Contributor

> @Rakshith-R, yes, snapshot was created manually via rbd cli. Any problem to add this logic from your side? We can create simple PR for that. You may at least implement some option which will be disabled by default, but if enabled - will delete snapshots before deleting image.
>
> Otherwise, ceph csi shouldn't delete pvc/pv from k8s if the image has snapshots in ceph and throw error message like: "Can't delete pvc because it has snapshots in ceph". At some point ceph csi stops provisioning new volumes once rbd trash is full.

@vriabyk
I would prefer that the user who created the snapshot deletes it.
Blocking the deletion of an image because of an RBD snapshot that was explicitly created outside of cephcsi makes sense to me. I don't think we need to add a step to purge something that was created outside of cephcsi's scope.

cc @ceph/ceph-csi-contributors

@Madhu-1
Collaborator

Madhu-1 commented Oct 7, 2022

Yes, I agree with the above point: if the user created the snapshots, it is the user's responsibility to delete them before deleting the PVC or snapshots. I don't think we should purge snapshots in the RBD image before deleting it.

@crabique

crabique commented Oct 7, 2022

This does sound like something that can be handled at the CSI level; provisioning and deprovisioning volumes are the main things expected of such an interface.

Things in the Ceph ecosystem usually handle snapshots that way: they are transparently purged upon RBD volume deletion unless they are protected. The most common scenario here is not that they are "manually" created by some user, but that a backup routine talking directly to Ceph created them to get a reliable versioned source volume, made its backups, and will only need them for diffing if the volume continues to exist and has to be backed up again.

How do you imagine this working automatically at any scale? When a PVC is deleted from Kubernetes, ceph-csi successfully removes the k8s objects. Nothing but the CSI controller receives the deletion request, and it does virtually nothing about the image, misleading the user into thinking the volume has been deleted when it has not.

So if anyone is expected to approach this issue in an automated way, the solution would have to be either some sort of cronjob or sourcing data by dumpster diving, both of which sound like extreme workarounds for something that can be handled properly. It would seem logical for a CSI implementation to handle volume deletion properly in this "edge" case. If this is deemed too dangerous, it could be an opt-in flag, or even something more granular like an annotation.
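
The behaviours discussed so far (refuse the deletion with a clear error by default, purge only when an opt-in flag is set, and never touch protected snapshots) can be sketched against an in-memory model. This is only an illustration under stated assumptions: `Image`, `Snapshot`, `DeleteBlocked`, `delete_image`, and the `purge_snapshots` flag are all hypothetical names, and none of this is ceph-csi code.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    name: str
    protected: bool = False


@dataclass
class Image:
    name: str
    snapshots: list = field(default_factory=list)


class DeleteBlocked(Exception):
    """The image was left untouched because deleting it is not allowed."""


def delete_image(pool, image_name, purge_snapshots=False):
    """Delete image_name from pool, a dict mapping image names to Image objects.

    purge_snapshots is the hypothetical opt-in flag; it defaults to off.
    """
    image = pool[image_name]
    if image.snapshots:
        if not purge_snapshots:
            # Default: fail loudly instead of leaving the image in the trash.
            raise DeleteBlocked(f"can't delete {image_name}: it has snapshots")
        protected = [s.name for s in image.snapshots if s.protected]
        if protected:
            # Protected snapshots are never purged, even with the flag on.
            raise DeleteBlocked(
                f"can't delete {image_name}: protected snapshots {protected}"
            )
        image.snapshots.clear()  # stands in for `rbd snap purge`
    del pool[image_name]  # stands in for `rbd rm`
```

With the flag off, the caller gets an error and the volume stays visible, so the user is not told the volume is gone while the image lingers in the trash.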

@Madhu-1
Collaborator

Madhu-1 commented Oct 7, 2022

IMHO it's not the best option to handle, at the cephcsi level, something that was created by some other entity. If an external component performs an operation at the Ceph level, we should expect it to undo that operation before the volume is deleted. Providing this functionality behind a flag is not a problem, but it doesn't sound right. If snapshots are created by an external entity, we should expect that entity to delete them as well.

@crabique

crabique commented Oct 7, 2022

How would the entity responsible for snapshot creation know to delete the snapshots before ceph-csi attempts to delete the volume? As far as I'm aware, there is currently no way to configure an external webhook dependency for the finalizer that would cause ceph-csi to wait until the snapshots are deleted either.

The issue here is that it falls within the CSI implementation's responsibility to either do something about it or delegate the problem to something external; at the moment it does neither.

@Rakshith-R
Contributor

> How would the entity responsible for snapshot creation know to delete the snapshots before ceph-csi attempts to delete the volume? As far as I'm aware, there is currently no way to configure an external webhook dependency for the finalizer that would cause ceph-csi to wait until the snapshots are deleted either.
>
> The issue here is that it falls within the CSI implementation's responsibility to either do something about it or delegate the problem to something external; at the moment it does neither.

@crabique
I disagree. cephcsi operates at the CSI level as an interface between the orchestrator and Ceph.

It is not responsible for anything done at the Ceph level by another user or entity on images created by cephcsi. They need to clean up after themselves, maybe with a routine that checks images in the trash for snapshots it created and purges them?

Purging snapshots in cephcsi does not sound like a good idea: cephcsi also creates and deletes k8s snapshots and clones of PVCs, which have snapshot links to the parent images and continue to exist after the parent PVCs are deleted.
(The RBD images in the trash have a task that removes them after the child images are deleted.)
Refer to https://github.com/ceph/ceph-csi/blob/devel/docs/design/proposals/rbd-snap-clone.md
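
A cleanup routine of the kind suggested here would need a way to tell its own snapshots apart from those that cephcsi keeps for clones. One possible sketch, assuming the external tool names its snapshots with a known prefix (the prefix, the input layout, and the function name `plan_trash_cleanup` are all assumptions made for illustration):

```python
def plan_trash_cleanup(trash_entries, owner_prefix):
    """Return {image_id: [snapshot names]} that the external tool may purge.

    trash_entries is an iterable of dicts shaped like
    {"image_id": "...", "snapshots": ["name", ...]}; how that listing is
    obtained from Ceph is outside this sketch.
    """
    plan = {}
    for entry in trash_entries:
        # Only snapshots carrying the tool's own prefix are selected;
        # anything else, such as snapshots backing clones, is left alone.
        owned = [s for s in entry["snapshots"] if s.startswith(owner_prefix)]
        if owned:
            plan[entry["image_id"]] = owned
    return plan
```

Selecting by ownership rather than purging everything keeps the routine from removing snapshots that child images still depend on.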

ruslanloman added a commit to ruslanloman/ceph-csi that referenced this issue Oct 19, 2022
  Ceph csi couldn't remove pvc with snapshots that was created directly in
  ceph using rbd or other clients. As a result, the rbd image remains
  permanently in the trash.

  Fixes: ceph#3416

  Signed-off-by: ruslanloman <[email protected]>
nixpanic added the question and component/rbd labels Oct 19, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions bot added the wontfix label Nov 18, 2022
@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions bot closed this as not planned Nov 25, 2022
ruslanloman added a commit to ruslanloman/ceph-csi that referenced this issue Dec 27, 2022
  Ceph csi couldn't remove pvc with snapshots that was created directly in
  ceph using rbd or other clients. As a result, the rbd image remains
  permanently in the trash.

  Fixes: ceph#3416

Signed-off-by: ruslanloman <[email protected]>
ruslanloman added a commit to ruslanloman/ceph-csi that referenced this issue Jan 4, 2023
  Ceph csi couldn't remove pvc with snapshots that was created directly in
  ceph using rbd or other clients. As a result, the rbd image remains
  permanently in the trash.

  Fixes: ceph#3416

Signed-off-by: ruslanloman <[email protected]>
5 participants