Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node-agent not waiting for PV removal #8772

Closed
ix-rzi opened this issue Mar 10, 2025 · 4 comments · Fixed by #8777
Closed

node-agent not waiting for PV removal #8772

ix-rzi opened this issue Mar 10, 2025 · 4 comments · Fixed by #8777
Assignees
Milestone

Comments

@ix-rzi
Copy link
Contributor

ix-rzi commented Mar 10, 2025

What steps did you take and what happened:

The node-agent does not wait for the PV to be removed on the cluster before deleting the VolumeSnapshot. As we are using NetApp Trident and NetApp ONTAP based storage this affects us in the following way:

Most of the time the NetApp FlexClone that is made available as PVC for the data upload is still in a deleting state on the NetApp filer when the VolumeSnapshot is deleted as part of the cleanup. As a consequence, NetApp Trident identifies the snapshot that is the source of the FlexClone as 'busy' and will perform a split clone operation, turning the FlexClone into a real clone.

This means that all the data will be copied into an independent volume and will be deleted afterwards. While the waste of space is temporary, it can take hours to perform this operation and may have the potential to fill up the storage. Trident is not able to remove the volume that is the base for the PVC as long as the clone operation is in progress.

I do not think this is a problem with NetApp Trident!

Commands
velero install --provider aws
--bucket velero
--plugins velero/velero-plugin-for-aws:v1.11.0
--secret-file ./secret-file
--use-volume-snapshots=true
--backup-location-config region="minio",s3ForcePathStyle="true",s3Url=http://xxx.yyy.zzz.aaa:9000
--snapshot-location-config region="minio"
--use-node-agent
--features=EnableCSI
--node-agent-configmap=node-agent-config

velero backup create test01 --include-namespaces nginx --snapshot-move-data

What did you expect to happen:

node-agent waits for the PV to disappear before deleting the VolumeSnapshot.

The following information will help us better understand what's going on:

PoC code to wait for the PV delete. When running the forked node-agent the split clones no longer happen:

main...ix-rzi:velero:fix-ensure-pv-removed

I'm happy to discuss and open a PR.

Environment:

Note, this is a local dev/test environment!

  • Velero version : v1.15.2
  • Velero features :
  • Kubernetes version: v1.31.5+k3s1
  • Kubernetes installer & version: K3s
  • Cloud provider or hardware configuration: K3s on Fedora 41
  • OS (e.g. from /etc/os-release):
    NAME="Fedora Linux"
    VERSION="41 (Server Edition)"
    RELEASE_TYPE=stable
    ID=fedora
    VERSION_ID=41
    VERSION_CODENAME=""
    PLATFORM_ID="platform:f41"
    PRETTY_NAME="Fedora Linux 41 (Server Edition)"
    ANSI_COLOR="0;38;2;60;110;180"
    LOGO=fedora-logo-icon
    CPE_NAME="cpe:/o:fedoraproject:fedora:41"
    HOME_URL="https://fedoraproject.org/"
    DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f41/system-administrators-guide/"
    SUPPORT_URL="https://ask.fedoraproject.org/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="Fedora"
    REDHAT_BUGZILLA_PRODUCT_VERSION=41
    REDHAT_SUPPORT_PRODUCT="Fedora"
    REDHAT_SUPPORT_PRODUCT_VERSION=41
    SUPPORT_END=2025-12-15
    VARIANT="Server Edition"
    VARIANT_ID=server

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet
Copy link
Contributor

blackpiglet commented Mar 11, 2025

Let me try to sort out what happens here.

  • The problem happens when Velero deletes the backup or restore Pod, PVC, PV, VS, and VSC.
  • The unwanted is that the backup or restore PVC's data source, which is the backup or restore VS, are deleted before the PVC is deleted. That caused the full clone of the PVC volume from the snapshot. I think that is reasonable because the volume source is gone, and the solution should be making the volume independent from the snapshot source.

If my understanding is correct, then the enhancement should modify the backup restore resource cleaning order.
Your modification should work, but I think we should also consider whether there are some error cases that may also have a similar issue.

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Mar 11, 2025

@ix-rzi

PVC for the data upload is still in a deleting state on the NetApp filer when the VolumeSnapshot is deleted as part of the cleanup

Thanks for the good catch and analysis. Please submit the PR and let's discuss in the PR for minor changes.

The changes are based on the fact that the FlexClone has been fully deleted from the storage when the backupPV disappears from the cluster, otherwise, it still doesn't help. But I think Kubernetes and CSI driver should be working in this way and guarantee this.

@Lyndon-Li
Copy link
Contributor

@blackpiglet Let's track the cleanup logics for error handling separately. Probably, for various cases in error handling, make DeletePVAndPVCIfAny do the same wait is a good idea.

@Lyndon-Li Lyndon-Li added this to the v1.16 milestone Mar 11, 2025
@ix-rzi
Copy link
Contributor Author

ix-rzi commented Mar 11, 2025

@Lyndon-Li Indeed, K8s and the CSI driver should take care of this, assuming a correct CSI driver implementation. I will submit the PR ASAP.

kaovilai added a commit that referenced this issue Mar 13, 2025
* ensure pv has been deleted

Signed-off-by: Roger Zimmermann <[email protected]>

* ensure delete pv unit test

Signed-off-by: Roger Zimmermann <[email protected]>

* comment, errors

Signed-off-by: Roger Zimmermann <[email protected]>

* updated changelog
Signed-off-by: Roger Zimmermann <[email protected]>

Signed-off-by: Roger Zimmermann <[email protected]>

* pass value

Co-authored-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Roger Zimmermann <[email protected]>

* function renamed as suggested

Signed-off-by: Roger Zimmermann <[email protected]>

---------

Signed-off-by: Roger Zimmermann <[email protected]>
Co-authored-by: Tiger Kaovilai <[email protected]>
openshift-merge-bot bot pushed a commit to openshift/velero that referenced this issue Apr 3, 2025
* Issue vmware-tanzu#8772 ensure pv removed (vmware-tanzu#8777)

* ensure pv has been deleted

Signed-off-by: Roger Zimmermann <[email protected]>

* ensure delete pv unit test

Signed-off-by: Roger Zimmermann <[email protected]>

* comment, errors

Signed-off-by: Roger Zimmermann <[email protected]>

* updated changelog
Signed-off-by: Roger Zimmermann <[email protected]>

Signed-off-by: Roger Zimmermann <[email protected]>

* pass value

Co-authored-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Roger Zimmermann <[email protected]>

* function renamed as suggested

Signed-off-by: Roger Zimmermann <[email protected]>

---------

Signed-off-by: Roger Zimmermann <[email protected]>
Co-authored-by: Tiger Kaovilai <[email protected]>

* issue8720: log doesn't show pv name (vmware-tanzu#8771)

* fix: log doesn't show pv name

Signed-off-by: hu-keyu <[email protected]>

* fix: add changelog

Signed-off-by: hu-keyu <[email protected]>

* update changelog fileName

Signed-off-by: hu-keyu <[email protected]>

---------

Signed-off-by: hu-keyu <[email protected]>

* Bump kind cli to v0.27.0 (vmware-tanzu#8699)

Signed-off-by: Tiger Kaovilai <[email protected]>

* Modify how the restore workflow using the resource name.

The restore workflow used name represents the backup resource and the
restore to be restored, but the restored resource name may be different
from the backup one, e.g. PV and VSC are global resources, to avoid
conflict, need to rename them.
Reanme the name variable to backupResourceName, and use obj.GetName()
for restore operation.

Signed-off-by: Xun Jiang <[email protected]>

* Enable containerdv2 images

Fixes vmware-tanzu#8648

Signed-off-by: Tiger Kaovilai <[email protected]>

* Bump github.com/golang-jwt/jwt/v5 from 5.2.1 to 5.2.2 (vmware-tanzu#8806)

Bumps [github.com/golang-jwt/jwt/v5](https://github.com/golang-jwt/jwt) from 5.2.1 to 5.2.2.
- [Release notes](https://github.com/golang-jwt/jwt/releases)
- [Changelog](https://github.com/golang-jwt/jwt/blob/main/VERSION_HISTORY.md)
- [Commits](golang-jwt/jwt@v5.2.1...v5.2.2)

---
updated-dependencies:
- dependency-name: github.com/golang-jwt/jwt/v5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Document schedule skipImmediately (vmware-tanzu#8802)

Fixes vmware-tanzu#8787

Signed-off-by: Tiger Kaovilai <[email protected]>

* issue 8803: use deterministic name to create backupRepository

Signed-off-by: Lyndon-Li <[email protected]>

* add third party annotation support for maintenance job

Signed-off-by: Lyndon-Li <[email protected]>

* Bump the golang.org/x/net to v0.36.0 to fix Restic CVE.

Signed-off-by: Xun Jiang <[email protected]>

* update readme and implemented design for 1.16

Signed-off-by: Lyndon-Li <[email protected]>

* add doc for upgrade to 1.16

Signed-off-by: Lyndon-Li <[email protected]>

* add 1.16 doc

Signed-off-by: Lyndon-Li <[email protected]>

* Fix the non data mover migration failure.

Migration cases use the Kibishii as the workload, and SC mapping
ConfigMap was needed for all scenarios, because standby cluster
doesn't have the Kibishii SC after setting up.

Signed-off-by: Xun Jiang <[email protected]>

* add 1.16 changelog

Signed-off-by: Lyndon-Li <[email protected]>

* Bump the migration and upgrade E2E test source version.

Add v1.16 related plugin and other image default version.

Signed-off-by: Xun Jiang <[email protected]>

* Align the E2E upgrade test's CLI and image version.

Signed-off-by: Xun Jiang <[email protected]>

* pin velero image

Signed-off-by: Lyndon-Li <[email protected]>

* skip subresource in resource discovery (vmware-tanzu#6688)

Signed-off-by: lou <[email protected]>
Co-authored-by: lou <[email protected]>

* fix issue 6753

Signed-off-by: Lyndon-Li <[email protected]>

* Update restore controller logic for restore deletion (vmware-tanzu#6761)

1. Skip deleting the restore files from storage if the backup/BSL is not found
2. Allow deleting the restore files from storage even though the BSL is readonly

Signed-off-by: Wenkai Yin(尹文开) <[email protected]>

* Fix vmware-tanzu#6752: add namespace exclude check.

Add PSA audit and warn labels.

Signed-off-by: Xun Jiang <[email protected]>

* add csi snapshot data movement doc

Signed-off-by: Lyndon-Li <[email protected]>

* Modify changelogs for v1.12

Signed-off-by: allenxu404 <[email protected]>

* issue 6786:always delete VSC regardless of the deletion policy

Signed-off-by: Lyndon-Li <[email protected]>

* issue: move plugin depdending podvolume functions to util pkg

Signed-off-by: Lyndon-Li <[email protected]>

* issue 6880: set ParallelUploadAboveSize as MaxInt64

Signed-off-by: Lyndon-Li <[email protected]>

* changelog

Signed-off-by: Tiger Kaovilai <[email protected]>

* Add support for block volumes (vmware-tanzu#6680) (vmware-tanzu#6897)

(cherry picked from commit 8e01d1b)

Signed-off-by: David Zaninovic <[email protected]>

* Replace the base image with paketobuildpacks image

Replace the base image with paketobuildpacks image

Fixes vmware-tanzu#6851

Signed-off-by: Wenkai Yin(尹文开) <[email protected]>

* issue 6734: spread backup pod evenly

Signed-off-by: Lyndon-Li <[email protected]>

* Add doc links for new features to release note

Signed-off-by: allenxu404 <[email protected]>

* fix issue 6647

Signed-off-by: Lyndon-Li <[email protected]>

* Perf improvements for existing resource restore

Use informer cache with dynamic client for Get calls on restore
When enabled, also make the Get call before create.

Add server and install parameter to allow disabling this feature,
but enable by default

Signed-off-by: Scott Seago <[email protected]>

* issue vmware-tanzu#6807: Retry failed create when using generateName

When creating resources with generateName, apimachinery
does not guarantee uniqueness when it appends the random
suffix to the generateName stub, so if it fails with
already exists error, we need to retry.

Signed-off-by: Scott Seago <[email protected]>

* Import auth provider plugins

Signed-off-by: Sebastian Glab <[email protected]>

* Add v1.12.1 changelog

Signed-off-by: allenxu404 <[email protected]>

* Make Windows build skip BlockMode code.

PVC block mode backup and restore introduced some OS specific
system calls. Those calls are not available for Windows, so
add both non Windows version and Windows version code, and
return error for block mode on the Windows platform.

Signed-off-by: Xun Jiang <[email protected]>

* udmrepo use region specified in BSL when s3URL is empty

Signed-off-by: Lyndon-Li <[email protected]>

* Change v1.12.1 changelog

Signed-off-by: allenxu404 <[email protected]>

* Dockerfile.ubi/travis local files

add UBI dockerfiles
Use numeric user for velero-restic-restore-helper
Enable multiarch builds (#135)
Use arm64-graviton2 for arm builds (#137)
Add required keys for arm builds (#139)
Update Travis build job to work w/o changes on new branches
Use a full VM for arm
Use numeric non-root user for nonroot SCC compatibility

* Add BZ + Publish automation to repo (#82)

(cherry picked from commit ccb545f)

Update PR-BZ automation mapping (#84)

(cherry picked from commit aa2b019)

Update PR-BZ automation (#92)

Co-authored-by: Rayford Johnson <[email protected]>
(cherry picked from commit ecc563f)

Add publish workflow (#108)

(cherry picked from commit f87b779)

* remove dependabot config from fork

* Create Makefile.prow

Code-gen no longer required on verify

due to vmware-tanzu#6039

Signed-off-by: Tiger Kaovilai <[email protected]>

oadp-1.2: Update Makefile.prow to velero-restore-helper

* set HOME in velero image for kopia, update controller-gen for CI (#280)

Signed-off-by: Scott Seago <[email protected]>

* build velero-helper binary for datamover pod

* restore: Use warning when Create IsAlreadyExist and Get error

Signed-off-by: Tiger Kaovilai <[email protected]>

* kopia/repository/config/aws.go: Set session.Options profile from config

Signed-off-by: Tiger Kaovilai <[email protected]>

* use ubi9-latest to build

* OADP-4225: add tzdata to Dockerfile.ubi

* fix: CI (#316)

Signed-off-by: Mateus Oliveira <[email protected]>

* fix: ARM images (#332)

* fix: ARM images

Signed-off-by: Mateus Oliveira <[email protected]>

* fixup! fix: ARM images

Signed-off-by: Mateus Oliveira <[email protected]>

---------

Signed-off-by: Mateus Oliveira <[email protected]>

* ubi: BUILDPLATFORM to build stage to enable cross compile. (#336)

Signed-off-by: Tiger Kaovilai <[email protected]>

* OADP-4640: Downstream only to allow override kopia default algorithms (#334) (#338)

add missing unit test for kopia hashing algo (#337)

Introduction of downstream only option to override Kopia default:
 - hashing algorithm
 - splitting algorithm
 - encryption algorithm

With introduction of 3 environment variables it is possible to override
Kopia algorithms used by Velero:

KOPIA_HASHING_ALGORITHM
KOPIA_SPLITTER_ALGORITHM
KOPIA_ENCRYPTION_ALGORITHM

If the env algorithms are not set or they are not within
Kopia SupportedAlgorithms, the default algorithm will be used.
This behavior is consistent with current behavior without this
change.

Signed-off-by: Michal Pryc <[email protected]>
Signed-off-by: Shubham Pampattiwar <[email protected]>

* Downstream only: Rework of Makefile and incusion of lint

The rework of Makefile to make it more readable and
inclusion of lint as a target as well extract
golangci-lint version from the upstream Dockerfile,
so we test in PROW or locally on the same version as upstream.

Signed-off-by: Michal Pryc <[email protected]>

* Downstream only - fix lint error in downtream change (#343)

This fixes the PR #334 where one additional line was
in the code. This was not exposed previously as we
did not had downstream CI Lint jobs.

Signed-off-by: Michal Pryc <[email protected]>

* run oadp-operator e2e test from the velero repo (#353)

* run oadp-operator e2e test from the velero repo

execute openshift/oadp-operator e2e tests directly
against the velero repo locally or via prow ci

Signed-off-by: Wesley Hayutin <[email protected]>

* update variable names, add a cleanup

* make sure env variable overrides default velero_image

Signed-off-by: Wesley Hayutin <[email protected]>

* add options to build, push, and only test

Signed-off-by: Wesley Hayutin <[email protected]>

* add arch to name

Signed-off-by: Wesley Hayutin <[email protected]>

* remove duplicated clean/rm operator checkout

* simplify by dropping export var and use a oneliner

Co-authored-by: Tiger Kaovilai <[email protected]>

* drop export and use oneliner

Co-authored-by: Tiger Kaovilai <[email protected]>

* just in case, allow oadp to be deployed from makefile

Signed-off-by: Wesley Hayutin <[email protected]>

* Update Makefile.prow

Co-authored-by: Tiger Kaovilai <[email protected]>

---------

Signed-off-by: Wesley Hayutin <[email protected]>
Co-authored-by: Tiger Kaovilai <[email protected]>

* DS Owners

* updated controller-gen version

* Include velero-restore-helper binary in velero image (#375)

Co-authored-by: Scott Seago <[email protected]>

---------

Signed-off-by: Roger Zimmermann <[email protected]>
Signed-off-by: hu-keyu <[email protected]>
Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Xun Jiang <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Lyndon-Li <[email protected]>
Signed-off-by: lou <[email protected]>
Signed-off-by: Wenkai Yin(尹文开) <[email protected]>
Signed-off-by: Xun Jiang <[email protected]>
Signed-off-by: allenxu404 <[email protected]>
Signed-off-by: David Zaninovic <[email protected]>
Signed-off-by: Scott Seago <[email protected]>
Signed-off-by: Sebastian Glab <[email protected]>
Signed-off-by: Mateus Oliveira <[email protected]>
Signed-off-by: Michal Pryc <[email protected]>
Signed-off-by: Shubham Pampattiwar <[email protected]>
Signed-off-by: Wesley Hayutin <[email protected]>
Co-authored-by: Roger Zimmermann <[email protected]>
Co-authored-by: Tiger Kaovilai <[email protected]>
Co-authored-by: hu-keyu <[email protected]>
Co-authored-by: Tiger Kaovilai <[email protected]>
Co-authored-by: Xun Jiang <[email protected]>
Co-authored-by: lyndon-li <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Wenkai Yin(尹文开) <[email protected]>
Co-authored-by: Lyndon-Li <[email protected]>
Co-authored-by: Xun Jiang/Bruce Jiang <[email protected]>
Co-authored-by: Daniel Jiang <[email protected]>
Co-authored-by: lou <[email protected]>
Co-authored-by: Xun Jiang <[email protected]>
Co-authored-by: allenxu404 <[email protected]>
Co-authored-by: David Zaninovic <[email protected]>
Co-authored-by: Sebastian Glab <[email protected]>
Co-authored-by: Dylan Murray <[email protected]>
Co-authored-by: RayfordJ <[email protected]>
Co-authored-by: Mateus Oliveira <[email protected]>
Co-authored-by: Michal Pryc <[email protected]>
Co-authored-by: Wesley Hayutin <[email protected]>
Co-authored-by: OpenShift Cherrypick Robot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants