-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node-agent not waiting for PV removal #8772
Comments
Let me try to sort out what happens here.
If my understanding is correct, then the enhancement should modify the backup restore resource cleaning order. |
Thanks for the good catch and analysis. Please submit the PR and let's discuss in the PR for minor changes. The changes are based on the fact that the FlexClone has been fully deleted from the storage when the backupPV disappears from the cluster, otherwise, it still doesn't help. But I think Kubernetes and CSI driver should be working in this way and guarantee this. |
@blackpiglet Let's track the cleanup logics for error handling separately. Probably, for various cases in error handling, make |
@Lyndon-Li Indeed, K8s and the CSI driver should take care of this, assuming a correct CSI driver implementation. I will submit the PR ASAP. |
* ensure pv has been deleted Signed-off-by: Roger Zimmermann <[email protected]> * ensure delete pv unit test Signed-off-by: Roger Zimmermann <[email protected]> * comment, errors Signed-off-by: Roger Zimmermann <[email protected]> * updated changelog Signed-off-by: Roger Zimmermann <[email protected]> Signed-off-by: Roger Zimmermann <[email protected]> * pass value Co-authored-by: Tiger Kaovilai <[email protected]> Signed-off-by: Roger Zimmermann <[email protected]> * function renamed as suggested Signed-off-by: Roger Zimmermann <[email protected]> --------- Signed-off-by: Roger Zimmermann <[email protected]> Co-authored-by: Tiger Kaovilai <[email protected]>
* Issue vmware-tanzu#8772 ensure pv removed (vmware-tanzu#8777) * ensure pv has been deleted Signed-off-by: Roger Zimmermann <[email protected]> * ensure delete pv unit test Signed-off-by: Roger Zimmermann <[email protected]> * comment, errors Signed-off-by: Roger Zimmermann <[email protected]> * updated changelog Signed-off-by: Roger Zimmermann <[email protected]> Signed-off-by: Roger Zimmermann <[email protected]> * pass value Co-authored-by: Tiger Kaovilai <[email protected]> Signed-off-by: Roger Zimmermann <[email protected]> * function renamed as suggested Signed-off-by: Roger Zimmermann <[email protected]> --------- Signed-off-by: Roger Zimmermann <[email protected]> Co-authored-by: Tiger Kaovilai <[email protected]> * issue8720: log doesn't show pv name (vmware-tanzu#8771) * fix: log doesn't show pv name Signed-off-by: hu-keyu <[email protected]> * fix: add changelog Signed-off-by: hu-keyu <[email protected]> * update changelog fileName Signed-off-by: hu-keyu <[email protected]> --------- Signed-off-by: hu-keyu <[email protected]> * Bump kind cli to v0.27.0 (vmware-tanzu#8699) Signed-off-by: Tiger Kaovilai <[email protected]> * Modify how the restore workflow using the resource name. The restore workflow used name represents the backup resource and the restore to be restored, but the restored resource name may be different from the backup one, e.g. PV and VSC are global resources, to avoid conflict, need to rename them. Reanme the name variable to backupResourceName, and use obj.GetName() for restore operation. Signed-off-by: Xun Jiang <[email protected]> * Enable containerdv2 images Fixes vmware-tanzu#8648 Signed-off-by: Tiger Kaovilai <[email protected]> * Bump github.com/golang-jwt/jwt/v5 from 5.2.1 to 5.2.2 (vmware-tanzu#8806) Bumps [github.com/golang-jwt/jwt/v5](https://github.com/golang-jwt/jwt) from 5.2.1 to 5.2.2. - [Release notes](https://github.com/golang-jwt/jwt/releases) - [Changelog](https://github.com/golang-jwt/jwt/blob/main/VERSION_HISTORY.md) - [Commits](golang-jwt/jwt@v5.2.1...v5.2.2) --- updated-dependencies: - dependency-name: github.com/golang-jwt/jwt/v5 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Document schedule skipImmediately (vmware-tanzu#8802) Fixes vmware-tanzu#8787 Signed-off-by: Tiger Kaovilai <[email protected]> * issue 8803: use deterministic name to create backupRepository Signed-off-by: Lyndon-Li <[email protected]> * add third party annotation support for maintenance job Signed-off-by: Lyndon-Li <[email protected]> * Bump the golang.org/x/net to v0.36.0 to fix Restic CVE. Signed-off-by: Xun Jiang <[email protected]> * update readme and implemented design for 1.16 Signed-off-by: Lyndon-Li <[email protected]> * add doc for upgrade to 1.16 Signed-off-by: Lyndon-Li <[email protected]> * add 1.16 doc Signed-off-by: Lyndon-Li <[email protected]> * Fix the non data mover migration failure. Migration cases use the Kibishii as the workload, and SC mapping ConfigMap was needed for all scenarios, because standby cluster doesn't have the Kibishii SC after setting up. Signed-off-by: Xun Jiang <[email protected]> * add 1.16 changelog Signed-off-by: Lyndon-Li <[email protected]> * Bump the migration and upgrade E2E test source version. Add v1.16 related plugin and other image default version. Signed-off-by: Xun Jiang <[email protected]> * Align the E2E upgrade test's CLI and image version. Signed-off-by: Xun Jiang <[email protected]> * pin velero image Signed-off-by: Lyndon-Li <[email protected]> * skip subresource in resource discovery (vmware-tanzu#6688) Signed-off-by: lou <[email protected]> Co-authored-by: lou <[email protected]> * fix issue 6753 Signed-off-by: Lyndon-Li <[email protected]> * Update restore controller logic for restore deletion (vmware-tanzu#6761) 1. Skip deleting the restore files from storage if the backup/BSL is not found 2. Allow deleting the restore files from storage even though the BSL is readonly Signed-off-by: Wenkai Yin(尹文开) <[email protected]> * Fix vmware-tanzu#6752: add namespace exclude check. Add PSA audit and warn labels. Signed-off-by: Xun Jiang <[email protected]> * add csi snapshot data movement doc Signed-off-by: Lyndon-Li <[email protected]> * Modify changelogs for v1.12 Signed-off-by: allenxu404 <[email protected]> * issue 6786:always delete VSC regardless of the deletion policy Signed-off-by: Lyndon-Li <[email protected]> * issue: move plugin depdending podvolume functions to util pkg Signed-off-by: Lyndon-Li <[email protected]> * issue 6880: set ParallelUploadAboveSize as MaxInt64 Signed-off-by: Lyndon-Li <[email protected]> * changelog Signed-off-by: Tiger Kaovilai <[email protected]> * Add support for block volumes (vmware-tanzu#6680) (vmware-tanzu#6897) (cherry picked from commit 8e01d1b) Signed-off-by: David Zaninovic <[email protected]> * Replace the base image with paketobuildpacks image Replace the base image with paketobuildpacks image Fixes vmware-tanzu#6851 Signed-off-by: Wenkai Yin(尹文开) <[email protected]> * issue 6734: spread backup pod evenly Signed-off-by: Lyndon-Li <[email protected]> * Add doc links for new features to release note Signed-off-by: allenxu404 <[email protected]> * fix issue 6647 Signed-off-by: Lyndon-Li <[email protected]> * Perf improvements for existing resource restore Use informer cache with dynamic client for Get calls on restore When enabled, also make the Get call before create. Add server and install parameter to allow disabling this feature, but enable by default Signed-off-by: Scott Seago <[email protected]> * issue vmware-tanzu#6807: Retry failed create when using generateName When creating resources with generateName, apimachinery does not guarantee uniqueness when it appends the random suffix to the generateName stub, so if it fails with already exists error, we need to retry. Signed-off-by: Scott Seago <[email protected]> * Import auth provider plugins Signed-off-by: Sebastian Glab <[email protected]> * Add v1.12.1 changelog Signed-off-by: allenxu404 <[email protected]> * Make Windows build skip BlockMode code. PVC block mode backup and restore introduced some OS specific system calls. Those calls are not available for Windows, so add both non Windows version and Windows version code, and return error for block mode on the Windows platform. Signed-off-by: Xun Jiang <[email protected]> * udmrepo use region specified in BSL when s3URL is empty Signed-off-by: Lyndon-Li <[email protected]> * Change v1.12.1 changelog Signed-off-by: allenxu404 <[email protected]> * Dockerfile.ubi/travis local files add UBI dockerfiles Use numeric user for velero-restic-restore-helper Enable multiarch builds (#135) Use arm64-graviton2 for arm builds (#137) Add required keys for arm builds (#139) Update Travis build job to work w/o changes on new branches Use a full VM for arm Use numeric non-root user for nonroot SCC compatibility * Add BZ + Publish automation to repo (#82) (cherry picked from commit ccb545f) Update PR-BZ automation mapping (#84) (cherry picked from commit aa2b019) Update PR-BZ automation (#92) Co-authored-by: Rayford Johnson <[email protected]> (cherry picked from commit ecc563f) Add publish workflow (#108) (cherry picked from commit f87b779) * remove dependabot config from fork * Create Makefile.prow Code-gen no longer required on verify due to vmware-tanzu#6039 Signed-off-by: Tiger Kaovilai <[email protected]> oadp-1.2: Update Makefile.prow to velero-restore-helper * set HOME in velero image for kopia, update controller-gen for CI (#280) Signed-off-by: Scott Seago <[email protected]> * build velero-helper binary for datamover pod * restore: Use warning when Create IsAlreadyExist and Get error Signed-off-by: Tiger Kaovilai <[email protected]> * kopia/repository/config/aws.go: Set session.Options profile from config Signed-off-by: Tiger Kaovilai <[email protected]> * use ubi9-latest to build * OADP-4225: add tzdata to Dockerfile.ubi * fix: CI (#316) Signed-off-by: Mateus Oliveira <[email protected]> * fix: ARM images (#332) * fix: ARM images Signed-off-by: Mateus Oliveira <[email protected]> * fixup! fix: ARM images Signed-off-by: Mateus Oliveira <[email protected]> --------- Signed-off-by: Mateus Oliveira <[email protected]> * ubi: BUILDPLATFORM to build stage to enable cross compile. (#336) Signed-off-by: Tiger Kaovilai <[email protected]> * OADP-4640: Downstream only to allow override kopia default algorithms (#334) (#338) add missing unit test for kopia hashing algo (#337) Introduction of downstream only option to override Kopia default: - hashing algorithm - splitting algorithm - encryption algorithm With introduction of 3 environment variables it is possible to override Kopia algorithms used by Velero: KOPIA_HASHING_ALGORITHM KOPIA_SPLITTER_ALGORITHM KOPIA_ENCRYPTION_ALGORITHM If the env algorithms are not set or they are not within Kopia SupportedAlgorithms, the default algorithm will be used. This behavior is consistent with current behavior without this change. Signed-off-by: Michal Pryc <[email protected]> Signed-off-by: Shubham Pampattiwar <[email protected]> * Downstream only: Rework of Makefile and incusion of lint The rework of Makefile to make it more readable and inclusion of lint as a target as well extract golangci-lint version from the upstream Dockerfile, so we test in PROW or locally on the same version as upstream. Signed-off-by: Michal Pryc <[email protected]> * Downstream only - fix lint error in downtream change (#343) This fixes the PR #334 where one additional line was in the code. This was not exposed previously as we did not had downstream CI Lint jobs. Signed-off-by: Michal Pryc <[email protected]> * run oadp-operator e2e test from the velero repo (#353) * run oadp-operator e2e test from the velero repo execute openshift/oadp-operator e2e tests directly against the velero repo locally or via prow ci Signed-off-by: Wesley Hayutin <[email protected]> * update variable names, add a cleanup * make sure env variable overrides default velero_image Signed-off-by: Wesley Hayutin <[email protected]> * add options to build, push, and only test Signed-off-by: Wesley Hayutin <[email protected]> * add arch to name Signed-off-by: Wesley Hayutin <[email protected]> * remove duplicated clean/rm operator checkout * simplify by dropping export var and use a oneliner Co-authored-by: Tiger Kaovilai <[email protected]> * drop export and use oneliner Co-authored-by: Tiger Kaovilai <[email protected]> * just in case, allow oadp to be deployed from makefile Signed-off-by: Wesley Hayutin <[email protected]> * Update Makefile.prow Co-authored-by: Tiger Kaovilai <[email protected]> --------- Signed-off-by: Wesley Hayutin <[email protected]> Co-authored-by: Tiger Kaovilai <[email protected]> * DS Owners * updated controller-gen version * Include velero-restore-helper binary in velero image (#375) Co-authored-by: Scott Seago <[email protected]> --------- Signed-off-by: Roger Zimmermann <[email protected]> Signed-off-by: hu-keyu <[email protected]> Signed-off-by: Tiger Kaovilai <[email protected]> Signed-off-by: Xun Jiang <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Lyndon-Li <[email protected]> Signed-off-by: lou <[email protected]> Signed-off-by: Wenkai Yin(尹文开) <[email protected]> Signed-off-by: Xun Jiang <[email protected]> Signed-off-by: allenxu404 <[email protected]> Signed-off-by: David Zaninovic <[email protected]> Signed-off-by: Scott Seago <[email protected]> Signed-off-by: Sebastian Glab <[email protected]> Signed-off-by: Mateus Oliveira <[email protected]> Signed-off-by: Michal Pryc <[email protected]> Signed-off-by: Shubham Pampattiwar <[email protected]> Signed-off-by: Wesley Hayutin <[email protected]> Co-authored-by: Roger Zimmermann <[email protected]> Co-authored-by: Tiger Kaovilai <[email protected]> Co-authored-by: hu-keyu <[email protected]> Co-authored-by: Tiger Kaovilai <[email protected]> Co-authored-by: Xun Jiang <[email protected]> Co-authored-by: lyndon-li <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Wenkai Yin(尹文开) <[email protected]> Co-authored-by: Lyndon-Li <[email protected]> Co-authored-by: Xun Jiang/Bruce Jiang <[email protected]> Co-authored-by: Daniel Jiang <[email protected]> Co-authored-by: lou <[email protected]> Co-authored-by: Xun Jiang <[email protected]> Co-authored-by: allenxu404 <[email protected]> Co-authored-by: David Zaninovic <[email protected]> Co-authored-by: Sebastian Glab <[email protected]> Co-authored-by: Dylan Murray <[email protected]> Co-authored-by: RayfordJ <[email protected]> Co-authored-by: Mateus Oliveira <[email protected]> Co-authored-by: Michal Pryc <[email protected]> Co-authored-by: Wesley Hayutin <[email protected]> Co-authored-by: OpenShift Cherrypick Robot <[email protected]>
What steps did you take and what happened:
The node-agent does not wait for the PV to be removed on the cluster before deleting the VolumeSnapshot. As we are using NetApp Trident and NetApp ONTAP based storage this affects us in the following way:
Most of the time the NetApp FlexClone that is made available as PVC for the data upload is still in a deleting state on the NetApp filer when the VolumeSnapshot is deleted as part of the cleanup. As a consequence, NetApp Trident identifies the snapshot that is the source of the FlexClone as 'busy' and will perform a split clone operation, turning the FlexClone into a real clone.
This means that all the data will be copied into an independent volume and will be deleted afterwards. While the waste of space is temporary, it can take hours to perform this operation and may have the potential to fill up the storage. Trident is not able to remove the volume that is the base for the PVC as long as the clone operation is in progress.
I do not think this is a problem with NetApp Trident!
Commands
velero install --provider aws
--bucket velero
--plugins velero/velero-plugin-for-aws:v1.11.0
--secret-file ./secret-file
--use-volume-snapshots=true
--backup-location-config region="minio",s3ForcePathStyle="true",s3Url=http://xxx.yyy.zzz.aaa:9000
--snapshot-location-config region="minio"
--use-node-agent
--features=EnableCSI
--node-agent-configmap=node-agent-config
velero backup create test01 --include-namespaces nginx --snapshot-move-data
What did you expect to happen:
node-agent waits for the PV to disappear before deleting the VolumeSnapshot.
The following information will help us better understand what's going on:
PoC code to wait for the PV delete. When running the forked node-agent the split clones no longer happen:
main...ix-rzi:velero:fix-ensure-pv-removed
I'm happy to discuss and open a PR.
Environment:
Note, this is a local dev/test environment!
/etc/os-release
):NAME="Fedora Linux"
VERSION="41 (Server Edition)"
RELEASE_TYPE=stable
ID=fedora
VERSION_ID=41
VERSION_CODENAME=""
PLATFORM_ID="platform:f41"
PRETTY_NAME="Fedora Linux 41 (Server Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:41"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f41/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=41
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=41
SUPPORT_END=2025-12-15
VARIANT="Server Edition"
VARIANT_ID=server
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.
The text was updated successfully, but these errors were encountered: