Velero 1.14.1 & aws plugin 1.10.1: using wrong role with IRSA #8240

Open · lrstanley opened this issue Sep 23, 2024 · 0 comments
What steps did you take and what happened:

It looks as though the latest version of the velero-plugin-for-aws plugin is incorrectly handling IRSA: it appears to be using the node's attached role rather than the role attached to the service account.

What did you expect to happen:

If an IRSA role is attached to the service account Velero is using, I would expect it to use that role.

The following information will help us better understand what's going on:

I'm unable to provide a support bundle due to the sensitivity of this cluster. With that said, hopefully this is enough information.
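
As a first diagnostic (a minimal sketch, assuming kubectl exec access into the Velero pod), it's worth checking the two environment variables the EKS pod identity webhook injects when the IRSA annotation is honored; if either is empty, the SDK has nothing to assume via web identity and falls back to IMDS:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Injected by the EKS pod identity webhook when the
	// eks.amazonaws.com/role-arn annotation is picked up. If either is
	// empty inside the Velero pod, the SDK falls back to the instance
	// profile (IMDS), i.e. the worker node role.
	for _, key := range []string{"AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE"} {
		fmt.Printf("%s=%q\n", key, os.Getenv(key))
	}
}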

Errors:

time="2024-09-23T20:45:30Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/1/0, BackupStorageLocation \"default\" is unavailable: rpc error: code = Unknown desc = operation error S3: ListObjectsV2, https response error StatusCode: 403, RequestID: TRUNCATED, HostID: TRUNCATED, api error AccessDenied: User: arn:aws:sts::TRUNCATED:assumed-role/TRUNCATED-worker/TRUNCATED is not authorized to perform: s3:ListBucket on resource: \"arn:aws:s3:::TRUNCATED-prod-velero\" because no identity-based policy allows the s3:ListBucket action)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:178"
time="2024-09-23T20:45:30Z" level=info msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location id=200 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-09-23T20:46:20Z" level=error msg="Error listing backups in backup store" backupLocation=velero/default controller=backup-sync error="rpc error: code = Unknown desc = operation error S3: ListObjectsV2, https response error StatusCode: 403, RequestID: TRUNCATED, HostID: TRUNCATED, api error AccessDenied: User: arn:aws:sts::TRUNCATED:assumed-role/TRUNCATED-worker/TRUNCATED is not authorized to perform: s3:ListBucket on resource: \"arn:aws:s3:::TRUNCATED-prod-velero\" because no identity-based policy allows the s3:ListBucket action" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:351" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:109"
time="2024-09-23T20:46:20Z" level=info msg="plugin process exited" backupLocation=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-sync id=213 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws

Helm chart configuration:

resources:
  requests:
    cpu: 250m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 1.5Gi
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
configuration:
  extraEnvVars:
    GOMEMLIMIT: 1024MiB
  fsBackupTimeout: 480m
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: TRUNCATED-${attribute_aws_account_env}-velero
      default: true
      config:
        region: us-east-1
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: us-east-1
  # Set to true to back up all pod volumes without having to annotate each pod when using file system backup. Default: false.
  defaultVolumesToFsBackup: false
backupsEnabled: true
snapshotsEnabled: true
deployNodeAgent: true
# credentials:
#   useSecret: true
#   existingSecret: velero-s3
serviceAccount:
  server:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: "${attribute_role_arn_velero}"
metrics:
  serviceMonitor:
    enabled: true
  prometheusRule:
    enabled: true
    spec:
      - alert: VeleroBackupPartialFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partially failed backups.
        expr: |-
          velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning
      - alert: VeleroBackupFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.
        expr: |-
          velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning
nodeAgent:
  extraEnvVars:
    GOMEMLIMIT: 2048MiB
  resources:
    requests:
      cpu: 250m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
schedules:
  default-weekly:
    schedule: "0 3 * * 6"
    template:
      includedNamespaces:
        - "*"
      ttl: 1440h0m0s
      storageLocation: default
    useOwnerReferencesInBackup: false
  default-daily:
    schedule: "0 5 * * *"
    template:
      includedNamespaces:
        - "*"
      ttl: 168h0m0s
      storageLocation: default
    useOwnerReferencesInBackup: false

Service account YAML, taken directly from the cluster, showing the appropriate annotation:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::TRUNCATED:role/TRUNCATED-velero
    meta.helm.sh/release-name: velero
    meta.helm.sh/release-namespace: velero
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-7.2.1
  name: velero-server
  namespace: velero
  # [...]

Given the above output, it looks like Velero is using the default role from IMDS (the EC2 worker role), not the IRSA role. Worth noting that prior to this version we were on 1.10.x, and IRSA was working without issue. It looks like the plugin's switch to the AWS SDK for Go v2 has caused the regression.
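
If the regression is in how the plugin builds its SDK v2 config, one possible workaround sketch (hypothetical code, not the plugin's actual implementation) is to wire the web identity provider explicitly from the IRSA-injected environment instead of relying on the default chain:

package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	ctx := context.Background()

	// Base config used only to construct the STS client; region comes
	// from the environment (AWS_REGION).
	base, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Force assume-role-with-web-identity using the IRSA-injected values,
	// bypassing whatever credential source the default chain picked first.
	provider := stscreds.NewWebIdentityRoleProvider(
		sts.NewFromConfig(base),
		os.Getenv("AWS_ROLE_ARN"),
		stscreds.IdentityTokenFile(os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE")),
	)

	cfg, err := config.LoadDefaultConfig(ctx,
		config.WithCredentialsProvider(aws.NewCredentialsCache(provider)),
	)
	if err != nil {
		log.Fatal(err)
	}

	// An S3 client built from this config should act as the IRSA role.
	_ = s3.NewFromConfig(cfg)
}

stscreds.NewWebIdentityRoleProvider and stscreds.IdentityTokenFile are the stock SDK v2 helpers for this flow, so the fix may just be a matter of the plugin restoring that wiring.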

May also be related to the following issues:

Environment:

  • Velero version (use velero version): 1.14.1
  • Velero features (use velero client config get features): n/a
  • Kubernetes version (use kubectl version): 1.24 (though based on how IRSA works, I don't think the older version should be an issue).
  • Kubernetes installer & version: EKS
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): n/a

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"