Velero 1.14.1 & aws plugin 1.10.1: using wrong role with IRSA #8240

Open · lrstanley opened this issue Sep 23, 2024 · 0 comments
What steps did you take and what happened:

It looks as though the latest version of the velero-plugin-for-aws plugin is incorrectly handling IRSA: it appears to be using the node's attached role rather than the role attached to the service account.

What did you expect to happen:

If an IRSA role is attached to the service account Velero is using, I would expect it to use that role.

The following information will help us better understand what's going on:

I'm unable to provide a support bundle due to the sensitivity of this cluster. With that said, hopefully this is enough information.
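
As a first diagnostic (a minimal sketch, assuming kubectl exec access into the Velero pod), it's worth checking the two environment variables the EKS pod identity webhook injects when the IRSA annotation is honored; if either is empty, the SDK has nothing to assume via web identity and falls back to IMDS:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Injected by the EKS pod identity webhook when the
	// eks.amazonaws.com/role-arn annotation is picked up. If either is
	// empty inside the Velero pod, the SDK falls back to the instance
	// profile (IMDS), i.e. the worker node role.
	for _, key := range []string{"AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE"} {
		fmt.Printf("%s=%q\n", key, os.Getenv(key))
	}
}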

Errors:

time="2024-09-23T20:45:30Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/1/0, BackupStorageLocation \"default\" is unavailable: rpc error: code = Unknown desc = operation error S3: ListObjectsV2, https response error StatusCode: 403, RequestID: TRUNCATED, HostID: TRUNCATED, api error AccessDenied: User: arn:aws:sts::TRUNCATED:assumed-role/TRUNCATED-worker/TRUNCATED is not authorized to perform: s3:ListBucket on resource: \"arn:aws:s3:::TRUNCATED-prod-velero\" because no identity-based policy allows the s3:ListBucket action)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:178"
time="2024-09-23T20:45:30Z" level=info msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location id=200 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-09-23T20:46:20Z" level=error msg="Error listing backups in backup store" backupLocation=velero/default controller=backup-sync error="rpc error: code = Unknown desc = operation error S3: ListObjectsV2, https response error StatusCode: 403, RequestID: TRUNCATED, HostID: TRUNCATED, api error AccessDenied: User: arn:aws:sts::TRUNCATED:assumed-role/TRUNCATED-worker/TRUNCATED is not authorized to perform: s3:ListBucket on resource: \"arn:aws:s3:::TRUNCATED-prod-velero\" because no identity-based policy allows the s3:ListBucket action" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:351" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:109"
time="2024-09-23T20:46:20Z" level=info msg="plugin process exited" backupLocation=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-sync id=213 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws

Helm chart configuration:

resources:
  requests:
    cpu: 250m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 1.5Gi
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
configuration:
  extraEnvVars:
    GOMEMLIMIT: 1024MiB
  fsBackupTimeout: 480m
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: TRUNCATED-${attribute_aws_account_env}-velero
      default: true
      config:
        region: us-east-1
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: us-east-1
  # Set to true to back up all pod volumes without having to annotate each pod when using file system backup. Default: false.
  defaultVolumesToFsBackup: false
backupsEnabled: true
snapshotsEnabled: true
deployNodeAgent: true
# credentials:
#   useSecret: true
#   existingSecret: velero-s3
serviceAccount:
  server:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: "${attribute_role_arn_velero}"
metrics:
  serviceMonitor:
    enabled: true
  prometheusRule:
    enabled: true
    spec:
      - alert: VeleroBackupPartialFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partially failed backups.
        expr: |-
          velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning
      - alert: VeleroBackupFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.
        expr: |-
          velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning
nodeAgent:
  extraEnvVars:
    GOMEMLIMIT: 2048MiB
  resources:
    requests:
      cpu: 250m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
schedules:
  default-weekly:
    schedule: "0 3 * * 6"
    template:
      includedNamespaces:
        - "*"
      ttl: 1440h0m0s
      storageLocation: default
    useOwnerReferencesInBackup: false
  default-daily:
    schedule: "0 5 * * *"
    template:
      includedNamespaces:
        - "*"
      ttl: 168h0m0s
      storageLocation: default
    useOwnerReferencesInBackup: false

Service account YAML, taken directly from the cluster, showing the appropriate annotation:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::TRUNCATED:role/TRUNCATED-velero
    meta.helm.sh/release-name: velero
    meta.helm.sh/release-namespace: velero
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-7.2.1
  name: velero-server
  namespace: velero
  # [...]

Given the above output, it looks like Velero is using the default role from IMDS (the EC2 worker role), not the IRSA role. Worth noting that prior to this version we were on 1.10.x, and IRSA was working without issue. It looks like the plugin's switch to the AWS SDK for Go v2 has caused the regression.
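
If the regression is in how the plugin builds its SDK v2 config, one possible workaround sketch (hypothetical code, not the plugin's actual implementation) is to wire the web identity provider explicitly from the IRSA-injected environment instead of relying on the default chain:

package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	ctx := context.Background()

	// Base config used only to construct the STS client; region comes
	// from the environment (AWS_REGION).
	base, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Force assume-role-with-web-identity using the IRSA-injected values,
	// bypassing whatever credential source the default chain picked first.
	provider := stscreds.NewWebIdentityRoleProvider(
		sts.NewFromConfig(base),
		os.Getenv("AWS_ROLE_ARN"),
		stscreds.IdentityTokenFile(os.Getenv("AWS_WEB_IDENTITY_TOKEN_FILE")),
	)

	cfg, err := config.LoadDefaultConfig(ctx,
		config.WithCredentialsProvider(aws.NewCredentialsCache(provider)),
	)
	if err != nil {
		log.Fatal(err)
	}

	// An S3 client built from this config should act as the IRSA role.
	_ = s3.NewFromConfig(cfg)
}

stscreds.NewWebIdentityRoleProvider and stscreds.IdentityTokenFile are the stock SDK v2 helpers for this flow, so the fix may just be a matter of the plugin restoring that wiring.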

May also be related to the following issues:

Environment:

  • Velero version (use velero version): 1.14.1
  • Velero features (use velero client config get features): n/a
  • Kubernetes version (use kubectl version): 1.24 (though based on how IRSA works, I don't think the older version should be an issue).
  • Kubernetes installer & version: EKS
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): n/a

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"