Webhook not injecting Env Vars or Volumes/VolumeMounts on initial deployment to a new cluster #174

ahuffman · 2023-02-07T23:57:27Z

What happened:
Deploying multiple services on my cluster such as cluster-autoscaler, external-dns, ebs-csi-drivers. On initial deployment the pods do not receive the environment vars, volumes, and volumeMounts.

When I manually delete the affected pods after automated deployment, I get everything as expected from the webhook.

I've followed every possible AWS document on troubleshooting IRSA. I initially thought it could be a race condition post cluster instantiation, but I tested delaying the deployments as long as 10 minutes and the results are the same.

What you expected to happen:
Environment vars, volumes, and volumeMounts are injected into the deployment's pod specs without need for manual deletion of the pods.

How to reproduce it (as minimally and precisely as possible):
Create an EKS cluster, create an IAM OIDC provider, Create a IAM Policy, Create a Role, Attach the Policy to the Role, Create a trust relationship in the role referring to the OIDC provider and the Kubernetes service account, and finally do a helm release with appropriate values to specify the corresponding namespace, service account name, and annotations required for the roleARN to tie it all together.

Anything else we need to know?:
Not that I can think of, but feel free to ask for more :).

Environment:

AWS Region: us-east-1 (have tested in many with same result)
EKS Platform version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
Kubernetes version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.version): 1.24 (also tested 1.23 with same result)
Webhook Version: ? whatever comes with EKS 1.24.8

The text was updated successfully, but these errors were encountered:

mohammadasim · 2023-02-08T11:34:04Z

I am experiencing the same issue. Our cluster is a Kops managed cluster. We deployed a service that had two replicas. We noticed that one of the pod was able to access the s3 bucket but the other wasn't. On investigation, the pod that was not able to access the bucket didn't have AWS_ROLE_ARN AWS_WEB_IDENTITY_TOKEN_FILE environment variables. The pod that was able to access the s3 bucket had these environment variables. I checked both pods had the same service account. When I deleted the pod with the missing environment variables, the new pod created had these two environment variables.
My kubernetes version is v1.22.5. The amazon-eks-pod-identity-webhook version is as follows
Image: amazon/amazon-eks-pod-identity-webhook:latest Image ID: docker-pullable://amazon/amazon-eks-pod-identity-webhook@sha256:4a3ff337b6549dd29a06451945e40ba3385729c879f09f264d3e37d06cbf001a
Any information will be highly appreciated.

cjellick · 2023-03-03T14:59:35Z

This looks very similar to an issue we are hitting. Here was our conclusion (courtesy of @StrongMonkey):

When diving deep, this is because a race condition in aws-identity pod where serviceaccount and pod get created at the same time. It uses cache to fetch serviceaccount, which might not be ready when the pod was created.

@jsilverio22 should be able to share a WIP PR soon.

cjellick · 2023-03-03T15:19:13Z

In our case, we are probably exacerbating the problem by:

creating the deployment before we create the serviceAccount (Its programatic and very close together, so there isn't a long delay, but maybe just enough for a race)
Using this webhook in k3s, which might make it worse because in an HA setup in k3s watch events could get delayed 2 seconds

ahuffman · 2023-03-03T16:27:05Z

In my case, I'm running on EKS, however I'm performing my deployments via Helm charts, where the ServiceAccounts and related annotations are being deployed via the values at the same time (using the charts).

I also tried pre-provisioning the ServiceAccount with the annotations, but it did not change the behavior, and lead me to believe it was something else altogether.

In a similar situation to @cjellick , my entire cluster provisioning process is being done programmatically via Crossplane. Just to reiterate, I do not believe it's a problem with my configurations, because when I delete the pods after their first instantiation everything works fine, but the initial deployment of the pods do not pick up their IRSA privileges.

ekristen · 2023-09-25T19:17:33Z

I'm seeing this problem on simple re-deployments. Like a deployment recreates the pods for reasons (like moving nodes) and environment variables simply won't be put in place. I'm running the pod identity webhook on multiple nodes so there should always be one online to respond.

rlister · 2024-02-22T15:45:46Z

This is a blocker for us to adopt pod identity (as a switch from IRSA). We create IAM Role (using ACK controller), PodIdentityAssociation, ServiceAccount and Deployment in the same helm chart. On initial install, pods come up without the identity mutation, winning a race we want them to lose. This occurs even hardcoding roleARN in the PodIdentityAssociation, ie we are not waiting for status on the Role.

Installing PodIdentityAssociation in a helm pre-install hook does not help: the resource comes up and is ready very quickly, but our pods still come up before mutation is ready.

Waiting a minute and restarting Deployment gives us new pods with correct mutations.

johan-vandeweerd-hs · 2024-11-19T14:04:11Z

We are use using Argocd to deploy our services and are also looking at pod identity. Our services are deployed using a Helm chart and we added a PodIdentityAssociation custom resource to our Helm chart so the ACK EKS controller can create the association for us (the IAM role is created with Terraform before the service is deployed).

To try and solve the race condition, we added an Argocd preSync hook on the PodIdentityAssociation customer resource (source). But even with the preSync hook, the race condition still exists like @rlister mentioned.

In an effort to solve this issue, we added a customisation to Argocd and only mark the PodIdentityAssociation custom resource as Healthy when an associationId is filled out in the status field of the PodIdentityAssociation custom resource (source). With this in place, the pod identity association is created before the deployment is created but the necessary environment variables on the pods are still not injected and we still have to restart our pods for it to work.

We verified that the Argocd customisation works by downscaling the ACK EKS controller to zero before creating the Argocd application of the service. When the Argocd application of the service is added to Argocd, all the resources are out of sync and Argocd is waiting for the PodIdentityAssociation to be Healthy. Argocd shows the message waiting for completion of hook eks.services.k8s.aws/PodIdentityAssociation/<service-name> until the pod identity association is created. When ACK EKS controller is scaled up, the pod identity association is created and the rest of the Argocd application is synced but the pods still don't have the necessary environment variables injected until the pods are restarted.

taraspos · 2024-12-06T12:13:43Z

In our case, in a big EKS cluster where ServiceAccount, EKSPodIdentityAssociations and Deployment are created immediately one after another, we noticed this problem happening almost every time. Adding 5 seconds of sleep time before creating deployment resolved the issue in ~90% of cases but it still happens in some cases.

jsilverio22 mentioned this issue Mar 3, 2023

add check for SA to client if not found in cache #178

Open

maximethebault mentioned this issue Sep 29, 2023

Webhook stops injecting Env Vars or Volumes/VolumeMounts on previously working deployment #194

Open

mikkeloscar mentioned this issue Jul 5, 2024

eks: Use known ipv4 service CIDR range zalando-incubator/kubernetes-on-aws#7805

Merged

This was referenced Aug 13, 2024

[WIP] Fix race condition between service account availability and webhook invocation c445/amazon-eks-pod-identity-webhook#1

Closed

Fix race condition between service account availability and webhook invocation #236

Merged

This was referenced Sep 13, 2024

Refactor annotations parser #239

Open

Add SA lookup grace period pod annotations #240

Open

modulitos mentioned this issue Oct 21, 2024

fetch SAs from apiserver #242

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Webhook not injecting Env Vars or Volumes/VolumeMounts on initial deployment to a new cluster #174

Webhook not injecting Env Vars or Volumes/VolumeMounts on initial deployment to a new cluster #174

ahuffman commented Feb 7, 2023

mohammadasim commented Feb 8, 2023

cjellick commented Mar 3, 2023 •

edited

Loading

cjellick commented Mar 3, 2023

ahuffman commented Mar 3, 2023

ekristen commented Sep 25, 2023

rlister commented Feb 22, 2024

johan-vandeweerd-hs commented Nov 19, 2024

taraspos commented Dec 6, 2024

Webhook not injecting Env Vars or Volumes/VolumeMounts on initial deployment to a new cluster #174

Webhook not injecting Env Vars or Volumes/VolumeMounts on initial deployment to a new cluster #174

Comments

ahuffman commented Feb 7, 2023

mohammadasim commented Feb 8, 2023

cjellick commented Mar 3, 2023 • edited Loading

cjellick commented Mar 3, 2023

ahuffman commented Mar 3, 2023

ekristen commented Sep 25, 2023

rlister commented Feb 22, 2024

johan-vandeweerd-hs commented Nov 19, 2024

taraspos commented Dec 6, 2024

cjellick commented Mar 3, 2023 •

edited

Loading