-
Notifications
You must be signed in to change notification settings - Fork 271
Description
Describe the bug
The issue is potentially similar to #2103 but #2103 doesn't name the root cause so might be a separate bug.
We have observed the following:
- an ACK
Policywas created - ACK was preempted and thus moved to a different node/restarted
- after the restart, the policy got stuck with status
EntityAlreadyExists
Looking at the AWS Console, we see that the policy is there and it was created at May 23, 2025, 21:00 (UTC+02:00).
The k8s resource's creation timestamp is "2025-05-23T19:00:18Z".
Looking at the logs, we see that the creation timestamp coincides with the following error:
2025-05-23T19:00:19.808672231Z
error: RequestCanceled: request context canceled
It appears that the resource creation went through on the AWS side, and because of the cancelled context ACK IAM controller failed to update the status on the k8s resource, and thus after the restart it started treating it as a duplicate.
The k8s resource:
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Policy
metadata:
creationTimestamp: "2025-05-23T19:00:18Z"
finalizers:
- finalizers.iam.services.k8s.aws/Policy
generation: 1
labels:
<redacted>
name: <redacted>-policy
namespace: default
ownerReferences:
< ... >
resourceVersion: "22454815"
uid: da98fe89-6d62-4e6d-a8d7-db3a34a77f65
spec:
description: <redacted>
name: <redacted>-policy
path: /
policyDocument: <redacted>
tags:
< ... >
status:
ackResourceMetadata:
ownerAccountID: "<redacted>"
region: ap-east-1
conditions:
- message: "EntityAlreadyExists: A policy called <redacted>-policy already
exists. Duplicate names are not allowed.\n\tstatus code: 409, request id: <some id>"
status: "True"
type: ACK.Recoverable
- lastTransitionTime: "2025-05-26T13:16:49Z"
message: Unable to determine if desired resource state matches latest observed
state
reason: "EntityAlreadyExists: A policy called <redacted>-policy already
exists. Duplicate names are not allowed.\n\tstatus code: 409, request id: <some id>"
status: Unknown
type: ACK.ResourceSynced
Logs:
2025-05-23T19:00:19.804592193Z
message: Stopping and waiting for non leader election runnables
2025-05-23T19:00:19.804615114Z
message: Stopping and waiting for leader election runnables
2025-05-23T19:00:19.804622027Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804626056Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804629869Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804633073Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804636600Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804647902Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804652195Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804655897Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804660016Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804662889Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804764922Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804771343Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804774617Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804777556Z
message: Shutdown signal received, waiting for all workers to finish
2025-05-23T19:00:19.804780706Z
message: All workers finished
2025-05-23T19:00:19.804784236Z
message: All workers finished
2025-05-23T19:00:19.804795366Z
message: All workers finished
2025-05-23T19:00:19.804798865Z
message: All workers finished
2025-05-23T19:00:19.804802152Z
message: All workers finished
2025-05-23T19:00:19.804806649Z
message: All workers finished
2025-05-23T19:00:19.804815436Z
message: All workers finished
2025-05-23T19:00:19.804820649Z
message: All workers finished
2025-05-23T19:00:19.804824164Z
message: All workers finished
2025-05-23T19:00:19.804826969Z
message: All workers finished
2025-05-23T19:00:19.804852450Z
message: All workers finished
2025-05-23T19:00:19.804856040Z
message: All workers finished
2025-05-23T19:00:19.804859441Z
message: All workers finished
2025-05-23T19:00:19.808672231Z
error: RequestCanceled: request context canceled
caused by: context canceled
message: Reconciler error
stacktrace: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224
2025-05-23T19:00:19.808687782Z
message: All workers finished
2025-05-23T19:00:19.808692474Z
message: Stopping and waiting for caches
2025-05-23T19:00:19.808697464Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.OpenIDConnectProvider ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-05-23T19:00:19.808700987Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.AdoptedResource ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-05-23T19:00:19.808730730Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.Role ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-05-23T19:00:19.808734290Z
message: Stopping and waiting for webhooks
2025-05-23T19:00:19.808743872Z
message: Stopping and waiting for HTTP servers
2025-05-23T19:00:19.808747360Z
message: pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1alpha1.Role ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2025-05-23T19:00:19.808751152Z
message: Shutting down metrics server with timeout of 1 minute
2025-05-23T19:00:19.808754639Z
message: shutting down server
2025-05-23T19:00:19.808757809Z
message: Wait completed, proceeding to shutdown the manager
2025-05-23T19:01:06.585749008Z
message: initializing service controller
2025-05-23T19:01:06.685908336Z
message: Waited for the caches to sync
2025-05-23T19:01:06.846057805Z
message: starting manager
2025-05-23T19:01:06.846070712Z
message: Starting metrics server
2025-05-23T19:01:06.846074574Z
message: Serving metrics server
2025-05-23T19:01:06.846080582Z
message: starting server
2025-05-23T19:01:06.846084893Z
message: Starting EventSource
2025-05-23T19:01:06.846088548Z
message: Starting Controller
2025-05-23T19:01:06.846091597Z
message: Starting EventSource
2025-05-23T19:01:06.846094416Z
message: Starting Controller
2025-05-23T19:01:06.846097980Z
message: Starting EventSource
2025-05-23T19:01:06.846108437Z
message: Starting Controller
2025-05-23T19:01:06.846111350Z
message: Starting EventSource
2025-05-23T19:01:06.846114339Z
message: Starting Controller
2025-05-23T19:01:06.846118235Z
message: Starting EventSource
2025-05-23T19:01:06.846121149Z
message: Starting Controller
2025-05-23T19:01:06.846123905Z
message: Starting EventSource
2025-05-23T19:01:06.846126881Z
message: Starting Controller
2025-05-23T19:01:06.846130687Z
message: Starting EventSource
2025-05-23T19:01:06.846140248Z
message: Starting Controller
2025-05-23T19:01:06.846143327Z
message: Starting EventSource
2025-05-23T19:01:06.846146197Z
message: Starting Controller
2025-05-23T19:01:06.846152109Z
message: Starting EventSource
2025-05-23T19:01:06.846155018Z
message: Starting Controller
2025-05-23T19:01:06.846158029Z
message: Starting EventSource
2025-05-23T19:01:06.846161087Z
message: Starting Controller
2025-05-23T19:01:06.846163992Z
message: Starting EventSource
2025-05-23T19:01:06.846166777Z
message: Starting Controller
2025-05-23T19:01:06.846169552Z
message: Starting EventSource
2025-05-23T19:01:06.846172291Z
message: Starting Controller
2025-05-23T19:01:06.846175026Z
message: Starting EventSource
2025-05-23T19:01:06.846177875Z
message: Starting Controller
2025-05-23T19:01:06.846180581Z
message: Starting EventSource
2025-05-23T19:01:06.846183370Z
message: Starting Controller
2025-05-23T19:01:07.099318069Z
message: Starting workers
2025-05-23T19:01:07.099337153Z
message: Starting workers
2025-05-23T19:01:07.105984047Z
message: Starting workers
2025-05-23T19:01:07.105994587Z
message: Starting workers
2025-05-23T19:01:07.105998261Z
message: Starting workers
2025-05-23T19:01:07.106001684Z
message: Starting workers
2025-05-23T19:01:07.106004811Z
message: Starting workers
2025-05-23T19:01:07.106008222Z
message: Starting workers
2025-05-23T19:01:07.106011207Z
message: Starting workers
2025-05-23T19:01:07.106014007Z
message: Starting workers
2025-05-23T19:01:07.106018778Z
message: Starting workers
2025-05-23T19:01:07.106021590Z
message: Starting workers
2025-05-23T19:01:07.150131516Z
message: Starting workers
2025-05-23T19:01:07.199319848Z
message: Starting workers
2025-05-23T19:01:13.247343524Z
error: EntityAlreadyExists: A policy called <redacted>-policy already exists. Duplicate names are not allowed.
status code: 409, request id: <some id>
message: Reconciler error
stacktrace: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224
2025-05-23T19:01:19.492725227Z
error: EntityAlreadyExists: A policy called <redacted>-policy already exists. Duplicate names are not allowed.
status code: 409, request id: <some id>
message: Reconciler error
stacktrace: sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224
< .. the error above continues forever ...>
Steps to reproduce
This seems to be tricky but the attempt to create the resource must coincide with the pod restarts/cancelled context, so:
- create a resource
- restart the iam controller pod and somehow make sure that the reconciliation attempt gets aborted / context gets cancelled.
Expected outcome
IAM controller either doesn't create the resource in AWS at all and instead reconciles it after the pod restart, or manages the full resources reconcilication before shutting down, i.e. sets the correct status and ARN before shutting down.
Environment
-
ACK iam-controller version:
1.3.13 -
Kubernetes version:
v1.30.11-eks-bcf3d70 -
Using EKS (yes/no), if so version?
v1.30.11-eks-bcf3d70 -
AWS service targeted (S3, RDS, etc.):
IAM