
ThrottlingException #498

Closed
bobzoller opened this issue Aug 19, 2016 · 5 comments

@bobzoller

I have a scheduled Lambda that creates a Task daily at 5am PST. Most days, the ecs-agent attempts to pull the image but fails with a ThrottlingException. Occasionally, it succeeds.

When it fails, since this is a one-off task, it means my job does not try again until 24 hours later.

It seems like the ecs-agent could afford to retry the image pull more times in this situation. I'm not well versed in this codebase, but it looks like the agent currently uses the default retryer (e.g. 3 attempts).

2016-08-18T12:00:03Z [INFO] Error while pulling container; will try to run anyways module="TaskEngine" task="convox-it-accounts-api-run:12 arn:aws:ecs:us-east-1:XXX:task/8b258c32-d1f1-47af-bc0c-311811a70307, Status: (NONE->RUNNING) Containers: [run (PULLED->RUNNING),]" err="ThrottlingException: Rate exceeded
2016-08-18T12:00:03Z [INFO] Adding event module="eventhandler" change="ContainerChange: arn:aws:ecs:us-east-1:XXX:task/8b258c32-d1f1-47af-bc0c-311811a70307 run -> STOPPED, Reason CannotPullECRContainerError: ThrottlingException: Rate exceeded
2016-08-18T12:00:03Z [INFO] Sending container change module="eventhandler" event="ContainerChange: arn:aws:ecs:us-east-1:XXX:task/8b258c32-d1f1-47af-bc0c-311811a70307 run -> STOPPED, Reason CannotPullECRContainerError: ThrottlingException: Rate exceeded
status code: 400, request id: 4bf71aa8-653b-11e6-8de5-4b00e2e3d288, Known Sent: NONE" change="ContainerChange: arn:aws:ecs:us-east-1:XXX:task/8b258c32-d1f1-47af-bc0c-311811a70307 run -> STOPPED, Reason CannotPullECRContainerError: ThrottlingException: Rate exceeded
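For context, the suggestion above amounts to wrapping the pull in a more generous capped-exponential-backoff loop than the SDK default retryer allows. A minimal sketch of that idea (Python for illustration only; the agent itself is written in Go, and `pull_image` is a hypothetical stand-in for the ECR pull call):

```python
import random
import time


def pull_with_retry(pull_image, max_attempts=8, base_delay=0.25, max_delay=30.0):
    """Retry a throttled image pull with capped exponential backoff.

    pull_image   -- zero-argument callable standing in for the ECR pull
    max_attempts -- 8 here, versus the SDK default of roughly 3
    """
    for attempt in range(max_attempts):
        try:
            return pull_image()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Double the delay each attempt, capped, with random jitter
            # so many instances don't retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

This is a sketch of the technique, not the agent's actual retry code.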
@jbergknoff

We're also experiencing ThrottlingException errors when pulling images from ECR when the cluster gets a burst of scheduled work. Just now I had a container take about 5 minutes to start, but I'm not sure whether the ECR rate limiting is related.

Agent logs

2016-08-23T16:00:05Z [INFO] Pulling container module="TaskEngine" task="<taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (NONE->RUNNING) Containers: [<taskname> (NONE->RUNNING),]" container="<taskname>(<ECR URL>) (NONE->RUNNING)"
2016-08-23T16:00:05Z [INFO] Error transitioning container module="TaskEngine" task="<taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (NONE->RUNNING) Containers: [<taskname> (NONE->RUNNING),]" container="<taskname>(<ECR URL>) (NONE->RUNNING)" state="PULLED"
2016-08-23T16:00:05Z [INFO] Error while pulling container; will try to run anyways module="TaskEngine" task="<taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (NONE->RUNNING) Containers: [<taskname> (PULLED->RUNNING),]" err="ThrottlingException: Rate exceeded
2016-08-23T16:00:05Z [INFO] Creating container module="TaskEngine" task="<taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (NONE->RUNNING) Containers: [<taskname> (PULLED->RUNNING),]" container="<taskname>(<ECR URL>) (PULLED->RUNNING)"
2016-08-23T16:00:05Z [INFO] Created container name mapping for task <taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (NONE->RUNNING) Containers: [<taskname> (PULLED->RUNNING),] - <taskname>(<ECR URL>) (PULLED->RUNNING) -> ecs-<taskname>-27-<taskname>-d68c9afab1f38ec8d601
2016-08-23T16:00:11Z [INFO] Created docker container for task <taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (CREATED->RUNNING) Containers: [<taskname> (CREATED->RUNNING),]: <taskname>(<ECR URL>) (CREATED->RUNNING) -> 3e198d5694219657b83fb773d7a36c328ef2293745b16c7bdedece557adc2f12
2016-08-23T16:00:11Z [INFO] Starting container module="TaskEngine" task="<taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (CREATED->RUNNING) Containers: [<taskname> (CREATED->RUNNING),]" container="<taskname>(<ECR URL>) (CREATED->RUNNING)"
2016-08-23T16:00:11Z [INFO] Redundant container state change for task <taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (CREATED->RUNNING) Containers: [<taskname> (CREATED->RUNNING),]: <taskname>(<ECR URL>) (CREATED->RUNNING) to CREATED, but already CREATED
2016-08-23T16:00:15Z [INFO] Task change event module="TaskEngine" event="{TaskArn:arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b Status:RUNNING Reason: SentStatus:NONE}"
2016-08-23T16:00:15Z [INFO] Adding event module="eventhandler" change="ContainerChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b <taskname> -> RUNNING, Reason CannotPullContainerError: ThrottlingException: Rate exceeded
2016-08-23T16:00:15Z [INFO] Adding event module="eventhandler" change="TaskChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b -> RUNNING, Known Sent: NONE"
2016-08-23T16:00:15Z [INFO] Sending container change module="eventhandler" event="ContainerChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b <taskname> -> RUNNING, Reason CannotPullContainerError: ThrottlingException: Rate exceeded
 status code: 400, request id: a88e19dc-694a-11e6-9c01-fd995f11b465, Known Sent: NONE" change="ContainerChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b <taskname> -> RUNNING, Reason CannotPullContainerError: ThrottlingException: Rate exceeded
2016-08-23T16:00:15Z [INFO] Redundant container state change for task <taskname>:27 arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b, Status: (RUNNING->RUNNING) Containers: [<taskname> (RUNNING->RUNNING),]: <taskname>(<ECR URL>) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-08-23T16:00:15Z [INFO] Sending task change module="eventhandler" event="TaskChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b -> RUNNING, Known Sent: NONE" change="TaskChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b -> RUNNING, Known Sent: NONE"
2016-08-23T16:00:55Z [WARN] Error retrieving stats for container 3e198d5694219657b83fb773d7a36c328ef2293745b16c7bdedece557adc2f12: inactivity time exceeded timeout
2016-08-23T16:02:35Z [WARN] Error retrieving stats for container 3e198d5694219657b83fb773d7a36c328ef2293745b16c7bdedece557adc2f12: inactivity time exceeded timeout
2016-08-23T16:02:40Z [WARN] Error retrieving stats for container 3e198d5694219657b83fb773d7a36c328ef2293745b16c7bdedece557adc2f12: inactivity time exceeded timeout
2016-08-23T16:04:56Z [INFO] Task change event module="TaskEngine" event="{TaskArn:arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b Status:STOPPED Reason: SentStatus:RUNNING}"
2016-08-23T16:04:56Z [INFO] Adding event module="eventhandler" change="ContainerChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b <taskname> -> STOPPED, Exit 0, , Reason CannotPullContainerError: ThrottlingException: Rate exceeded
2016-08-23T16:04:56Z [INFO] Adding event module="eventhandler" change="TaskChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b -> STOPPED, Known Sent: RUNNING"
2016-08-23T16:04:56Z [INFO] Sending container change module="eventhandler" event="ContainerChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b <taskname> -> STOPPED, Exit 0, , Reason CannotPullContainerError: ThrottlingException: Rate exceeded
 status code: 400, request id: a88e19dc-694a-11e6-9c01-fd995f11b465, Known Sent: RUNNING" change="ContainerChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b <taskname> -> STOPPED, Exit 0, , Reason CannotPullContainerError: ThrottlingException: Rate exceeded
2016-08-23T16:04:56Z [INFO] Sending task change module="eventhandler" event="TaskChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b -> STOPPED, Known Sent: RUNNING" change="TaskChange: arn:aws:ecs:us-east-1:<account>:task/b1edeeba-b2dc-4d89-94b4-705e41750d3b -> STOPPED, Known Sent: RUNNING"
2016-08-23T16:04:56Z [WARN] Error retrieving stats for container 3e198d5694219657b83fb773d7a36c328ef2293745b16c7bdedece557adc2f12: io: read/write on closed pipe

The container actually started running at 16:04:51 and then exited cleanly a few seconds afterwards (though the container didn't get cleaned up until several minutes later). I wonder if the instance's CPU was just saturated. Thoughts?

@bobzoller
Author

FWIW, response from the ECS team on an AWS support ticket:

"2016-08-25T12:00:04Z [INFO] Adding event module="eventhandler" change="ContainerChange: arn:aws:ecs:us-east-1:XXX:task/2b5f1779-0765-4d87-8f11-5ee8e29f657a run -> STOPPED, Reason CannotPullECRContainerError: ThrottlingException: Rate exceeded
status code: 400, request id: 75567eb3-6abb-11e6-9d4c-cb4ccd1cee97, Known Sent: NONE"

is related to the GetAuthorizationToken call in the ECS agent. The current TPS limit is 1 TPS with a 30 TPS burst for fetching a new token. This isn't a limit that can be changed at the moment, but we're looking into making improvements so we can increase it, as well as caching tokens in the ECS agent.

For the moment, we would recommend adding backoff and retry to the RunTask call from your Lambda.
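A minimal sketch of that recommendation, assuming the Lambda uses boto3 (the wrapper itself is generic: `run_task` is any zero-argument callable, e.g. a lambda around `ecs_client.run_task(...)`, and the throttle check mirrors how botocore exposes error codes on a `ClientError`'s `response` attribute):

```python
import random
import time


def run_task_with_backoff(run_task, is_throttled, max_attempts=5, base_delay=1.0):
    """Retry a RunTask-style call with exponential backoff and full jitter.

    run_task     -- zero-argument callable wrapping the API call
    is_throttled -- predicate deciding whether an exception is a throttle
    """
    for attempt in range(max_attempts):
        try:
            return run_task()
        except Exception as exc:
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise  # non-throttle error, or out of attempts
            # Full jitter: sleep anywhere up to base * 2^attempt seconds.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


def is_aws_throttle(exc):
    """Throttle check matching the shape of a botocore ClientError."""
    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
    return code in ("ThrottlingException", "Throttling", "TooManyRequestsException")
```

Usage from the scheduled Lambda would then look something like `run_task_with_backoff(lambda: ecs.run_task(cluster="...", taskDefinition="..."), is_aws_throttle)`, so a one-off daily task isn't lost to a single throttled call.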

@kenny-house

We've started seeing these errors very frequently now, whereas I don't believe we had ever experienced them previously. I recently changed our cluster to use the most recent ECS-optimized AMI with the 1.12.1 agent (I believe we were a few versions back previously).

Even launching tasks with 2 containers will frequently fail, as one of the container images fails to pull.

@bobzoller
Author

FYI it looks like there's a potential fix coming down the pike in #523

@kiranmeduri
Contributor

This issue should be fixed by #523 when using agent v1.13.0. Please re-open if you are still seeing this. Thanks!

fierlion pushed a commit to fierlion/amazon-ecs-agent that referenced this issue May 31, 2022
…h0 if none can be found (aws#498)

check ipv4 routes for default network interface, only fall back to eth0 if none can be found
rsheik29 pushed a commit to rsheik29/amazon-ecs-agent that referenced this issue Jul 11, 2022
…h0 if none can be found (aws#498)

check ipv4 routes for default network interface, only fall back to eth0 if none can be found