1.14.2 causing container instances to grind to a halt #833
Comments
Our first clue is that, for all of the affected tasks, we see "Pulling container ... (NONE->RUNNING) serially" log lines but no "Finished pulling container" log lines. This means that we're being blocked somewhere in this method: amazon-ecs-agent/agent/engine/docker_task_engine.go, lines 512 to 524 (commit 35acca3).
Specifically, we're blocked on a call inside that method. Fortunately, restarting the ECS agent appears to fix the issue (tasks go from PENDING to RUNNING successfully), but it will likely just crop up again because it's a race / deadlock.
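As an illustration of why a single stuck pull wedges the whole instance, here is a minimal sketch with hypothetical names (not the agent's actual code): when pulls are serialized behind one lock, the first goroutine that blocks while holding it keeps every later task from ever logging "Finished pulling container".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pullLock serializes image pulls, mirroring the "pulling ... serially"
// behavior described above (hypothetical sketch, not the agent's code).
var pullLock sync.Mutex

func pullContainer(task string, stuck bool) {
	pullLock.Lock()
	defer pullLock.Unlock()
	fmt.Printf("Pulling container for %s (NONE->RUNNING) serially\n", task)
	if stuck {
		select {} // stand-in for blocking forever on another lock
	}
	fmt.Printf("Finished pulling container for %s\n", task)
}

func main() {
	go pullContainer("task-1", true) // never finishes, never releases pullLock
	time.Sleep(100 * time.Millisecond)
	go pullContainer("task-2", false) // queued behind task-1 forever
	time.Sleep(time.Second)
	fmt.Println("task-2 stays PENDING: its pull never even starts")
}
```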
(I'm investigating this issue with @tysonmote.) See the stack trace of one of our (many) ECS agents facing the issue here.
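For reference, a full goroutine dump like the one attached can be produced from any Go program; a generic, minimal sketch (not agent-specific code):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Print the stack of every goroutine (debug level 2 gives the
	// panic-style format seen in such dumps). Sending SIGQUIT to a Go
	// process that doesn't trap it produces a similar trace.
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```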
Thanks for sending us the report and stack traces. It's interesting that the agent is reporting that it's pulling serially. What OS/AMI are you using? What Docker version? If you're able to run the ECS log collector and send me the resulting archive, we may be able to learn more from the logs. You can send me the archive directly at nmeyerha at amazon.com. We'll also dig into your stack trace, which may provide enough detail on its own.
Here are some details about the machines we're running:
We're using AMI
Thanks!
(Unfortunately, CoreOS is not supported by the log collector.) Edit: my bad, there was previously an operating system check that would fail.
Thanks, I've received the logs.
@tejasmanohar I was able to run it. I've sent that log archive as well.
I just realized that I had a typo in the original issue message that said the problem was in 1.14.1, not 1.14.2. I fixed that.
Hi @tysonmote and @tejasmanohar,
We've confirmed that a race condition exists in the ECS agent that can lead to a deadlock. The specific race condition involves two locks that both relate to images. We're working on addressing the lock acquisition order mismatch now and will keep this issue open until that fix is released.
Thanks,
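To illustrate the class of bug being described (with hypothetical lock and function names; the real locks live in the agent's image-management code), two code paths that take the same pair of mutexes in opposite orders can each end up waiting on the other forever:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the two image-related locks described above.
var (
	lockA sync.Mutex // e.g. held while recording an image reference
	lockB sync.Mutex // e.g. held while cleaning up unused images
)

func pullPath() {
	lockA.Lock()
	defer lockA.Unlock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	lockB.Lock()                      // waits forever for cleanupPath
	defer lockB.Unlock()
}

func cleanupPath() {
	lockB.Lock()
	defer lockB.Unlock()
	time.Sleep(10 * time.Millisecond)
	lockA.Lock() // waits forever for pullPath: classic lock-order deadlock
	defer lockA.Unlock()
}

func main() {
	go pullPath()
	go cleanupPath()
	time.Sleep(time.Second)
	fmt.Println("both goroutines are blocked; the fix is a single, consistent lock order")
}
```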
Correction: This race condition was introduced in 1.14.2. 1.14.1 does not have this deadlock, but does have a potential panic as described in #707.
@samuelkarp Got it, thanks! We'll stay on 1.14.1 for now. FWIW, if this is not going to be fixed ASAP, I'd recommend reverting the code that introduced the deadlock and re-releasing, because a panic (from which the agent is likely restarted by a process manager) is much, much better than a mysterious deadlock. Also, #834 may be helpful.
@tejasmanohar Yep, we're exploring both options right now.
Good to hear. Thanks for acting quickly, @samuelkarp & team!
Hello @samuelkarp, any ETA for when that fix will be available and a new version released? Thanks!
@sergiofigueras We already have a fix merged on the |
I am still getting 1.14.2 while fetching latest.
@optimisticanshul You would need to re-pull it.
@nehalrp It is a new EC2 instance.
@optimisticanshul Me too, so I've added a workaround to my user data that avoids automatically updating the ecs-agent.
We have tagged a new release of the agent. If you pull the latest agent image, you should receive the fix.
I'm also still getting v1.14.2 when I autoscale using the designated newest ECS AMI (ami-62745007 in us-east-2). Our ECS clusters autoscale all the time, so we're not pulling the agent container manually ourselves. When the AMI starts, ecs/ecs-init.log shows the agent still coming up at v1.14.2. I guess whatever the AMI does isn't just "pull latest."
That's correct. Each of the ECS-optimized AMIs has a specific version of the agent that is pre-cached. Similarly, when you install ecs-init on a regular Amazon Linux instance, it'll grab a specific version of the agent rather than "latest". |
Right... But on |
We have now published a new AMI that includes the latest agent release. The new AMIs can be found here.
@vsiddharth @samuelkarp Even on 1.14.3, my tasks are stuck in the PENDING state for hours.
@optimisticanshul The issue discovered with agent version 1.14.2 may be different from what you're seeing. Please have a look here for the logs we require.
Definitely. Where do I send them?
Steps to reproduce:
In the ECS agent log:
But the output of docker ps:
New tasks are stuck in the PENDING state. If I manually run docker stop 274ba55bfca5, it deploys new tasks. I am not sure whether it's related to this issue or something else.
@optimisticanshul you could send them over to sidvin at amazon.com. |
1.14.2, which was just released a few hours ago, appears to be causing major issues in one of our ECS clusters. We have a large number of instances running 1.14.1 and 1.14.2. Many of the instances running 1.14.2 are exhibiting an issue where the instance accepts as many tasks as possible but is never able to transition them from the "PENDING" state to the "RUNNING" state.
We've opened an AWS Support ticket as well but I wanted to post this here in case anyone else is encountering the same issue. We're digging through logs to see what's going on now.