
1.14.2 causing container instances to grind to a halt #833

Closed

tysonmote opened this issue Jun 7, 2017 · 30 comments

Comments

@tysonmote commented Jun 7, 2017

1.14.2, which was just released a few hours ago, appears to be causing major issues in one of our ECS clusters. We have a large number of instances running 1.14.1 and 1.14.2. Many of the instances running 1.14.2 are exhibiting an issue where the instance accepts as many tasks as possible but is never able to transition them from the "PENDING" state to the "RUNNING" state.

We've opened an AWS Support ticket as well but I wanted to post this here in case anyone else is encountering the same issue. We're digging through logs to see what's going on now.

@tysonmote (Author) commented Jun 8, 2017

Our first clue is that we see "Pulling container ... (NONE->RUNNING) serially" log lines but not "Finished pulling container" log lines for all of the affected tasks. This means that we're being blocked somewhere in this method:

func (engine *DockerTaskEngine) serialPull(task *api.Task, container *api.Container) DockerContainerMetadata {
    seelog.Debugf("Attempting to obtain ImagePullDeleteLock to pull image - %s. Task: %v", container.Image, task)
    ImagePullDeleteLock.Lock()
    seelog.Debugf("Acquired ImagePullDeleteLock, start pulling image - %s. Task: %v", container.Image, task)
    defer seelog.Debugf("Released ImagePullDeleteLock after pulling image - %s. Task: %v", container.Image, task)
    defer ImagePullDeleteLock.Unlock()
    pullStart := time.Now()
    defer func(startTime time.Time) {
        seelog.Infof("Finished pulling container %v in %s. Task: %v", container.Image, time.Since(startTime).String(), task)
    }(pullStart)
    return engine.pullAndUpdateContainerReference(task, container)
}

Specifically, we're blocked on ImagePullDeleteLock.Lock(). To confirm this, we killed the ECS agent with the ABRT signal to get a full dump of all goroutines, which showed that we were blocked on that lock.
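
As an aside, the same goroutine stacks can also be captured without killing the process. A minimal illustrative sketch using Go's standard runtime/pprof package (this is not code from the agent; the choice of SIGUSR1 is arbitrary):

package main

import (
    "os"
    "os/signal"
    "runtime/pprof"
    "syscall"
    "time"
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGUSR1)
    go func() {
        for range sigs {
            // debug=2 prints a full stack trace for every goroutine, which
            // shows any goroutine parked in sync.(*Mutex).Lock.
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
        }
    }()
    time.Sleep(time.Hour) // stand-in for the process's real work
}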

Fortunately restarting the ECS agent appears to fix the issue (tasks go from PENDING to RUNNING successfully), but the issue will likely just crop up again because it's a race / deadlock.

@tejasmanohar

(I'm investigating this issue with @tysonmote)

See the stack of one of our (many) ECS agents facing the issue here.

@nmeyerhans (Contributor)

Thanks for sending us the report and stacktraces. It's interesting that the agent is reporting that it's pulling serially. What OS/AMI are you using? What Docker version?

If you're able to run the ECS log collector and send me the resulting archive, we may be able to learn more from the logs. You can send me the archive directly at nmeyerha at amazon.com. We'll also dig into your stacktrace, which may provide enough detail on its own.

@tysonmote (Author)

Here are some details about the machines we're running:

Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.10.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.308 GiB
...

We're using AMI CoreOS-stable-1068.10.0-hvm (ami-7d11c51d). I'll email you the logs from one of our instances that ran into the problem as well.

Thanks!

@tejasmanohar commented Jun 8, 2017

(Unfortunately, CoreOS is not supported by the log collector.)

Edit: my bad, there was previously an operating system check that would fail.

@nmeyerhans (Contributor)

Thanks, I've received the logs.

@tysonmote (Author)

@tejasmanohar I was able to run it. I've sent that log archive as well.

@tysonmote (Author)

I just realized that I had a typo in the original issue message that said the problem was in 1.14.1, not 1.14.2. I fixed that.

@samuelkarp (Contributor) commented Jun 8, 2017

Hi @tysonmote and @tejasmanohar,

We've confirmed that a race condition exists in the ECS agent that can lead to a deadlock. We believe that the race condition is more likely in 1.14.2, but exists in 1.14.1 as well. We're actively working to fix this race condition and deadlock.

The specific race condition involves two locks that both relate to images: the ImagePullDeleteLock and imageManager.updateLock. The ImagePullDeleteLock is used for serializing image pulls (on older versions of Docker) and for ensuring that we don't delete an image at the same time we're pulling it (this is important for the task launch workflow, since we don't want to successfully pull the image and then fail to create the container because the image has gone missing between when the pull finished and container creation started). The imageManager.updateLock is used for synchronization of the internal data structure of the imageManager, which tracks images and selects images for deletion.

When we're pulling an image, the ImagePullDeleteLock is acquired first for the actual pull and then the imageManager.updateLock is acquired when recording the metadata about the image. When we're deleting an image, the locks are acquired in the reverse order. This mismatch in lock acquisition order can create a deadlock: one goroutine can hold ImagePullDeleteLock and block while trying to acquire imageManager.updateLock, while another goroutine holds imageManager.updateLock and blocks while trying to acquire ImagePullDeleteLock.

This became significantly more likely to occur in 1.14.2 because the scope of imageManager.updateLock was modified to fix a different race condition involving concurrent writes to the imageManager's internal data structure (see #707 and #743).
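
Sketched minimally and outside the agent's codebase, the problematic pattern looks like this (a self-contained Go illustration; the variable names only mirror the two locks involved):

package main

import (
    "sync"
    "time"
)

var (
    imagePullDeleteLock sync.Mutex // stands in for ImagePullDeleteLock
    updateLock          sync.Mutex // stands in for imageManager.updateLock
    wg                  sync.WaitGroup
)

// Pull path: the outer image lock first, then the metadata lock.
func pullImage() {
    defer wg.Done()
    imagePullDeleteLock.Lock()
    defer imagePullDeleteLock.Unlock()
    time.Sleep(10 * time.Millisecond) // widen the race window for the demo
    updateLock.Lock()                 // blocks if deleteImage already holds it
    defer updateLock.Unlock()
}

// Delete path: the metadata lock first, then the outer image lock -- the reverse order.
func deleteImage() {
    defer wg.Done()
    updateLock.Lock()
    defer updateLock.Unlock()
    time.Sleep(10 * time.Millisecond)
    imagePullDeleteLock.Lock() // blocks if pullImage already holds it -> deadlock
    defer imagePullDeleteLock.Unlock()
}

func main() {
    wg.Add(2)
    go pullImage()
    go deleteImage()
    wg.Wait() // both goroutines block forever; the runtime reports "all goroutines are asleep - deadlock!"
}

A common remedy for this class of bug is to make every code path acquire the two locks in the same order, or to narrow one lock's scope so the two are never held at the same time.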

We're working on addressing the lock acquisition order mismatch now and will keep this issue open until that fix is released.

Thanks,
Sam

@samuelkarp (Contributor)

Correction: This race condition was introduced in 1.14.2. 1.14.1 does not have this deadlock, but does have a potential panic as described in #707.

@tejasmanohar commented Jun 8, 2017

@samuelkarp Got it, thanks! We'll stay on 1.14.1 for now. FWIW, if this is not going to be fixed ASAP, I'd recommend reverting the code that introduced the deadlock and re-releasing, because a panic (from which a process manager will likely restart the agent) is much, much better than a mysterious deadlock. Also, #834 may be helpful.

@samuelkarp (Contributor)

@tejasmanohar Yep, we're exploring both options right now.

@samuelkarp (Contributor)

amazon/amazon-ecs-agent:latest on Docker Hub now points to 1.14.1 again. We have a fix forward in progress here: #836.

@tejasmanohar

Good to hear. Thanks for acting quickly, @samuelkarp & team!

@sergiofigueras

Hello @samuelkarp,

Is there an ETA for when the fix will be available and a new version released?

Thanks!

@vsiddharth (Contributor)

@sergiofigueras We already have a fix merged on the dev branch.
The release process is under way and the new version should be out as soon as our testing and release process is complete.

@sharmaansh21

I am still getting 1.14.2 while fetching latest

[ec2-user@ip-172-31-91-192 ~]$ docker run amazon/amazon-ecs-agent:latest version
2017-06-12T23:59:05Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.2 (35acca3)

@nehalrp commented Jun 13, 2017

@optimisticanshul You would need to re-pull it:

ec2-user@ip-172-30-83-142 ~ » docker pull amazon/amazon-ecs-agent:latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest
ec2-user@ip-172-30-83-142 ~ » docker run -it amazon/amazon-ecs-agent:latest                                                                                          
2017-06-13T00:01:46Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

@sharmaansh21

@nehalrp It is a new EC2 instance.

@masutaka

@optimisticanshul

I am still getting 1.14.2 while fetching latest

Me too, so I've added the following workaround to our user data:

ECS_AGENT_VERSION=v1.14.1
docker pull amazon/amazon-ecs-agent:$ECS_AGENT_VERSION
docker tag amazon/amazon-ecs-agent:$ECS_AGENT_VERSION amazon/amazon-ecs-agent:latest

This way I avoid automatically updating the ecs-agent.

@vsiddharth (Contributor)

We have tagged v1.14.1 as latest on Docker Hub.

If you pull amazon/amazon-ecs-agent:latest, you should get v1.14.1:

docker pull amazon/amazon-ecs-agent                                                                                                                                                                                                    
Using default tag: latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest

docker run amazon/amazon-ecs-agent                                                                                                                                                                                                     
2017-06-13T15:57:32Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

@ernestm commented Jun 13, 2017

I'm also still getting v1.14.2 when I autoscale using the designated newest ECS AMI (ami-62745007 in us-east-2). Our ECS clusters autoscale all the time so we're not pulling the agent container manually ourselves. When the AMI starts, in ecs/ecs-init.log:

[root@ip-10-0-86-11 ecs]# more ecs-init.log
2017-06-13T14:30:03Z [INFO] pre-start
2017-06-13T14:30:03Z [INFO] Downloading Amazon EC2 Container Service Agent
2017-06-13T14:30:03Z [DEBUG] Downloading published md5sum from https://s3.amazonaws.com/amazon-ecs-agent/ecs-agent-v1.14.2.tar.md5

I guess whatever the AMI does isn't just "pull latest."

@samuelkarp (Contributor)

I guess whatever the AMI does isn't just "pull latest."

That's correct. Each of the ECS-optimized AMIs has a specific version of the agent that is pre-cached. Similarly, when you install ecs-init on a regular Amazon Linux instance, it'll grab a specific version of the agent rather than "latest".

MiguelMoll pushed a commit to convox/rack that referenced this issue Jun 13, 2017
@ernestm commented Jun 13, 2017

Right... But on
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI_launch_latest.html and http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html it says this AMI should be getting me 1.14.1, and it's not, so I'm not sure, as an ECS AMI consumer, what I'm supposed to do to downgrade agent versions. Though maybe some of these recent commits are for that?

@vsiddharth (Contributor)

We have now published a new AMI that includes the latest agent release.
The latest agent includes fixes to resolve the reported deadlock.

The new AMIs can be found here

@sharmaansh21 commented Jun 29, 2017

@vsiddharth @samuelkarp Even on 1.14.3, my tasks are stuck in the PENDING state for hours.

@vsiddharth (Contributor)

@optimisticanshul The issue discovered with agent version v1.14.2 should be mitigated by v1.14.3. Would it be possible for you to share the logs for further investigation?

Please have a look here for the logs we require.

@sharmaansh21

Definitely. Where do I send them?

@sharmaansh21 commented Jun 30, 2017

Steps to reproduce:

  1. ECS Cluster 2 instances.
  2. Deploy a sample service.
  3. Redeploy

In the ECS agent log:

./ecs-agent.log.2017-06-30-03:2017-06-30T03:55:43Z [INFO] Stopping container module="TaskEngine" task="prod-whatever:108 arn:aws:ecs:us-east-1:ACCOUNT-ID:task/1730060e-66e6-4795-ac20-551f4d09dd7a, Status: (RUNNING->STOPPED) Containers: [prod-whatever (RUNNING->STOPPED),]" container="prod-whatever(ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME) (RUNNING->STOPPED)"

But the output of docker ps shows the container is still running:

274ba55bfca5        ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME   "/bin/sh -c 'bundl..."   25 minutes ago      Up 25 minutes       80/tcp, 0.0.0.0:10283->8080/tcp   ecs-prod-whatever-108-prod-whatever-feb2cdae8894f389d001

New tasks are stuck in the PENDING state. If I manually run docker stop 274ba55bfca5, the new tasks deploy.

I am not sure whether it's related to this issue or something else.

@vsiddharth (Contributor)

@optimisticanshul You can send them over to sidvin at amazon.com.
