
1.14.2 causing container instances to grind to a halt #833

Closed

tysonmote opened this issue Jun 7, 2017 · 30 comments

Comments

@tysonmote commented Jun 7, 2017

1.14.2, which was just released a few hours ago, appears to be causing major issues in one of our ECS clusters. We have a large number of instances running 1.14.1 and 1.14.2. Many of the instances running 1.14.2 are exhibiting an issue where the instance accepts as many tasks as possible but is never able to transition them from the "PENDING" state to the "RUNNING" state.

We've opened an AWS Support ticket as well but I wanted to post this here in case anyone else is encountering the same issue. We're digging through logs to see what's going on now.

@tysonmote (Author) commented Jun 8, 2017

Our first clue is that we see "Pulling container ... (NONE->RUNNING) serially" log lines but not "Finished pulling container" log lines for all of the affected tasks. This means that we're being blocked somewhere in this method:

func (engine *DockerTaskEngine) serialPull(task *api.Task, container *api.Container) DockerContainerMetadata {
    seelog.Debugf("Attempting to obtain ImagePullDeleteLock to pull image - %s. Task: %v", container.Image, task)
    ImagePullDeleteLock.Lock()
    seelog.Debugf("Acquired ImagePullDeleteLock, start pulling image - %s. Task: %v", container.Image, task)
    defer seelog.Debugf("Released ImagePullDeleteLock after pulling image - %s. Task: %v", container.Image, task)
    defer ImagePullDeleteLock.Unlock()
    pullStart := time.Now()
    defer func(startTime time.Time) {
        seelog.Infof("Finished pulling container %v in %s. Task: %v", container.Image, time.Since(startTime).String(), task)
    }(pullStart)
    return engine.pullAndUpdateContainerReference(task, container)
}

Specifically, we're blocked on ImagePullDeleteLock.Lock(). To confirm this, we killed the ECS agent with the ABRT signal to get a full dump of all goroutines, which showed that we were blocked on that lock.
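
As an aside, the same goroutine stacks can also be captured without killing the process. A minimal illustrative sketch using Go's standard runtime/pprof package (this is not code from the agent; the choice of SIGUSR1 is arbitrary):

package main

import (
    "os"
    "os/signal"
    "runtime/pprof"
    "syscall"
    "time"
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGUSR1)
    go func() {
        for range sigs {
            // debug=2 prints a full stack trace for every goroutine, which
            // shows any goroutine parked in sync.(*Mutex).Lock.
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
        }
    }()
    time.Sleep(time.Hour) // stand-in for the process's real work
}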

Fortunately restarting the ECS agent appears to fix the issue (tasks go from PENDING to RUNNING successfully), but the issue will likely just crop up again because it's a race / deadlock.

@tejasmanohar

(I'm investigating this issue with @tysonmote)

See the stack of one of our (many) ECS agents facing the issue here.

@nmeyerhans (Contributor)

Thanks for sending us the report and stacktraces. It's interesting that the agent is reporting that it's pulling serially. What OS/AMI are you using? What Docker version?

If you're able to run the ECS log collector and send me the resulting archive, we may be able to learn more from the logs. You can send me the archive directly at nmeyerha at amazon.com. We'll also dig into your stacktrace, which may provide enough detail on its own.

@tysonmote (Author)

Here are some details about the machines we're running:

Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.10.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.308 GiB
...

We're using AMI CoreOS-stable-1068.10.0-hvm (ami-7d11c51d). I'll email you the logs from one of our instances that ran into the problem as well.

Thanks!

@tejasmanohar commented Jun 8, 2017

(Unfortunately, CoreOS is not supported by the log collector.)

Edit: my bad, there was previously an operating system check that would fail.

@nmeyerhans (Contributor)

Thanks, I've received the logs.

@tysonmote (Author)

@tejasmanohar I was able to run it. I've sent that log archive as well.

@tysonmote (Author)

I just realized that I had a typo in the original issue message that said the problem was in 1.14.1, not 1.14.2. I fixed that.

@samuelkarp (Contributor) commented Jun 8, 2017

Hi @tysonmote and @tejasmanohar,

We've confirmed that a race condition exists in the ECS agent that can lead to a deadlock. We believe that the race condition is more likely in 1.14.2, but exists in 1.14.1 as well. We're actively working to fix this race condition and deadlock.

The specific race condition involves two locks that both relate to images: the ImagePullDeleteLock and imageManager.updateLock. The ImagePullDeleteLock is used for serializing image pulls (on older versions of Docker) and for ensuring that we don't delete an image at the same time we're pulling it (this is important for the task launch workflow, since we don't want to successfully pull the image and then fail to create the container because the image has gone missing between when the pull finished and container creation started). The imageManager.updateLock is used for synchronization of the internal data structure of the imageManager, which tracks images and selects images for deletion.

When we're pulling an image, the ImagePullDeleteLock is acquired first for the actual pull and then the imageManager.updateLock is acquired when recording the metadata about the image. When we're deleting an image, the locks are acquired in the reverse order. This mismatch in lock acquisition order can create a deadlock: one goroutine can hold ImagePullDeleteLock and block while trying to acquire imageManager.updateLock, while another goroutine holds imageManager.updateLock and blocks while trying to acquire ImagePullDeleteLock.

This became significantly more likely to occur in 1.14.2 because the scope of imageManager.updateLock was modified to fix a different race condition involving concurrent writes to the imageManager's internal data structure (see #707 and #743).
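
Sketched minimally and outside the agent's codebase, the problematic pattern looks like this (a self-contained Go illustration; the variable names only mirror the two locks involved):

package main

import (
    "sync"
    "time"
)

var (
    imagePullDeleteLock sync.Mutex // stands in for ImagePullDeleteLock
    updateLock          sync.Mutex // stands in for imageManager.updateLock
    wg                  sync.WaitGroup
)

// Pull path: the outer image lock first, then the metadata lock.
func pullImage() {
    defer wg.Done()
    imagePullDeleteLock.Lock()
    defer imagePullDeleteLock.Unlock()
    time.Sleep(10 * time.Millisecond) // widen the race window for the demo
    updateLock.Lock()                 // blocks if deleteImage already holds it
    defer updateLock.Unlock()
}

// Delete path: the metadata lock first, then the outer image lock -- the reverse order.
func deleteImage() {
    defer wg.Done()
    updateLock.Lock()
    defer updateLock.Unlock()
    time.Sleep(10 * time.Millisecond)
    imagePullDeleteLock.Lock() // blocks if pullImage already holds it -> deadlock
    defer imagePullDeleteLock.Unlock()
}

func main() {
    wg.Add(2)
    go pullImage()
    go deleteImage()
    wg.Wait() // both goroutines block forever; the runtime reports "all goroutines are asleep - deadlock!"
}

A common remedy for this class of bug is to make every code path acquire the two locks in the same order, or to narrow one lock's scope so the two are never held at the same time.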

We're working on addressing the lock acquisition order mismatch now and will keep this issue open until that fix is released.

Thanks,
Sam

@samuelkarp (Contributor)

Correction: This race condition was introduced in 1.14.2. 1.14.1 does not have this deadlock, but does have a potential panic as described in #707.

@tejasmanohar commented Jun 8, 2017

@samuelkarp Got it, thanks! We'll stay on 1.14.1 for now. FWIW, if this is not going to be fixed ASAP, I'd recommend reverting the code that introduced the deadlock and re-releasing, because a panic (from which a process manager will likely restart the agent) is much, much better than a mysterious deadlock. Also, #834 may be helpful.

@samuelkarp (Contributor)

@tejasmanohar Yep, we're exploring both options right now.

@samuelkarp (Contributor)

amazon/amazon-ecs-agent:latest on Docker Hub now points to 1.14.1 again. We have a fix forward in progress here: #836.

@tejasmanohar

Good to hear. Thanks for acting quickly, @samuelkarp & team!

@sergiofigueras

Hello @samuelkarp,

Is there an ETA for when the fix will be available and a new version released?

Thanks!

@vsiddharth (Contributor)

@sergiofigueras We already have a fix merged on the dev branch.
The release process is under way and the new version should be out as soon as our testing and release process is complete.

@sharmaansh21

I am still getting 1.14.2 while fetching latest

[ec2-user@ip-172-31-91-192 ~]$ docker run amazon/amazon-ecs-agent:latest version
2017-06-12T23:59:05Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.2 (35acca3)

@nehalrp commented Jun 13, 2017

@optimisticanshul You would need to re-pull it:

ec2-user@ip-172-30-83-142 ~ » docker pull amazon/amazon-ecs-agent:latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest
ec2-user@ip-172-30-83-142 ~ » docker run -it amazon/amazon-ecs-agent:latest                                                                                          
2017-06-13T00:01:46Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

@sharmaansh21

@nehalrp It is a new EC2 instance.

@masutaka

@optimisticanshul

I am still getting 1.14.2 while fetching latest

Me too, so I've added the following workaround to our user data:

ECS_AGENT_VERSION=v1.14.1
docker pull amazon/amazon-ecs-agent:$ECS_AGENT_VERSION
docker tag amazon/amazon-ecs-agent:$ECS_AGENT_VERSION amazon/amazon-ecs-agent:latest

This way I avoid automatically updating the ecs-agent.

@vsiddharth (Contributor)

We have tagged v1.14.1 as latest on Docker Hub.

If you pull amazon/amazon-ecs-agent:latest, you should get v1.14.1:

docker pull amazon/amazon-ecs-agent                                                                                                                                                                                                    
Using default tag: latest
latest: Pulling from amazon/amazon-ecs-agent
202850a75283: Pull complete
572682c3a733: Pull complete
896c431e2a43: Pull complete
559716f174f2: Pull complete
Digest: sha256:8d4e316e0280402a2c666492e2e454506966c776c635408634496c9c2b012465
Status: Downloaded newer image for amazon/amazon-ecs-agent:latest

docker run amazon/amazon-ecs-agent                                                                                                                                                                                                     
2017-06-13T15:57:32Z [INFO] Starting Agent: Amazon ECS Agent - v1.14.1 (467c3d7)

@ernestm commented Jun 13, 2017

I'm also still getting v1.14.2 when I autoscale using the designated newest ECS AMI (ami-62745007 in us-east-2). Our ECS clusters autoscale all the time so we're not pulling the agent container manually ourselves. When the AMI starts, in ecs/ecs-init.log:

[root@ip-10-0-86-11 ecs]# more ecs-init.log
2017-06-13T14:30:03Z [INFO] pre-start
2017-06-13T14:30:03Z [INFO] Downloading Amazon EC2 Container Service Agent
2017-06-13T14:30:03Z [DEBUG] Downloading published md5sum from https://s3.amazonaws.com/amazon-ecs-agent/ecs-agent-v1.14.2.tar.md5

I guess whatever the AMI does isn't just "pull latest."

@samuelkarp (Contributor)

I guess whatever the AMI does isn't just "pull latest."

That's correct. Each of the ECS-optimized AMIs has a specific version of the agent that is pre-cached. Similarly, when you install ecs-init on a regular Amazon Linux instance, it'll grab a specific version of the agent rather than "latest".

MiguelMoll pushed a commit to convox/rack that referenced this issue Jun 13, 2017
@ernestm commented Jun 13, 2017

Right... But on
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI_launch_latest.html and http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html it says this AMI should be getting me 1.14.1, and it's not, so I'm not sure, as an ECS AMI consumer, what I'm supposed to do to downgrade agent versions. Though maybe some of these recent commits are for that?

@vsiddharth (Contributor)

We have now published a new AMI that includes the latest agent release.
The latest agent includes fixes to resolve the reported deadlock.

The new AMIs can be found here

@sharmaansh21 commented Jun 29, 2017

@vsiddharth @samuelkarp Even on 1.14.3, my tasks are stuck in the PENDING state for hours.

@vsiddharth (Contributor)

@optimisticanshul The issue discovered with agent version v1.14.2 should be mitigated by v1.14.3. Would it be possible for you to share the logs for further investigation?

Please have a look here for the logs we require.

@sharmaansh21

Definitely. Where do I send them?

@sharmaansh21 commented Jun 30, 2017

Steps to reproduce:

  1. ECS Cluster 2 instances.
  2. Deploy a sample service.
  3. Redeploy

In the ECS agent log:

./ecs-agent.log.2017-06-30-03:2017-06-30T03:55:43Z [INFO] Stopping container module="TaskEngine" task="prod-whatever:108 arn:aws:ecs:us-east-1:ACCOUNT-ID:task/1730060e-66e6-4795-ac20-551f4d09dd7a, Status: (RUNNING->STOPPED) Containers: [prod-whatever (RUNNING->STOPPED),]" container="prod-whatever(ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME) (RUNNING->STOPPED)"

But the output of docker ps shows the container is still running:

274ba55bfca5        ACCOUNT-ID.dkr.ecr.us-east-1.amazonaws.com/whatever:IMAGE_NAME   "/bin/sh -c 'bundl..."   25 minutes ago      Up 25 minutes       80/tcp, 0.0.0.0:10283->8080/tcp   ecs-prod-whatever-108-prod-whatever-feb2cdae8894f389d001

New tasks are stuck in the PENDING state. If I manually run docker stop 274ba55bfca5, the new tasks deploy.

I am not sure whether it's related to this issue or something else.

@vsiddharth (Contributor)

@optimisticanshul You can send them over to sidvin at amazon.com.
