ECS Pull timeout from inactivity stuck in endless loop #1396

Closed
jonstelly opened this issue May 25, 2018 · 8 comments

Comments

@jonstelly

Summary

ECS pull Inactivity Timeout is preventing new containers from starting

Description

A Windows ECS host pulling an image based on windowsservercore from ECR times out during the pull due to inactivity. The pull is canceled, then a new pull attempt starts and also fails, and the loop repeats forever. This is in the ap-southeast-2 region.

Expected Behavior

That a pull of an image from ECR to an ECS instance in the same region reliably succeeds on the first try.

The EC2 Instance is so beat up CPU-wise by the constant pulling of images, I can barely do anything on the machine (RDPed from US to AUS) but I'm trying to capture logs.

I've tried setting allow-nondistributable-artifacts to push the windows OS base layers but I don't think that's going to help because of a bug with the older version of Docker on Windows that wouldn't pull those layers from the private repository.

At the very least, there needs to be a configuration switch like the container start time limits so this behavior can be turned off because I don't think the Docker idle detection code is working the way we think. Network access from AUS to US docker hub is slow but every time I manually connect to the server and do a docker pull from the public hub, it succeeds.

Observed Behavior

Endless loop of failure

Environment Details

ECS EC2 Info
Windows_Server-2016-English-Full-ECS_Optimized-2018.03.26 (ami-3cfc305e)
t2.medium with t2 unlimited on

ECS Agent
1.18.0

Docker Info

PS C:\Users\Administrator> docker info
Containers: 1
Running: 0
Paused: 0
Stopped: 1
Images: 5
Server Version: 17.06.2-ee-7
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd json-file logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 14393 (14393.2155.amd64fre.rs1_release_1.180305-1842)
Operating System: Windows Server 2016 Datacenter
OSType: windows
Architecture: x86_64
CPUs: 2
Total Memory: 4GiB
Name: EC2AMAZ-RKOL2QF
ID: H23L:IZ6Q:CQRP:2JTZ:JZVJ:BYJG:V2HP:4B3E:QXE7:BAMN:Z4PA:SKDD
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

Supporting Log Snippets

Event Log events (the ECS Agent event, a warning, happens first, followed by the Docker event, an error):

AmazonECSAgent - DockerGoClient: failed to pull image 123.dkr.ecr.ap-southeast-2.amazonaws.com/myimage:1.2.3: inactivity time exceeded timeout

Docker - The description for Event ID 1 from source docker cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

Not continuing with pull after error: context canceled
@haikuoliu
Contributor

Hi @jonstelly,

Currently dockerPullInactivityTimeout is 1 minute, and if an image pull times out due to inactivity, the Agent retries with a backoff.
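
To make the failure mode concrete, here is a minimal Python sketch (not the Agent's actual Go code) of an inactivity timeout combined with a retry loop. The function names, the progress timestamps, and the attempt cap are all hypothetical, and the backoff delay between attempts is left out because it does not change the outcome. It illustrates why retrying cannot help when the same stall recurs on every attempt:

```python
from typing import Optional, Sequence


def cancel_time(progress_times: Sequence[float], timeout_s: float) -> Optional[float]:
    """Return when an inactivity timer would cancel the pull, or None if it finishes.

    progress_times: seconds since the pull started at which progress was seen
    (hypothetical data; the last entry marks completion).
    """
    last_progress = 0.0
    for t in progress_times:
        if t - last_progress > timeout_s:
            # No progress for longer than the timeout: the pull is canceled.
            return last_progress + timeout_s
        last_progress = t
    return None


def successful_attempt(progress_times: Sequence[float], timeout_s: float,
                       max_attempts: int) -> Optional[int]:
    """Retry an identical pull; return the 1-based attempt that succeeds, or None."""
    for attempt in range(1, max_attempts + 1):
        if cancel_time(progress_times, timeout_s) is None:
            return attempt
    return None


# A pull that stalls for 90 s between its second and third progress events.
slow_pull = [10, 40, 130, 150]
print(cancel_time(slow_pull, 60))             # canceled at the 100 s mark
print(successful_attempt(slow_pull, 60, 5))   # None: every retry hits the same stall
print(successful_attempt(slow_pull, 180, 5))  # 1: a longer timeout lets it finish
```

Under these assumptions, only a longer timeout (or a pull that no longer stalls) breaks the loop; more retries do not.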

We will fix it by making dockerPullInactivityTimeout configurable.

To mitigate, you can roll back to version 1.17.2, where we don't fail the image pull when an inactivity timeout happens.

Or you can try ECS_IMAGE_PULL_BEHAVIOR, which was introduced recently to define pull behavior, if it fits your use case; see details here

Thanks,
Haikuo

@jonstelly
Author

@haikuoliu thanks. I don't think we can roll back to 1.17.2 because we've set ECS_CONTAINER_START_TIMEOUT to 15 minutes and that setting is new in 1.17.3. I'm going to go back and look at my notes but I believe I was seeing two different errors when we were pulling images and starting containers. One is the idle timeout error mentioned above but I also believe I saw the error where the container startup times out after 8m (the default on windows).

I believe the behavior for start timeout before 1.17.3 was that the 8m limit was hard-coded and not configurable, is that right? So if I go back to 1.17.2 it may fix this idle timeout issue, but I'll just end up hitting the start timeout issue.

Thanks,
Jon

@haikuoliu
Contributor

Yes, you are correct, the inactivity timeout and start container timeout were both introduced in 1.17.3.

Do you have a lot of instances to maintain, or do you need to change the image frequently? If not, you can try setting ECS_IMAGE_PULL_BEHAVIOR to prefer-cached and pulling the image manually before running tasks.
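
As a rough illustration of why prefer-cached sidesteps the problem, here is a simplified Python model of the pull decision. The function is hypothetical and is not the Agent's code; it assumes that the default behavior always attempts a remote pull and that prefer-cached pulls only when no local copy exists. Other modes are deliberately left out.

```python
def should_pull(behavior: str, image_cached: bool) -> bool:
    """Decide whether a remote pull is attempted before starting a task.

    Simplified model covering only two ECS_IMAGE_PULL_BEHAVIOR values.
    """
    if behavior == "default":
        # Assumption: the default mode always tries the registry first.
        return True
    if behavior == "prefer-cached":
        # A manually pre-pulled image means no pull, so no inactivity timeout.
        return not image_cached
    raise ValueError(f"behavior not modeled in this sketch: {behavior!r}")


print(should_pull("default", image_cached=True))         # True
print(should_pull("prefer-cached", image_cached=True))   # False
print(should_pull("prefer-cached", image_cached=False))  # True
```

With the image pulled manually beforehand, the prefer-cached path never reaches the registry, so the inactivity timer never starts.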

@jonstelly
Author

It's not a lot of instances and images don't change frequently... We've started looking at adding some commands to our user data script to pre-load the images and I'll look into changing the pull behavior too.
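
A pre-load step along these lines could be sketched in Python as below. This is only an illustration: the helper name is hypothetical, it assumes the docker CLI is on the PATH and that the host is already authenticated against the registry, and it is not the user data script we actually use.

```python
import subprocess
from typing import Callable, Iterable, List


def preload_images(
    images: Iterable[str],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> List[str]:
    """Pull each image with `docker pull`; return the images that failed.

    `run` is injectable so the loop can be exercised without Docker installed.
    """
    failed = []
    for image in images:
        result = run(["docker", "pull", image])
        if result.returncode != 0:
            # Keep going so one bad image does not block the rest.
            failed.append(image)
    return failed
```

For example, calling preload_images(["123.dkr.ecr.ap-southeast-2.amazonaws.com/myimage:1.2.3"]) at instance startup would cache the image before any task is scheduled, and an empty return value would mean every pull succeeded.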

@GaryAult

GaryAult commented Jul 3, 2018

This is impacting us too. I'm going to try the older agent version.

@mazaruz

mazaruz commented Jul 18, 2018

I'm using ECS Agent version 1.17.3 with Windows_Server-2016-English-Full-ECS_Optimized-2018.05.17 (ami-8e4b7bf2) running on m5.large, and I hit this issue too. My ECS instance can now pull the image from ECR. Here's my workaround:

Once the EC2 instance has launched, remote into the server and add the following environment variables to Windows:
Name: ECS_CONTAINER_START_TIMEOUT
Value: 15m

Name: ECS_IMAGE_PULL_BEHAVIOR
Value: prefer-cached

Then restart the ECS Agent service.

@petderek
Contributor

Configurable timeouts were added in #1566 and released with agent 1.21.0. That agent shipped in our new Linux AMIs today and will be included in our next Windows release.

petderek removed this from the 1.21.0 milestone Oct 18, 2018
@petderek
Contributor

petderek commented Nov 2, 2018

We've updated the Windows AMIs today to include agent 1.21.0. The newest agent has a tunable parameter: ECS_IMAGE_PULL_INACTIVITY_TIMEOUT.

You can raise this value above the default (3 minutes on Windows) as needed, depending on the size of your containers.
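
As a rough sketch of how the setting and its default interact, the Python below resolves an effective timeout from a set of environment variables. It is not the agent's parsing code: the helper names are hypothetical, and it assumes the value is a simple duration string such as 15m, the same form used for ECS_CONTAINER_START_TIMEOUT earlier in this thread.

```python
import re
from typing import Mapping

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
WINDOWS_DEFAULT_S = 180  # 3 minutes, the Windows default stated above


def parse_duration(value: str) -> int:
    """Convert a simple duration such as '15m' to seconds (assumed format)."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def pull_inactivity_timeout(env: Mapping[str, str]) -> int:
    """Return the effective pull inactivity timeout in seconds."""
    raw = env.get("ECS_IMAGE_PULL_INACTIVITY_TIMEOUT")
    return parse_duration(raw) if raw else WINDOWS_DEFAULT_S


print(pull_inactivity_timeout({}))                                            # 180
print(pull_inactivity_timeout({"ECS_IMAGE_PULL_INACTIVITY_TIMEOUT": "10m"}))  # 600
```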

petderek closed this as completed Nov 2, 2018