ECS Pull timeout from inactivity stuck in endless loop #1396

jonstelly · 2018-05-25T18:05:54Z

Summary

ECS pull Inactivity Timeout is preventing new containers from starting

Description

A Windows ECS host pulling an image from ECR based on windowsservercore times out during the pull due to inactivity. This causes the pull to be canceled, then a new pull attempt is started, which also fails, and the loop repeats forever. This is in the ap-southeast-2 region.

Expected Behavior

That a pull of an image from ECR to an ECS instance in the same region reliably succeeds on the first try.

The EC2 Instance is so beat up CPU-wise by the constant pulling of images, I can barely do anything on the machine (RDPed from US to AUS) but I'm trying to capture logs.

I've tried setting allow-nondistributable-artifacts to push the Windows OS base layers, but I don't think that's going to help because of a bug in the older version of Docker on Windows that wouldn't pull those layers from the private repository.

At the very least, there needs to be a configuration switch, like the container start time limits, so this behavior can be turned off, because I don't think the Docker idle detection code is working the way we think. Network access from AUS to the US Docker Hub is slow, but every time I manually connect to the server and do a docker pull from the public hub, it succeeds.

Observed Behavior

Endless loop of failure

Environment Details

ECS EC2 Info

Windows_Server-2016-English-Full-ECS_Optimized-2018.03.26 (ami-3cfc305e)
t2.medium with t2 unlimited on

ECS Agent

1.18.0

Docker Info

PS C:\Users\Administrator> docker info
Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 5
Server Version: 17.06.2-ee-7
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd json-file logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 14393 (14393.2155.amd64fre.rs1_release_1.180305-1842)
Operating System: Windows Server 2016 Datacenter
OSType: windows
Architecture: x86_64
CPUs: 2
Total Memory: 4GiB
Name: EC2AMAZ-RKOL2QF
ID: H23L:IZ6Q:CQRP:2JTZ:JZVJ:BYJG:V2HP:4B3E:QXE7:BAMN:Z4PA:SKDD
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Supporting Log Snippets

Event Log entries (the ECS Agent event, a warning, is logged first, followed by the Docker event, an error):

AmazonECSAgent - DockerGoClient: failed to pull image 123.dkr.ecr.ap-southeast-2.amazonaws.com/myimage:1.2.3: inactivity time exceeded timeout

Docker - The description for Event ID 1 from source docker cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event: 

Not continuing with pull after error: context canceled


haikuoliu · 2018-05-25T20:18:53Z

Hi @jonstelly,

Currently dockerPullInactivityTimeout is 1 min, and if an image pull times out due to inactivity, the Agent retries with a backoff.

We will fix it by making dockerPullInactivityTimeout configurable.

To mitigate, you can roll back to version 1.17.2 where we don't fail image pull when inactivity timeout happens.

Or you can try ECS_IMAGE_PULL_BEHAVIOR, which was introduced recently to define pull behavior, if it fits your use case; see details here

Thanks,
Haikuo

jonstelly · 2018-05-29T17:13:47Z

@haikuoliu thanks. I don't think we can roll back to 1.17.2 because we've set ECS_CONTAINER_START_TIMEOUT to 15 minutes and that setting is new in 1.17.3. I'm going to go back and look at my notes, but I believe I was seeing two different errors when we were pulling images and starting containers. One is the idle timeout error mentioned above, but I also believe I saw the error where the container startup times out after 8m (the default on Windows).

I believe the behavior for start timeout before 1.17.3 was that the 8m limit was hard-coded and not configurable, is that right? So if I go back to 1.17.2 it may fix this idle timeout issue, but I'll just end up hitting the start timeout issue.

Thanks,
Jon

haikuoliu · 2018-05-30T05:07:53Z

Yes, you are correct, the inactivity timeout and start container timeout were both introduced in 1.17.3.

Do you have a lot of instances to maintain, or do you need to change the image frequently? If not, you can try setting ECS_IMAGE_PULL_BEHAVIOR to prefer-cached and pull the image manually before running tasks.
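As a rough illustration of why prefer-cached sidesteps the timeout once the image has been pulled manually, the decision can be sketched in Go as below. This is a simplified model, not the agent's implementation: `shouldPull` is a hypothetical helper, and only the prefer-cached value and the default behavior (always attempt a remote pull) are modeled.

```go
package main

import "fmt"

// shouldPull reports whether a remote pull would be attempted.
// Hypothetical helper: only "prefer-cached" is treated specially here;
// any other value is modeled as "always attempt the pull".
func shouldPull(behavior string, imageCached bool) bool {
	switch behavior {
	case "prefer-cached":
		// Use the local image if present; pull only when it is missing.
		return !imageCached
	default:
		return true
	}
}

func main() {
	for _, cached := range []bool{false, true} {
		for _, behavior := range []string{"default", "prefer-cached"} {
			fmt.Printf("behavior=%s cached=%t pull=%t\n",
				behavior, cached, shouldPull(behavior, cached))
		}
	}
}
```

Once the image is in the local cache, the prefer-cached path never starts a pull, so the inactivity timeout never gets a chance to fire.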

jonstelly · 2018-05-30T22:08:11Z

It's not a lot of instances and images don't change frequently... We've started looking at adding some commands to our user data script to pre-load the images and I'll look into changing the pull behavior too.

GaryAult · 2018-07-03T19:52:28Z

This is impacting us too. I'm going to try the older agent version.

mazaruz · 2018-07-18T09:41:58Z

I'm using ECS Agent version 1.17.3 with Windows_Server-2016-English-Full-ECS_Optimized-2018.05.17 (ami-8e4b7bf2) running on m5.large, and I faced this issue too. My ECS instance can now pull the image from ECR. Here's my workaround:

Once the EC2 instance has launched, remote into the server and add the following environment variables to Windows:
Name: ECS_CONTAINER_START_TIMEOUT
Value: 15m

Name: ECS_IMAGE_PULL_BEHAVIOR
Value: prefer-cached

Then restart the ECS Agent service.

petderek · 2018-10-18T23:44:30Z

Configurable timeouts were added in #1566 and released with agent 1.21.0. It was released in our new Linux AMIs today and will be included in our next Windows release.

petderek · 2018-11-02T01:02:02Z

We've updated the Windows AMIs today to include agent 1.21.0. The newest agent has a tunable parameter: ECS_IMAGE_PULL_INACTIVITY_TIMEOUT.

You can set this to a value higher than the default (3 minutes on Windows) as needed, depending on the size of your containers.

haikuoliu added the kind/bug label May 25, 2018

haikuoliu added the scope/ECS Agent label May 31, 2018

richardpen mentioned this issue Jul 25, 2018: ECS tasks take a while to reach running from pending #1453 (Closed)

wattdave mentioned this issue Sep 10, 2018: Add ECS_DOCKER_PULL_INACTIVITY_TIMEOUT - Address issue 1396 #1566 (Closed)

petderek added this to the 1.21.0 milestone Oct 10, 2018

petderek removed this from the 1.21.0 milestone Oct 18, 2018

petderek closed this as completed Nov 2, 2018
