ECS Pull timeout from inactivity stuck in endless loop #1396
Comments
Hi @jonstelly, currently […]. We will fix it by making […]. To mitigate, you can roll back to version 1.17.2, where we don't fail the image pull when the inactivity timeout happens. Or you can try […]. Thanks,
@haikuoliu thanks. I don't think we can roll back to 1.17.2 because we've set ECS_CONTAINER_START_TIMEOUT to 15 minutes, and that setting is new in 1.17.3. I'm going to go back and look at my notes, but I believe I was seeing two different errors when we were pulling images and starting containers. One is the idle-timeout error mentioned above, but I also believe I saw the error where the container startup times out after 8m (the default on Windows). I believe the behavior for start timeout before 1.17.3 was that the 8m limit was hard-coded and not configurable, is that right? So if I go back to 1.17.2, it may fix this idle-timeout issue, but I'll just end up hitting the start-timeout issue. Thanks,
Yes, you are correct; the inactivity timeout and the start container timeout were both introduced in 1.17.3. Do you have a lot of instances to maintain, or do you need to change the image frequently? If not, you can try setting […].
It's not a lot of instances, and images don't change frequently. We've started looking at adding some commands to our user data script to pre-load the images, and I'll look into changing the pull behavior too.
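A user-data pre-pull along the lines described above could look like the following PowerShell sketch. This is not the poster's actual script, and the image name is a placeholder:

```powershell
<powershell>
# Pre-pull large base images at instance launch so that later task starts
# hit the local Docker cache instead of the agent's pull inactivity timeout.
# The image name below is a placeholder, not taken from the original post.
docker pull microsoft/windowsservercore
</powershell>
```

Because the Windows base layers are several gigabytes, pulling them once at launch (when no task is waiting) sidesteps the per-task pull timeouts entirely.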
This is impacting us too. I'm going to try the older agent version.
I'm using ECS agent version 1.17.3 with Windows_Server-2016-English-Full-ECS_Optimized-2018.05.17 (ami-8e4b7bf2) running on m5.large, and I did face this issue. But now my ECS instance can pull the image from ECR. Here's my workaround: once the EC2 instance has launched, remote into the server and add the environment variable below to Windows:
Name: ECS_IMAGE_PULL_BEHAVIOR
Then restart the ECS agent service...
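A minimal PowerShell sketch of this workaround follows. The value assigned in the comment above was not preserved; `prefer-cached` below is an assumption (it is one of the documented values for ECS_IMAGE_PULL_BEHAVIOR), and `AmazonECS` is the agent's service name on the ECS-Optimized Windows AMI:

```powershell
# Set the pull behavior machine-wide so the ECS agent sees it.
# "prefer-cached" is an assumed value; the original comment omits it.
[Environment]::SetEnvironmentVariable("ECS_IMAGE_PULL_BEHAVIOR", "prefer-cached", "Machine")

# Restart the agent service so it picks up the new environment variable.
Restart-Service AmazonECS
```

With `prefer-cached`, the agent only pulls when the image is absent from the local cache, so a slow ECR pull happens at most once per instance.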
Configurable timeouts were added in #1566 and shipped with agent 1.21.0. That agent was included in our new Linux AMIs today and will be included in our next Windows release.
We've updated the Windows AMIs today to include agent 1.21.0. The newest agent has a tunable parameter: […]. You can set this value higher than the default (3 minutes on Windows) as needed, depending on the size of your containers.
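As a sketch of how the 1.21.0 tunable could be raised, assuming the `ECS_IMAGE_PULL_INACTIVITY_TIMEOUT` environment variable introduced for this purpose (the name is elided in the comment above, so treat it as an assumption to verify against the agent documentation):

```powershell
# Raise the image pull inactivity timeout before the agent starts.
# The variable takes a Go-style duration string; "10m" is an
# illustrative value, not a recommendation from this thread.
[Environment]::SetEnvironmentVariable("ECS_IMAGE_PULL_INACTIVITY_TIMEOUT", "10m", "Machine")
Restart-Service AmazonECS
```

A longer inactivity window gives slow cross-region layer downloads (such as the windowsservercore base layers described in this issue) time to make progress before the agent cancels the pull.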
Summary
ECS pull Inactivity Timeout is preventing new containers from starting
Description
A Windows ECS host pulling an image from ECR based on windowsservercore times out during the pull due to inactivity. This causes the pull to be canceled; a new pull attempt then starts, which also fails, and the loop repeats forever. This is in the ap-southeast-2 region.
Expected Behavior
That a pull of an image from ECR to an ECS instance in the same region reliably succeeds on the first try.
The EC2 instance is so beat up CPU-wise by the constant image pulls that I can barely do anything on the machine (RDP'd from the US to Australia), but I'm trying to capture logs.
I've tried setting allow-nondistributable-artifacts to push the Windows OS base layers, but I don't think that's going to help because of a bug in the older version of Docker on Windows that wouldn't pull those layers from the private repository.
At the very least, there needs to be a configuration switch, like the container start timeout limits, so this behavior can be turned off, because I don't think the Docker idle-detection code is working the way we think. Network access from Australia to the US Docker Hub is slow, but every time I manually connect to the server and do a docker pull from the public hub, it succeeds.
Observed Behavior
Endless loop of failure
Environment Details
ECS EC2 Info
Windows_Server-2016-English-Full-ECS_Optimized-2018.03.26 (ami-3cfc305e)
t2.medium with t2 unlimited on
ECS Agent
1.18.0
Docker Info
PS C:\Users\Administrator> docker info
Containers: 1
Running: 0
Paused: 0
Stopped: 1
Images: 5
Server Version: 17.06.2-ee-7
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd json-file logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 14393 (14393.2155.amd64fre.rs1_release_1.180305-1842)
Operating System: Windows Server 2016 Datacenter
OSType: windows
Architecture: x86_64
CPUs: 2
Total Memory: 4GiB
Name: EC2AMAZ-RKOL2QF
ID: H23L:IZ6Q:CQRP:2JTZ:JZVJ:BYJG:V2HP:4B3E:QXE7:BAMN:Z4PA:SKDD
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Supporting Log Snippets
Event Log events (the ECS Agent event, a warning, occurs first, followed by the Docker event, an error)