-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changing Requires to Wants to make ECS more resilient and check if docker service is running in ECS service file #4233
Conversation
Could we also add to the testing section what the current (old) behavior is for comparison? Otherwise the change looks good. Thanks Mike! |
7057331
to
c0abf3a
Compare
…er service is up during PreStart
c0abf3a
to
dfedbaa
Compare
Any chance this is related? My ecs-agent service suddenly stopped on all of my instances and doesn't attempt to restart. Not sure if possibly due to SSM automatically applying updates? EDIT: reverting this change fixed my issue |
Same issue here. How can we force the agent to use 1.84.0? We are having this issue for all our 49 ECS.Happened on last two sundays in all ours ECS, and we had to reboot all the agents manually, and yesterday, only on one ECS. |
@ryzr @juangc-wavyssa thank you for reaching out. Our team is aware of this issue and currently in the process of addressing it as part of #4277. |
Summary
The goal of this PR is to make the ECS service more resilient in regards to when docker is up and available. There can be cases where docker goes down during instance bootstrapping (e.g. upgrades/restarts) and ECS service is trying to start up. However, it can't since docker is down and will fail due to a service dependency error which isn't taken into consideration for the restart policy that we defined. In other words, since the ECS service never starts up it's not consider a failure. To try to address this inconsistent race condition, we will be changing
Requires
toWants
and will have a check if docker is up and running during PreStart for the ECS service.The reason why we're switching over to
Wants
is because it's less restrictive thanRequires
. Essentially, if the units within theecs.service
file are not found or fail to start, the current unit(ECS) will continue to function. This is the recommended way to configure most dependency relationships. Again, this implies a parallel activation unless modified by other directives. (ref)In addition, we'll also have the ECS service to check if docker is running as part of its PreStart so that it can fail a bit faster. Otherwise, it will fail once ECS reaches its Start phase where it tries to start up the ECS Agent container.
Implementation details
Requires
toWants
. These were made in the following files:packaging/amazon-linux-ami-integrated/ecs.service
packaging/generic-deb-integrated/debian/ecs.service
packaging/generic-rpm-integrated/ecs.service
packaging/amazon-linux-ami-integrated/ecs.service
viasystemctl is-active docker
Testing
Built a custom AMI off of an older ami (specifically
20220304
) where docker would do an auto-upgrade.Load tested with 16600 instances and all were able to start up ECS after a docker upgrade
A few other edge cases that were tested:
Behavior: Once ECS tried starting back up, docker will also start back up
Behavior:ECS service will fail. After a bit, ECS will restart which will also bring docker back up
Behavior: ECS service will fail. After a bit, ECS will restart which will also bring docker back up
Behavior: ECS service will never reach a running state and will fail due to docker not being present when it tries starting up the ECS Agent docker container. It will retry starting up every 10 second which is expected based on what we’ve defined in our systemd service unit file for ECS.
Old Behavior:
Behavior: Once ECS starts back up, docker will also start back up
Behavior: ECS will also go down and will remain down. It will only start back up if a user manually starts it back up.
Behavior: ECS will also go down and will remain down. It will only start back up if a user manually starts it back up.
Behavior: ECS fails to start up at all.
Description for the changelog
Enhancement: Change ECS service file from requires to wants and checking if docker is running during ECS service PreStart
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.