Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing Requires to Wants to make ECS more resilient and check if docker service is running in ECS service file #4233

Merged
merged 1 commit into from
Jul 4, 2024

Conversation

mye956
Copy link
Contributor

@mye956 mye956 commented Jul 3, 2024

Summary

The goal of this PR is to make the ECS service more resilient in regards to when docker is up and available. There can be cases where docker goes down during instance bootstrapping (e.g. upgrades/restarts) and ECS service is trying to start up. However, it can't since docker is down and will fail due to a service dependency error which isn't taken into consideration for the restart policy that we defined. In other words, since the ECS service never starts up it's not consider a failure. To try to address this inconsistent race condition, we will be changing Requires to Wants and will have a check if docker is up and running during PreStart for the ECS service.

The reason why we're switching over to Wants is because it's less restrictive than Requires. Essentially, if the units within the ecs.service file are not found or fail to start, the current unit(ECS) will continue to function. This is the recommended way to configure most dependency relationships. Again, this implies a parallel activation unless modified by other directives. (ref)

In addition, we'll also have the ECS service to check if docker is running as part of its PreStart so that it can fail a bit faster. Otherwise, it will fail once ECS reaches its Start phase where it tries to start up the ECS Agent container.

Implementation details

  • Changed Requires to Wants. These were made in the following files:
    • packaging/amazon-linux-ami-integrated/ecs.service
    • packaging/generic-deb-integrated/debian/ecs.service
    • packaging/generic-rpm-integrated/ecs.service
  • Added check if docker is up and running within packaging/amazon-linux-ami-integrated/ecs.service via systemctl is-active docker
    • If it's not, we will exit with error code 1 (Note: We've defined ECS to ignore exit code 5)
    • We won't include this check in the other service files as the location of systemctl can vary across distros. Instead, we will look to change our public docs regarding installing ECS on non-AL AMIs (ref)

Testing

Built a custom AMI off of an older ami (specifically 20220304) where docker would do an auto-upgrade.

Load tested with 16600 instances and all were able to start up ECS after a docker upgrade

NEW INSTANCES LAUNCHED = 100 ; TOTAL INSTANCES LAUNCHED = 16600
TOTAL INSTANCES REGISTERED = 16600 / 16600

A few other edge cases that were tested:

  • Stopping both ECS + Docker -> Start back up ECS
    Behavior: Once ECS tried starting back up, docker will also start back up
  • Stopping just Docker while ECS is running
    Behavior:ECS service will fail. After a bit, ECS will restart which will also bring docker back up
  • Disabling just Docker while ECS is running
    Behavior: ECS service will fail. After a bit, ECS will restart which will also bring docker back up
  • Run ECS service on a vanilla AL2 AMI
    Behavior: ECS service will never reach a running state and will fail due to docker not being present when it tries starting up the ECS Agent docker container. It will retry starting up every 10 second which is expected based on what we’ve defined in our systemd service unit file for ECS.

Old Behavior:

  • Stopping both ECS + Docker -> Start back up ECS
    Behavior: Once ECS starts back up, docker will also start back up
  • Stopping just Docker while ECS is running
    Behavior: ECS will also go down and will remain down. It will only start back up if a user manually starts it back up.
  • Disabling just Docker while ECS is running
    Behavior: ECS will also go down and will remain down. It will only start back up if a user manually starts it back up.
  • Run ECS on a vanilla AL2 AMI
    Behavior: ECS fails to start up at all.

Description for the changelog

Enhancement: Change ECS service file from requires to wants and checking if docker is running during ECS service PreStart

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mye956 mye956 added the bot/test label Jul 3, 2024
@mye956 mye956 marked this pull request as ready for review July 3, 2024 17:11
@mye956 mye956 requested a review from a team as a code owner July 3, 2024 17:11
@yinyic
Copy link
Contributor

yinyic commented Jul 3, 2024

Could we also add to the testing section what the current (old) behavior is for comparison? Otherwise the change looks good. Thanks Mike!

@mye956 mye956 force-pushed the yemike/fix-ecs-service branch from 7057331 to c0abf3a Compare July 3, 2024 21:04
@mye956 mye956 added the bot/test label Jul 3, 2024
@mye956 mye956 force-pushed the yemike/fix-ecs-service branch from c0abf3a to dfedbaa Compare July 3, 2024 22:57
@mye956 mye956 added the bot/test label Jul 3, 2024
@mye956 mye956 merged commit 93ad0b1 into aws:dev Jul 4, 2024
40 checks passed
@amogh09 amogh09 mentioned this pull request Jul 8, 2024
@ryzr
Copy link

ryzr commented Aug 7, 2024

Any chance this is related?
#4270

My ecs-agent service suddenly stopped on all of my instances and doesn't attempt to restart. Not sure if possibly due to SSM automatically applying updates?

EDIT: reverting this change fixed my issue

@juangc-wavyssa
Copy link

Same issue here. How can we force the agent to use 1.84.0? We are having this issue for all our 49 ECS.Happened on last two sundays in all ours ECS, and we had to reboot all the agents manually, and yesterday, only on one ECS.

@danehlim
Copy link
Contributor

danehlim commented Aug 9, 2024

@ryzr @juangc-wavyssa thank you for reaching out. Our team is aware of this issue and currently in the process of addressing it as part of #4277.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants