Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding PartOf in ECS Service's Docker Service dependency #4277

Merged
merged 3 commits into from
Aug 12, 2024

Conversation

hozkaya2000
Copy link
Contributor

@hozkaya2000 hozkaya2000 commented Aug 9, 2024

Summary

Update ECS Service to be closer to pre-1.85.0 behavior while maintaining critical aspects of resiliency introduced here: #4233

Implementation details

Added PartOf=docker.service to the following files:
packaging/amazon-linux-ami-integrated/ecs.service
packaging/generic-deb-integrated/debian/ecs.service
packaging/generic-rpm-integrated/ecs.service

This addition ensures that ecs.service stops when docker is stopped manually and does not automatically come back up. It will also ensure that restarting docker will restart ecs.service as it used to before 1.85.0.

Testing

Setup

  1. Created a custom ecs-init rpm with the additions.
  2. Then, using this rpm, created a custom AMI using an old AMI: amzn2-ami-ecs-hvm-2.0.20230428-x86_64-ebs as a template.

Behavior

Verified that behavior was changed to pre-1.85.0 behavior [without reverting resiliency]:

  • Services start automatically on instance launch, able to register to cluster
  • When ecs is stopped manually, it doesn't restart by itself
  • When ecs is stopped manually, it doesn't restart when docker is restarted
  • When docker is stopped manually, ecs is stopped in response
  • When docker is started from a stopped state, ecs is not started
  • When docker is restarted while running manually, ecs restarts if it was running

Resiliency

Simulated race condition that would cause the issue mentioned/addressed in this pull request: #4233.

Simulated a docker socket failure while launching ecs with the following command on a test instance with the specified custom AMI:
sudo systemctl start ecs & sudo systemctl restart docker.socket &

This simulates a failure because occasionally, since these are running in parallel, ecs.service will fail when docker.socket is restarting right while ecs is checking for the docker dependency.

The result was a success, as ecs.service came back up automatically.

In comparison, this old AMI without the updated package, in the same failure scenario, was not able to bring ecs.service back up:

Load Testing

Launched 15000 instances with the specified custom AMI and verified that all 15000 of them were able to register to ECS.

New tests cover the changes: yes

Description for the changelog

Revert ecs.service behavior while maintaining resiliency

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hozkaya2000 hozkaya2000 requested a review from a team as a code owner August 9, 2024 22:50
@hozkaya2000 hozkaya2000 changed the title Adding PartOf to Wants in ECS Service's Docker Service dependency Adding PartOf in ECS Service's Docker Service dependency Aug 9, 2024
danehlim
danehlim previously approved these changes Aug 9, 2024
@singholt
Copy link
Contributor

singholt commented Aug 9, 2024

Please update this commit's message: e7c5fc3. It makes one think you are replacing Wants with PartOf, which isn't the case.

And don't forget to squash your commits before merging, thanks!

@hozkaya2000 hozkaya2000 merged commit 1333a7d into aws:dev Aug 12, 2024
40 checks passed
singholt pushed a commit to singholt/amazon-ecs-agent that referenced this pull request Aug 13, 2024
singholt pushed a commit to singholt/amazon-ecs-agent that referenced this pull request Aug 13, 2024
@hozkaya2000 hozkaya2000 deleted the hozkaya2000/fix-ecs-service branch August 15, 2024 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants