
ECS Agent should allow for more control over task cleanup #2968

Closed

jasonumiker opened this issue Jul 30, 2021 · 3 comments

Comments

jasonumiker commented Jul 30, 2021

When ECS stops a Task/container, it appears to do a docker stop and then wait, by default, three hours before it does the docker rm to clean up the Task.

There is an option in the agent today, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, that allows you to change that 3-hour default delay between the stop and the cleanup.

With images, the agent exposes more granular control of that process (e.g. the number of images deleted per cleanup cycle): https://docs.aws.amazon.com/AmazonECS/latest/developerguide/automated_image_cleanup.html
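For reference, these knobs are set on the container instance in /etc/ecs/ecs.config. A sketch with illustrative values (the documented defaults are roughly 3h for the task cleanup wait, 30m image cleanup interval, 5 images per cycle, and 1h minimum image age - check the docs linked above for the authoritative values):

```
# Delay between docker stop and docker rm for stopped Tasks
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1h

# Image cleanup knobs from the documentation linked above
ECS_IMAGE_CLEANUP_INTERVAL=30m
ECS_NUM_IMAGES_DELETE_PER_CYCLE=5
ECS_IMAGE_MINIMUM_CLEANUP_AGE=1h
```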

In my customer's use case, they stop thousands of Tasks at nearly the same time as part of their deployment and would like to stagger the cleanups more: exactly 3 hours after a deployment, their EC2 instances, and all the Tasks on them, are heavily impacted by the mass cleanup. They make heavy use of Docker ephemeral local volumes, so these Task cleanups trigger a lot of IO removing those temporary volumes when the ECS Agent does the docker rm.

Since the ECS Agent owns this operation and the logic around it, they were hoping for more control to throttle/stagger these Task cleanups and minimise the disruption, via new options similar to those available for images.

@ellenthsu
Thanks for filing the issue - we're looking into it

fenxiong (Contributor) commented Jul 30, 2021

Our current thought on how to fix this issue:

We will introduce a new ECS agent environment variable, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER, which can optionally be set to a time duration. When specified, each stopped task, instead of waiting ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION before cleanup, waits a random duration in the range [ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION + ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER]. This way, each task is cleaned up at a different time instead of all at the same time.

An alternative we considered: introduce ECS_ENGINE_TASK_CLEANUP_WAIT_JITTER_PERCENTAGE, with a default value of 100.0 and valid range [0.0, 100.0]. When each task is stopped, instead of waiting ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION before cleanup, it waits a random duration in the range [ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION * ECS_ENGINE_TASK_CLEANUP_WAIT_JITTER_PERCENTAGE / 100.0, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION]. With a value below 100, each task is cleaned up at a different time instead of all at the same time, and the mass-cleanup issue should be alleviated.

Feel free to let us know if you have any comments about the proposed fix above.
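For illustration only (not the agent's actual implementation), a minimal Go sketch of the first proposal's math - picking a random cleanup wait in [wait, wait + jitter] so that tasks stopped together are cleaned up at different times:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredCleanupWait is a hypothetical helper: it returns a random wait
// duration in [base, base+jitter]. With jitter <= 0 it falls back to the
// plain base wait, matching today's behaviour.
func jitteredCleanupWait(base, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return base
	}
	return base + time.Duration(rand.Int63n(int64(jitter)+1))
}

func main() {
	base := 3 * time.Hour   // ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION (default)
	jitter := 1 * time.Hour // ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER (example)
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredCleanupWait(base, jitter)) // e.g. 3h12m..., 3h47m..., ...
	}
}
```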

@fenxiong (Contributor)

Fix has been added in version 1.55.0 - https://github.com/aws/amazon-ecs-agent/releases/tag/v1.55.0. Closing.
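For anyone applying this on a container instance, enabling the staggered cleanup should look roughly like the following in /etc/ecs/ecs.config, assuming the fix shipped under the variable name proposed above (check the v1.55.0 release notes and the agent README for the exact name and semantics):

```
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=3h
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER=1h
```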
