
ECS Agent should allow for more control over task cleanup #2968

Closed

jasonumiker opened this issue Jul 30, 2021 · 3 comments

Comments

jasonumiker commented Jul 30, 2021

When ECS stops a Task/container, it appears to do a docker stop and then wait, by default, three hours before it does the docker rm to clean up the Task.

There is an option in the agent today, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, that allows you to change that 3-hour default delay between the stop and the cleanup.

With images, the agent exposes more granular control of that process (e.g. the number of images deleted per cleanup cycle): https://docs.aws.amazon.com/AmazonECS/latest/developerguide/automated_image_cleanup.html
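For reference, these knobs are set on the container instance in /etc/ecs/ecs.config. A sketch with illustrative values (the documented defaults are roughly 3h for the task cleanup wait, 30m image cleanup interval, 5 images per cycle, and 1h minimum image age - check the docs linked above for the authoritative values):

```
# Delay between docker stop and docker rm for stopped Tasks
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1h

# Image cleanup knobs from the documentation linked above
ECS_IMAGE_CLEANUP_INTERVAL=30m
ECS_NUM_IMAGES_DELETE_PER_CYCLE=5
ECS_IMAGE_MINIMUM_CLEANUP_AGE=1h
```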

In my customer's use case, they stop thousands of Tasks at nearly the same time as part of their deployment and would like to stagger the cleanups more: exactly 3 hours after a deployment, their EC2 instances, and all the Tasks on them, are heavily impacted by the mass cleanup. They make heavy use of Docker ephemeral local volumes, so these Task cleanups trigger a lot of IO removing those temporary volumes when the ECS Agent does the docker rm.

Since the ECS Agent owns this operation and the logic around it, they were hoping for more control to throttle/stagger these Task cleanups and minimise the disruption, via new options similar to those available for images.

@ellenthsu
Thanks for filing the issue - we're looking into it

fenxiong (Contributor) commented Jul 30, 2021

Our current thought on how to fix this issue:

We will introduce a new ECS agent environment variable, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER, which can optionally be set to a time duration. When specified, each stopped task, instead of waiting ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION before cleanup, waits a random duration in the range [ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION + ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER]. This way, each task is cleaned up at a different time instead of all at the same time.

An alternative we considered: introduce ECS_ENGINE_TASK_CLEANUP_WAIT_JITTER_PERCENTAGE, with a default value of 100.0 and valid range [0.0, 100.0]. When each task is stopped, instead of waiting ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION before cleanup, it waits a random duration in the range [ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION * ECS_ENGINE_TASK_CLEANUP_WAIT_JITTER_PERCENTAGE / 100.0, ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION]. With a value below 100, each task is cleaned up at a different time instead of all at the same time, and the mass-cleanup issue should be alleviated.

Feel free to let us know if you have any comments about the proposed fix above.
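For illustration only (not the agent's actual implementation), a minimal Go sketch of the first proposal's math - picking a random cleanup wait in [wait, wait + jitter] so that tasks stopped together are cleaned up at different times:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredCleanupWait is a hypothetical helper: it returns a random wait
// duration in [base, base+jitter]. With jitter <= 0 it falls back to the
// plain base wait, matching today's behaviour.
func jitteredCleanupWait(base, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return base
	}
	return base + time.Duration(rand.Int63n(int64(jitter)+1))
}

func main() {
	base := 3 * time.Hour   // ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION (default)
	jitter := 1 * time.Hour // ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER (example)
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredCleanupWait(base, jitter)) // e.g. 3h12m..., 3h47m..., ...
	}
}
```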

@fenxiong (Contributor)

Fix has been added in version 1.55.0 - https://github.com/aws/amazon-ecs-agent/releases/tag/v1.55.0. Closing.
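For anyone applying this on a container instance, enabling the staggered cleanup should look roughly like the following in /etc/ecs/ecs.config, assuming the fix shipped under the variable name proposed above (check the v1.55.0 release notes and the agent README for the exact name and semantics):

```
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=3h
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER=1h
```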
