Fix bug where pause container was not always cleaned up #2824

angelcar · 2021-03-02T04:21:03Z

Summary

This fixes a bug that can prevent the pause container network from being cleaned up when ECS Agent is restarted because of a dockerd restart. Failing to clean up the pause container network might cause problems when future tasks try to start.

When dockerd is restarted, all running containers (including ECS Agent) are stopped. After dockerd and ECS Agent restart successfully, the Agent stops all the tasks that were running before the restart, because it detects that their containers are no longer running. In this case the pause container network is not cleaned up, because the workflow differs from normal operation.

Implementation details

After the managed task is stopped, check whether the pause container still needs to be cleaned up. Most of the time the pause container has already been cleaned up by the time the managed task stops, but as mentioned above, there are scenarios where this might not happen.

Since container cleanup is now invoked in more than one place (previously it happened only in the stopContainer method), the operation was made idempotent by storing a flag in the container that indicates whether the container has been torn down.
For now the flag is used only for the pause container, but it might be useful in the future for other containers that need explicit teardown.
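
To illustrate the idempotency idea, here is a minimal sketch. It is not the code from this PR: the `Container` type, the `tornDown` field, and the `tearDownOnce` helper are hypothetical names chosen for the example. The sketch also assumes the flag is set before cleanup runs, so a failed cleanup is not retried; the actual change may order this differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Container is a hypothetical stand-in for the agent's container type.
type Container struct {
	Name string

	mu       sync.Mutex
	tornDown bool // set once teardown has been started for this container
}

// markTornDown sets the flag and reports whether this call was the first
// one to set it.
func (c *Container) markTornDown() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.tornDown {
		return false
	}
	c.tornDown = true
	return true
}

// tearDownOnce runs cleanup at most once per container, regardless of how
// many code paths call it.
func tearDownOnce(c *Container, cleanup func(*Container) error) error {
	if !c.markTornDown() {
		return nil // already torn down elsewhere, nothing to do
	}
	return cleanup(c)
}

func main() {
	calls := 0
	cleanup := func(c *Container) error {
		calls++
		return nil
	}

	pause := &Container{Name: "pause"}
	_ = tearDownOnce(pause, cleanup) // e.g. from the stop-container path
	_ = tearDownOnce(pause, cleanup) // e.g. from the check after the task stops

	fmt.Println("cleanup calls:", calls) // prints "cleanup calls: 1"
}
```

With a guard like this, both the normal stop path and the new post-stop check can request teardown without coordinating with each other.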

Testing

Several existing unit and integration tests check for proper pause container cleanup. Those tests should still pass because the previous logic is not modified; a few new assertions were added to check that the new teardown flag is set.

In addition, I manually tested that the pause container network is cleaned up under normal circumstances as well as after docker restarts.

New tests cover the changes: yes

Description for the changelog

Fix bug where pause container was not always cleaned up

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

sharanyad · 2021-03-02T18:47:50Z

agent/engine/docker_task_engine.go

+// checkTearDownPauseContainer idempotently tears down the pause container network when the pause container's known
+//or desired status is stopped.
+func (engine *DockerTaskEngine) checkTearDownPauseContainer(task *apitask.Task) {
+	for _, container := range task.Containers {


Instead of iterating through all containers, can we check whether the task is awsvpc-enabled and return early if it is not (GetENI == nil)?

Is it time or readability that would be improved? In terms of time, there can be a maximum of 10 containers, so we would gain a few nanoseconds :p

As a practice, I just like to return early wherever possible :) Also, the container count could be increased in the future.

Fair enough. changed.
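
The early return discussed in this thread could look roughly like the sketch below. All types and fields here (`Task`, `ENI`, `Container`, `IsPause`, `TornDown`) are simplified, hypothetical stand-ins, not the agent's real definitions, and the cleanup callback is a placeholder for the actual network teardown.

```go
package main

import "fmt"

// Status is a simplified, hypothetical container status.
type Status int

const (
	StatusRunning Status = iota
	StatusStopped
)

// ENI, Container, and Task are hypothetical stand-ins for the agent's types.
type ENI struct{ ID string }

type Container struct {
	Name          string
	IsPause       bool
	KnownStatus   Status
	DesiredStatus Status
	TornDown      bool
}

type Task struct {
	ENI        *ENI // nil when the task is not awsvpc-enabled
	Containers []*Container
}

// checkTearDownPauseContainer tears down the pause container network at most
// once, and only when the pause container's known or desired status is stopped.
func checkTearDownPauseContainer(task *Task, cleanup func(*Container)) {
	// Return early: a task without an ENI has no pause container network.
	if task.ENI == nil {
		return
	}
	for _, c := range task.Containers {
		if !c.IsPause {
			continue
		}
		stopped := c.KnownStatus == StatusStopped || c.DesiredStatus == StatusStopped
		if stopped && !c.TornDown {
			c.TornDown = true
			cleanup(c)
		}
		return // the pause container was found, no need to keep iterating
	}
}

func main() {
	task := &Task{
		ENI: &ENI{ID: "eni-hypothetical"},
		Containers: []*Container{
			{Name: "app", KnownStatus: StatusStopped},
			{Name: "pause", IsPause: true, KnownStatus: StatusStopped},
		},
	}
	report := func(c *Container) { fmt.Println("tearing down network for", c.Name) }

	checkTearDownPauseContainer(task, report) // performs the teardown
	checkTearDownPauseContainer(task, report) // no-op: flag is already set
}
```

The ENI check costs nothing for awsvpc tasks and lets every other task skip the loop entirely, which matches the "return early" preference voiced above.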

angelcar added the bot/test label Mar 2, 2021

amazon-ecs-bot removed the bot/test label Mar 2, 2021

angelcar force-pushed the angelcar_pause_container_cleanup_fix branch 4 times, most recently from 8f97f8b to 23db43c Compare March 2, 2021 18:34

angelcar added the bot/test label Mar 2, 2021

amazon-ecs-bot removed the bot/test label Mar 2, 2021

sharanyad reviewed Mar 2, 2021

Fix bug where pause container was not always cleaned up

4229491

angelcar force-pushed the angelcar_pause_container_cleanup_fix branch from 23db43c to 4229491 Compare March 2, 2021 19:03

sharanyad approved these changes Mar 2, 2021

mythri-garaga approved these changes Mar 2, 2021

shubham2892 approved these changes Mar 2, 2021

angelcar added the bot/test label Mar 2, 2021

amazon-ecs-bot removed the bot/test label Mar 2, 2021

angelcar merged commit 89e9cb0 into aws:dev Mar 2, 2021

angelcar deleted the angelcar_pause_container_cleanup_fix branch March 2, 2021 21:43

angelcar mentioned this pull request Mar 26, 2021

Fix bug that could incorrectly clean up pause container before other containers #2838

Merged
