Always run DeleteVeth on cleanup #101

Merged
merged 2 commits into master from always-del-veth on Jul 18, 2022
Conversation

sparrc
Contributor

@sparrc sparrc commented Jul 18, 2022

Summary

Prior to this change, we would skip running DeleteVeth on cleanup if the
CNI plugins were not provided with an IPAM config to delete.

The agent codebase, however, intentionally omits the IPAM config when it is
cleaning up only a single task; see the includeIPAMConfig bool in
https://github.com/aws/amazon-ecs-agent/blame/71a0f6765cc469913a0b09728196feb0d0d33e37/agent/engine/docker_task_engine.go#L1779
and https://github.com/aws/amazon-ecs-agent/blame/71a0f6765cc469913a0b09728196feb0d0d33e37/agent/engine/docker_task_engine.go#L1760

With this change, when we skip the IPAM delete we no longer return
immediately from the del function. Instead, the code falls through to the
engine.DeleteVeth call, ensuring that veth interfaces are always cleaned up
when a task is cleaned up.

This change should prevent "exchange full" errors, which can occur when awsvpc
tasks are not cleaned up properly and the container instance accumulates 1023
veth interfaces on its ecs-bridge bridge (the bridge used by awsvpc
network mode tasks).

Testing

New tests cover the changes. The functional test suite was also run
manually on staging/testing ECS clusters.

Description for the changelog

Bugfix: Always run DeleteVeth on cleanup, fixes veth "exchange full" errors

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Member

@fierlion fierlion left a comment

Overall note -- it would be helpful to have more detail about the testing done which was specific to this change (if any).

detailLog("Running IPAM plugin DEL", args, conf, "")
err = engine.RunIPAMPluginDel(conf.IPAM.Type, args.StdinData)
if err != nil {
    return err
}
Contributor

@yinyic yinyic Jul 18, 2022


Do we want to log a warning here and continue on with veth deletion? I believe IPAM resource management is done separately from the actual veth management, and even if one of them fails we can still attempt the other.

Contributor Author


That's a good point. I'm not sure under what conditions IPAM delete will fail, but I can change it to continue to the veth delete regardless.

@sparrc sparrc force-pushed the always-del-veth branch from 6b437c1 to 9dcbfe2 Compare July 18, 2022 18:39
@sparrc sparrc merged commit db58647 into master Jul 18, 2022
sparrc added a commit to sparrc/amazon-ecs-agent that referenced this pull request Jul 18, 2022
Pulling in cni plugins change for fixing veth cleanup behavior. See
aws/amazon-ecs-cni-plugins#101 for more
details.
err = engine.RunIPAMPluginDel(conf.IPAM.Type, args.StdinData)
if err != nil {
    return err
}
detailLogInfo("IPAM configuration not found, skip DEL for IPAM", args, conf, "")
Contributor


We are leaking the task IP address when taking this path.

I'm not sure this is the right fix. It probably doesn't cause additional harm, but the log level should be higher than Info: leaking the IP address is a serious failure.

Contributor Author


Agreed, but I think in that case there are two issues within the del function. This PR fixes one of the two (skipping VethDelete). As for the other issue, it seems we need to align on what makes sense for the agent to pass to the CNI plugins.

That being said, it's still not clear to me how the ipam cni plugin fits into this, since it never logs anything.

sparrc added a commit to aws/amazon-ecs-agent that referenced this pull request Jul 19, 2022
Pulling in cni plugins change for fixing veth cleanup behavior. See
aws/amazon-ecs-cni-plugins#101 for more
details.