Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry creating dynamic networks if not found #25550

Merged
merged 1 commit into from
Aug 10, 2016
Merged

Conversation

mrjana
Copy link
Contributor

@mrjana mrjana commented Aug 9, 2016

In cases there are failures in task start, swarmkit might be trying to
restart the task again in the same node which might keep failing. This
creates a race where when a failed task is getting removed it might
remove the associated network while another task for the same service
or a different service but connected to the same network is proceeding
with starting the container knowing that the network is still
present. Fix this by reacting to ErrNoSuchNetwork error during
container start by trying to recreate the managed networks. If they
have been removed it will be recreated. If they are already present
nothing bad will happen.

Fixes #25528

Signed-off-by: Jana Radhakrishnan [email protected]

In cases there are failures in task start, swarmkit might be trying to
restart the task again in the same node which might keep failing. This
creates a race where when a failed task is getting removed it might
remove the associated network while another task for the same service
or a different service but connected to the same network is proceeding
with starting the container knowing that the network is still
present. Fix this by reacting to `ErrNoSuchNetwork` error during
container start by trying to recreate the managed networks. If they
have been removed it will be recreated. If they are already present
nothing bad will happen.

Signed-off-by: Jana Radhakrishnan <[email protected]>
@tonistiigi
Copy link
Member

Why do we allow removing networks that are used by existing (stopped) containers?

@mrjana
Copy link
Contributor Author

mrjana commented Aug 9, 2016

No, we don't remove networks for stopped containers. But some of them do get removed(when there are continuous failures) and when they get removed we do remove the networks. That is when this issue happens.

@tonistiigi
Copy link
Member

@mrjana If that is the case then I think this check could be in create instead of start but with manual testing I was able to remove a network that was being used and only get an error on container start.

@mrjana
Copy link
Contributor Author

mrjana commented Aug 10, 2016

@tonistiigi Yeah you don't get an error in create because there is no network setup involved during create. The error only happens during start because that is where the network setup happens and that is why the check is there.

@tonistiigi
Copy link
Member

LGTM

return err
}

continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to change this now, but it's typically somewhere i would just use a goto :)

@tiborvass
Copy link
Contributor

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants