-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry creating dynamic networks if not found #25550
Conversation
In cases there are failures in task start, swarmkit might be trying to restart the task again in the same node which might keep failing. This creates a race where when a failed task is getting removed it might remove the associated network while another task for the same service or a different service but connected to the same network is proceeding with starting the container knowing that the network is still present. Fix this by reacting to `ErrNoSuchNetwork` error during container start by trying to recreate the managed networks. If they have been removed it will be recreated. If they are already present nothing bad will happen. Signed-off-by: Jana Radhakrishnan <[email protected]>
Why do we allow removing networks that are used by existing (stopped) containers? |
No, we don't remove networks for stopped containers. But some of them do get removed(when there are continuous failures) and when they get removed we do remove the networks. That is when this issue happens. |
@mrjana If that is the case then I think this check could be in |
@tonistiigi Yeah you don't get an error in create because there is no network setup involved during create. The error only happens during start because that is where the network setup happens and that is why the check is there. |
LGTM |
return err | ||
} | ||
|
||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to change this now, but it's typically somewhere i would just use a goto
:)
LGTM |
In cases there are failures in task start, swarmkit might be trying to
restart the task again in the same node which might keep failing. This
creates a race where when a failed task is getting removed it might
remove the associated network while another task for the same service
or a different service but connected to the same network is proceeding
with starting the container knowing that the network is still
present. Fix this by reacting to
ErrNoSuchNetwork
error duringcontainer start by trying to recreate the managed networks. If they
have been removed it will be recreated. If they are already present
nothing bad will happen.
Fixes #25528
Signed-off-by: Jana Radhakrishnan [email protected]