
@abhi (Contributor) commented Jun 28, 2017

This PR contains a fix for moby/moby#30321. An earlier PR, moby/moby#31142, tried to fix the issue by adding a delay between disabling the service in the cluster and shutting down the tasks; however, disabling the service did not delete the service info from the cluster. This change deletes the service info from the cluster, and was verified with siege to ensure there is zero downtime on a rolling update of a service.

Signed-off-by: Abhinandan Prativadi [email protected]

sandbox.go Outdated
func (sb *sandbox) DisableService() error {
	logrus.Debugf("DisableService %s START", sb.containerID)
	for _, ep := range sb.getConnectedEndpoints() {
		if err := ep.deleteServiceInfoFromCluster(sb, "DisableService"); err != nil {


This logic was intentionally removed to avoid a race with sbLeave. So far, sbLeave is the only code path that actually cleans up name resolution and the load balancer.

@abhi (Contributor Author)

Shouldn't service disabling be independent of sbLeave? Removing an endpoint from a network that is part of a sandbox should be decoupled from disabling the service; that helps in make-before-break scenarios. If the two are tightly coupled we might end up with issues like the one above. The same applies to enabling a service.

@sanimej commented Jun 28, 2017

@fcrisciani That explains why the moby #31142 patch doesn't help in 17.06.

A valid use case shouldn't be broken to fix the race. In this case we can avoid the race by checking the serviceEnabled flag; in other words, deleteServiceInfoFromCluster will be called only once per endpoint.
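For illustration, a minimal sketch of the guard described here; the type and field names are stand-ins, not libnetwork's actual code. Whichever of DisableService or sbLeave reaches the endpoint first does the cleanup, and the second call becomes a no-op:

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is a stand-in for the real libnetwork endpoint.
type endpoint struct {
	mu             sync.Mutex
	name           string
	serviceEnabled bool
}

// deleteServiceInfoFromCluster checks and clears serviceEnabled under the
// lock, so the cluster cleanup runs at most once per endpoint.
func (ep *endpoint) deleteServiceInfoFromCluster(method string) error {
	ep.mu.Lock()
	if !ep.serviceEnabled {
		ep.mu.Unlock()
		return nil // the other code path already cleaned up
	}
	ep.serviceEnabled = false
	ep.mu.Unlock()

	// ... the real code would remove the service/load-balancer records here ...
	fmt.Printf("removed service info for %s (via %s)\n", ep.name, method)
	return nil
}

func main() {
	ep := &endpoint{name: "web.1", serviceEnabled: true}
	_ = ep.deleteServiceInfoFromCluster("DisableService")
	_ = ep.deleteServiceInfoFromCluster("sbLeave") // no-op: flag already cleared
}
```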


@sanimej I'm not saying it's wrong. I'm saying we have to test it properly against all the other cases we worked on in the past weeks, to be sure we don't reintroduce races here.

@abhi (Contributor Author)

Agreed on that. From what I see here, the service is disabled before the task shutdown. Was the race condition seen on the remote nodes when this happens? Just making sure I incorporate that scenario in my testing.
With the current code we are basically adding a 2-second delay that does nothing, so we have increased our convergence time.


The getConnectedEndpoints call on the sandbox should return only the endpoints that are locally deployed, so conceptually this should work fine. But is serialization between EnableService/DisableService and CreateEndpoint guaranteed? That is, could DisableService be called while a CreateEndpoint is in progress, so that the endpoint is added to the service after it has been disabled?

@abhi (Contributor Author)

getConnectedEndpoints is a synchronized operation, so I wouldn't expect it to change around either of those operations.
Enabling a service happens after endpoint creation, and only then is the serviceEnabled flag set (I see a problem in the old code; I will correct that as well). Disabling a service is done on task shutdown, before deleteEndpoint.
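As an illustration of why that matters, here is a minimal sketch (illustrative types, not the actual libnetwork code) of a getConnectedEndpoints-style accessor that copies the endpoint list under the sandbox lock, so EnableService/DisableService iterate over a stable snapshot:

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint and sandbox are stand-ins for the real libnetwork types.
type endpoint struct{ name string }

type sandbox struct {
	mu        sync.Mutex
	endpoints []*endpoint
}

// getConnectedEndpoints returns a copy of the attached endpoints, taken while
// holding the sandbox lock, so concurrent joins/leaves cannot mutate the
// slice a caller is iterating over.
func (sb *sandbox) getConnectedEndpoints() []*endpoint {
	sb.mu.Lock()
	defer sb.mu.Unlock()
	eps := make([]*endpoint, len(sb.endpoints))
	copy(eps, sb.endpoints)
	return eps
}

func main() {
	sb := &sandbox{endpoints: []*endpoint{{name: "ep0"}, {name: "ep1"}}}
	for _, ep := range sb.getConnectedEndpoints() {
		fmt.Println("connected endpoint:", ep.name)
	}
}
```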


@abhinandanpb serviceEnabled is set before the addServiceInfoToCluster call, and it is reset if the call fails. Can you clarify what the problem in the current code is?

agent.go Outdated

func (ep *endpoint) deleteServiceInfoFromCluster(sb *sandbox, method string) error {

	if !ep.serviceEnabled {


I don't see the serviceEnabled flag being saved into the store in the Marshal of the endpoint; are we sure this value makes sense, then?

@abhi (Contributor Author)

It should only be part of the local store and in memory, right? It need not be propagated to the store, so marshalling and unmarshalling may not be needed? getEndpoints returns endpoints from memory.


The possibility of concurrent calls to deleteServiceInfoFromCluster comes from sb.DisableService (which calls getConnectedEndpoints) and from ep.Leave(). The latter fetches the endpoint from the store, so the serviceEnabled field has to be marshalled. Also, after setting it, the ep has to be updated in the store.
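A minimal sketch of what this implies, assuming a map-based JSON marshal in the style used elsewhere in libnetwork (the field and type names here are illustrative): the flag must survive a store round trip so that a copy fetched by ep.Leave() still sees it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpoint is a stand-in for the real libnetwork endpoint.
type endpoint struct {
	name           string
	serviceEnabled bool
}

// MarshalJSON persists serviceEnabled so that a copy read back from the
// datastore (e.g. by ep.Leave()) still knows the service was enabled.
func (ep *endpoint) MarshalJSON() ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"name":           ep.name,
		"serviceEnabled": ep.serviceEnabled,
	})
}

// UnmarshalJSON restores the flag when the endpoint is loaded from the store.
func (ep *endpoint) UnmarshalJSON(b []byte) error {
	var m map[string]interface{}
	if err := json.Unmarshal(b, &m); err != nil {
		return err
	}
	ep.name, _ = m["name"].(string)
	ep.serviceEnabled, _ = m["serviceEnabled"].(bool)
	return nil
}

func main() {
	orig := &endpoint{name: "web.1", serviceEnabled: true}
	data, _ := json.Marshal(orig)

	var fromStore endpoint
	_ = json.Unmarshal(data, &fromStore)
	fmt.Printf("after store round trip: serviceEnabled=%v\n", fromStore.serviceEnabled)
}
```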


@abhi force-pushed the rolling_update branch 4 times, most recently from 87a975e to af2c60c, on June 29, 2017 18:13
@abhi force-pushed the rolling_update branch from af2c60c to 209a12e on July 7, 2017 05:32
@mvdstam commented Jul 19, 2017

@thaJeztah Any chance to get this merged in so we can close #30321?

@thaJeztah (Member)
@mvdstam I'm not a maintainer in this repository, but I can try asking what the status is on this one

@abhi (Contributor Author) commented Jul 19, 2017

ping @mavenugo @fcrisciani @sanimej

@mvdstam commented Jul 19, 2017

@thaJeztah Thanks, sorry for pulling you in on this one. 😉

agent.go Outdated
}

func (ep *endpoint) addServiceInfoToCluster(sb *sandbox) error {


can you remove these 2 extra spaces?

sandbox.go Outdated
	store := n.getController().getStore(ep.DataScope())

	if store == nil {
		return fmt.Errorf("store not found for scope %s on disable service on endpoint:%s", ep.DataScope(), ep.Name())


Should say "on enable service" here.

sandbox.go Outdated
}

	if err := store.GetObject(datastore.Key(ep.Key()...), ep); err != nil {
		return fmt.Errorf("could not update the kvobject to latest on endpoint count update: %v", err)


endpoint count update?

sandbox.go Outdated
	enabledServices = true

	for {
		if err := n.getController().updateToStore(ep); err == nil || err != datastore.ErrKeyModified {


I wonder if retrying on ErrKeyModified can be a problem. Imagine the following scenario:
enableService
sbLeave
They both get executed in order, but then they start racing on the store: it is possible that sbLeave writes before enableService, so enableService will hit the conflict, retry, and leave the serviceEnabled flag set to true. Makes sense?
I think that when the key got modified we may just have to give up, or we need another flag on the endpoint to tell whether sbLeave already happened, so we know there is nothing left to do.
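A minimal sketch of that concern, under the assumption of a compare-and-set style store (all names below are illustrative, not libnetwork's API): instead of blindly retrying on a version conflict, the loop re-reads the latest record and bails out if a concurrent sbLeave already tore the endpoint down, so the retry cannot resurrect serviceEnabled=true.

```go
package main

import (
	"errors"
	"fmt"
)

var errKeyModified = errors.New("key was modified concurrently")

// endpointRecord is a stand-in for the endpoint state kept in the datastore.
type endpointRecord struct {
	version        int
	left           bool // set by sbLeave when the sandbox has left the network
	serviceEnabled bool
}

type store struct{ current endpointRecord }

func (s *store) get() endpointRecord { return s.current }

// update succeeds only if the caller saw the latest version (CAS semantics).
func (s *store) update(rec endpointRecord) error {
	if rec.version != s.current.version {
		return errKeyModified
	}
	rec.version++
	s.current = rec
	return nil
}

// enableService retries on conflicts, but re-reads the record each time and
// gives up if sbLeave already happened, instead of force-writing a stale copy.
func enableService(s *store) error {
	for attempt := 0; attempt < 3; attempt++ {
		rec := s.get()
		if rec.left {
			return nil // sandbox already left; re-enabling would be wrong
		}
		rec.serviceEnabled = true
		err := s.update(rec)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errKeyModified) {
			return err
		}
		// Conflict: loop and re-evaluate against the newer record.
	}
	return errors.New("giving up after repeated store conflicts")
}

func main() {
	s := &store{}
	fmt.Println(enableService(s), s.get().serviceEnabled)
}
```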

sandbox.go Outdated
	// If there is an error while enabling service on any endpoint
	// use defer to disable the service on the sandbox before returning
	// the error.
	defer func(enabledServices *bool) {


why not use isServiceEnabled?

@mvdstam commented Aug 1, 2017

@abhinandanpb @fcrisciani Hey guys, any chance you can take a look at this? Getting this merged soon would be really awesome; can't wait to have true zero-downtime deployments with Docker Swarm. 😃

Thanks!

sandbox.go Outdated

	n, err := ep.getNetworkFromStore()
	if err != nil {
		logrus.Warnf("could not enable service on sandbox:%s,endpoint:%s,err: %v", sb.ID(), ep.Name(), err)


This log seems a little misleading; in reality the service got properly enabled, but getNetworkFromStore is failing, correct?

sandbox.go Outdated
return fmt.Errorf("could not update state for endpoint %s into cluster: %v", ep.Name(), err)
}
// enable service on the endpoint copy in the sandbox
ep.enableService(true)


This also happens at the end; do we need it twice?

@abhi (Contributor Author)

Looks like I deleted a line. One call was to update the store and the other to update the in-memory copy. Let me rework this and test it out.

sandbox.go Outdated
	// Write the ep copy to the store
	ep.enableService(true)
	if err := n.getController().updateToStore(ep); err != nil {
		break


Should we add a warning here? Also, why do we break and not continue in this case?


Also, when we break here the service info has already been pushed to the cluster, but the flag will remain false, so no disable-service cleanup would happen, not even in the case of sbLeave.

sandbox.go Outdated
		continue
	}

	store := n.getController().getStore(ep.DataScope())


This looks unnecessary; the same logic is already replicated inside updateToStore():

cs := c.getStore(kvObject.DataScope())
	if cs == nil {
		return ErrDataStoreNotInitialized(kvObject.DataScope())
	}

sandbox.go Outdated

	ep.enableService(false)

	store := n.getController().getStore(ep.DataScope())


same here, this is done inside the updateToStore()

sandbox.go Outdated
	// enable service on the endpoint copy in the sandbox
	ep.enableService(true)

	n, err := ep.getNetworkFromStore()


I would move this to the top, before the addServiceInfoToCluster call, for the same reason: if we cannot save the flag then we cannot activate the service.
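A minimal sketch of the ordering suggested here, with illustrative names rather than libnetwork's actual API: persist the flag first, and only then announce the service to the cluster, rolling the flag back if the announcement fails.

```go
package main

import (
	"errors"
	"fmt"
)

// endpoint and controller are stand-ins for the real libnetwork types.
type endpoint struct {
	name           string
	serviceEnabled bool
}

func (ep *endpoint) enableService(v bool) { ep.serviceEnabled = v }

type controller struct {
	storeFails   bool
	clusterFails bool
}

func (c *controller) updateToStore(ep *endpoint) error {
	if c.storeFails {
		return errors.New("store unavailable")
	}
	return nil
}

func (c *controller) addServiceInfoToCluster(ep *endpoint) error {
	if c.clusterFails {
		return errors.New("cluster update failed")
	}
	return nil
}

func enableService(c *controller, ep *endpoint) error {
	// Save the flag first: if this fails, the service is never activated.
	ep.enableService(true)
	if err := c.updateToStore(ep); err != nil {
		ep.enableService(false)
		return fmt.Errorf("could not persist service state for %s: %v", ep.name, err)
	}
	// Only then advertise the service to the cluster; undo on failure so the
	// persisted flag never claims a service that was not announced.
	if err := c.addServiceInfoToCluster(ep); err != nil {
		ep.enableService(false)
		_ = c.updateToStore(ep) // best-effort rollback of the persisted flag
		return fmt.Errorf("could not update state for endpoint %s into cluster: %v", ep.name, err)
	}
	return nil
}

func main() {
	ep := &endpoint{name: "web.1"}
	fmt.Println(enableService(&controller{}, ep), ep.serviceEnabled)
}
```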

@mvdstam commented Aug 6, 2017

Hey @abhinandanpb, your patch has been confirmed to definitely fix moby/moby#30321. This means that true zero-downtime deployments with Docker Swarm Mode will finally be possible; this is really awesome! Thanks for your work and hope to see it released shortly. 👍

@hjdr4 commented Aug 7, 2017

Hello guys,

The patch idea is good, but the current implementation is flawed.

I tried to make a proper fix for this based on the discussions: hjdr4@234cdaa

I don't want to add noise to the process, but I would appreciate this bug being fixed, because it is a show stopper for production usage of Swarm. 17.06.1 is on the way; having the fix merged would be nice (I may be dreaming, I don't know what can or cannot go into minor updates).

I hope that helps.

@abhi (Contributor Author) commented Aug 7, 2017

@hjdr4 thanks for the patch. That is not the right one either; we need the copy from the store so that we work on the latest ep object. I have the fix, and I am going to update the PR with a few test cases to ensure corner cases are covered.
@mvdstam on it.

sirlatrom referenced this pull request in moby/swarmkit Aug 7, 2017
orchestrator/update: Only shut down old tasks on success
@hjdr4 commented Aug 7, 2017

addServiceInfoToCluster() calls ep.getNetwork(), so it seemed correct to use the same network object across the whole interface call. I have only just dug into the code, so I'm probably wrong (and I couldn't find any docs on the store design). The good news is that people are working on the subject. I'll wait for the proper patch release. Thank you!

@mavenugo (Contributor) commented Aug 9, 2017

@hjdr4 as @abhinandanpb suggested, we have a few issues with the store/caching layer and the way endpoints are handled in the layers above it. Hence it is safer not to rely on specific object references. We do have an open item to address this issue.

// hit for an endpoint while disabling the service.
func (sb *sandbox) DisableService() (err error) {
	logrus.Debugf("DisableService %s START", sb.containerID)
	failedEps := []string{}


This variable looks like it is not used in the for loop below.

@fcrisciani
Did we close the discussion about the store?

@mvdstam commented Aug 25, 2017

@abhinandanpb @fcrisciani @mavenugo @hjdr4

Hi guys, any news on this? Any chance we could get this merged soon? Thanks!

@abhi (Contributor Author) commented Aug 25, 2017

@mvdstam apologies for the delay. We are testing out a few scenarios; some race conditions have been fixed lately, so in that context we are making sure we don't introduce another issue while fixing this one.

@mvdstam commented Aug 25, 2017

@abhinandanpb Thanks for the quick response, I understand. Just checking if this is still on the radar. 😃

Thanks!

@fcrisciani
@abhinandanpb I was wondering if there is a way to change the container orchestrator to guarantee that the disable service call happens under any condition. If that is possible, we can remove the logic from sbLeave, keep it only in the disable service path, and also avoid using the database. WDYT?

@sirlatrom
Is there anything I as an outsider can do to help progress this PR? Since we're running Docker Enterprise Edition, we cannot 'simply' compile dockerd with this PR in it, as we will then not be able to get support. We've already tried to get more attention to this issue through our support plan.

@sirlatrom
PS: From what I can see, the merge conflict is really very simple to resolve, as it's just two lines with independent changes that just happen to be at the same lines.

@mvdstam commented Sep 15, 2017

I totally agree with @sirlatrom. Again: I'd be happy to provide more information with tests and examples as I've done throughout moby/moby#30321. Hope to see some movement in this PR soon; we really need to be able to perform rolling updates without service interruption.

Ping @abhinandanpb @fcrisciani

@fcrisciani
Hey guys, sorry for the delay. @abhinandanpb was OOO at the Moby Summit; I think he will take care of it next week when he is back.

@andrewhsu (Contributor)
@abhinandanpb any update on this?

@abhi (Contributor Author) commented Sep 25, 2017

@andrewhsu I will be working on this PR this week. We have decided to make changes both in moby/moby and in libnetwork.

@sirlatrom
@abhi @andrewhsu That sounds great! Is there an issue and/or a PR in moby/moby that I can track for progress and/or help out with reproduction and the like?

@mvdstam commented Oct 12, 2017

@andrewhsu @abhi Has this been worked on yet? Can we expect moby/moby#30321 to be fixed with the next release?

@YarekTyshchenko
Can confirm that this fixes downtime issues in swarm deployments for us too. Debian 8.9, patched Docker 17.06

@codecov-io commented Jan 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a1dfea3).
The diff coverage is 0%.


@@            Coverage Diff            @@
##             master    #1824   +/-   ##
=========================================
  Coverage          ?   40.02%           
=========================================
  Files             ?      138           
  Lines             ?    22146           
  Branches          ?        0           
=========================================
  Hits              ?     8863           
  Misses            ?    11986           
  Partials          ?     1297
Impacted Files Coverage Δ
sandbox.go 40.62% <0%> (ø)
endpoint.go 53.93% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a1dfea3...4aeb1fc.

sandbox.go Outdated
	ep.enableService(false)
	return fmt.Errorf("could not update state for endpoint %s into cluster: %v", ep.Name(), err)
	if !ep.isServiceEnabled() {
		n, err = ep.getNetworkFromStore()


What does the store add here? Can we keep the original logic that uses the *Endpoint coming from the map?

@abhi (Contributor Author)

The store doesn't add anything; I am following the convention of operating on the store object.

		return err
	}

	if e := ep.deleteServiceInfoFromCluster(sb, "sbLeave"); e != nil {


Without this one, will the network disconnect still work?

@abhi (Contributor Author) commented Jan 9, 2018

Check the corresponding moby/moby#35960 PR for this.

sandbox.go Outdated
	ep.enableService(false)
	if ep.isServiceEnabled() {
		n, err := ep.getNetworkFromStore()
		if err != nil {


same here, do we need the store data?

This PR contains a fix for moby/moby#30321. There was a moby/moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service.

Signed-off-by: abhi <[email protected]>
@fcrisciani left a comment

LGTM

@fcrisciani merged commit 315a076 into moby:master on Jan 9, 2018
abhi added a commit to abhi/docker that referenced this pull request Jan 17, 2018
This PR contains a fix for moby#30321. There was a moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service. In order
to support it and ensure consistency of the enable and disable service knob
from the daemon, we need to ensure we disable service when we release
the network from the container. This helps in making the enable and
disable service less racy. The corresponding part of libnetwork fix is
part of moby/libnetwork#1824

Signed-off-by: abhi <[email protected]>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this pull request Jan 18, 2018
This PR contains a fix for moby/moby#30321. There was a moby/moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service. In order
to support it and ensure consistency of the enable and disable service knob
from the daemon, we need to ensure we disable service when we release
the network from the container. This helps in making the enable and
disable service less racy. The corresponding part of libnetwork fix is
part of moby/libnetwork#1824

Signed-off-by: abhi <[email protected]>
Upstream-commit: a042e5a
Component: engine