
@abhi (Contributor) commented Jun 28, 2017

This PR contains a fix for moby/moby#30321. An earlier PR, moby/moby#31142, tried to fix the issue by adding a delay between disabling the service in the cluster and shutting down the tasks; however, disabling the service did not delete the service info from the cluster. This change deletes the service info from the cluster, and was verified with siege to ensure there is zero downtime on a rolling update of a service.

Signed-off-by: Abhinandan Prativadi [email protected]

sandbox.go Outdated
func (sb *sandbox) DisableService() error {
	logrus.Debugf("DisableService %s START", sb.containerID)
	for _, ep := range sb.getConnectedEndpoints() {
		if err := ep.deleteServiceInfoFromCluster(sb, "DisableService"); err != nil {


This logic was intentionally removed to avoid a race with sbLeave. So far, sbLeave is the only code path that actually cleans up name resolution and the load balancer.

@abhi (Contributor Author)

Shouldn't service disabling be independent of sbLeave? Removing an endpoint from a network that is part of a sandbox should be decoupled from disabling the service; that helps in make-before-break scenarios. If the two are tightly coupled we might end up with issues like the one above. The same applies to enabling a service.

@sanimej commented Jun 28, 2017

@fcrisciani That explains why the moby #31142 patch doesn't help in 17.06.

A valid use case shouldn't be broken to fix the race. In this case we can avoid the race by checking the serviceEnabled flag; in other words, deleteServiceInfoFromCluster will be called only once per endpoint.
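For illustration, a minimal sketch of the guard described here; the type and field names are stand-ins, not libnetwork's actual code. Whichever of DisableService or sbLeave reaches the endpoint first does the cleanup, and the second call becomes a no-op:

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is a stand-in for the real libnetwork endpoint.
type endpoint struct {
	mu             sync.Mutex
	name           string
	serviceEnabled bool
}

// deleteServiceInfoFromCluster checks and clears serviceEnabled under the
// lock, so the cluster cleanup runs at most once per endpoint.
func (ep *endpoint) deleteServiceInfoFromCluster(method string) error {
	ep.mu.Lock()
	if !ep.serviceEnabled {
		ep.mu.Unlock()
		return nil // the other code path already cleaned up
	}
	ep.serviceEnabled = false
	ep.mu.Unlock()

	// ... the real code would remove the service/load-balancer records here ...
	fmt.Printf("removed service info for %s (via %s)\n", ep.name, method)
	return nil
}

func main() {
	ep := &endpoint{name: "web.1", serviceEnabled: true}
	_ = ep.deleteServiceInfoFromCluster("DisableService")
	_ = ep.deleteServiceInfoFromCluster("sbLeave") // no-op: flag already cleared
}
```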


@sanimej I'm not saying it's wrong. I'm saying we have to test it properly against all the other cases we worked on in the past weeks, to be sure we don't reintroduce races here.

@abhi (Contributor Author)

Agreed on that. From what I see here, the service is disabled before the task shutdown. Was the race condition seen on the remote nodes when this happens? Just making sure I incorporate that scenario in my testing.
With the current code we are basically adding a 2-second delay that does nothing, so we have increased our convergence time.


The getConnectedEndpoints call on the sandbox should return only the endpoints that are locally deployed, so conceptually this should work fine. But is serialization between EnableService/DisableService and CreateEndpoint guaranteed? That is, could DisableService be called while a CreateEndpoint is in progress, so that the endpoint is added to the service after it has been disabled?

@abhi (Contributor Author)

getConnectedEndpoints is a synchronized operation, so I wouldn't expect it to change around either of those operations.
Enabling a service happens after endpoint creation, and only then is the serviceEnabled flag set (I see a problem in the old code; I will correct that as well). Disabling a service is done on task shutdown, before deleteEndpoint.
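As an illustration of why that matters, here is a minimal sketch (illustrative types, not the actual libnetwork code) of a getConnectedEndpoints-style accessor that copies the endpoint list under the sandbox lock, so EnableService/DisableService iterate over a stable snapshot:

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint and sandbox are stand-ins for the real libnetwork types.
type endpoint struct{ name string }

type sandbox struct {
	mu        sync.Mutex
	endpoints []*endpoint
}

// getConnectedEndpoints returns a copy of the attached endpoints, taken while
// holding the sandbox lock, so concurrent joins/leaves cannot mutate the
// slice a caller is iterating over.
func (sb *sandbox) getConnectedEndpoints() []*endpoint {
	sb.mu.Lock()
	defer sb.mu.Unlock()
	eps := make([]*endpoint, len(sb.endpoints))
	copy(eps, sb.endpoints)
	return eps
}

func main() {
	sb := &sandbox{endpoints: []*endpoint{{name: "ep0"}, {name: "ep1"}}}
	for _, ep := range sb.getConnectedEndpoints() {
		fmt.Println("connected endpoint:", ep.name)
	}
}
```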


@abhinandanpb serviceEnabled is set before the addServiceInfoToCluster call, and it is reset if the call fails. Can you clarify what the problem in the current code is?

agent.go Outdated

func (ep *endpoint) deleteServiceInfoFromCluster(sb *sandbox, method string) error {

	if !ep.serviceEnabled {


I don't see the serviceEnabled flag being saved into the store in the Marshal of the endpoint; are we sure this value makes sense, then?

@abhi (Contributor Author)

It should only be part of the local store and in memory, right? It need not be propagated to the store, so marshalling and unmarshalling may not be needed? getEndpoints returns endpoints from memory.


The possibility of concurrent calls to deleteServiceInfoFromCluster comes from sb.DisableService (which calls getConnectedEndpoints) and from ep.Leave(). The latter fetches the endpoint from the store, so the serviceEnabled field has to be marshalled. Also, after setting it, the ep has to be updated in the store.
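A minimal sketch of what this implies, assuming a map-based JSON marshal in the style used elsewhere in libnetwork (the field and type names here are illustrative): the flag must survive a store round trip so that a copy fetched by ep.Leave() still sees it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpoint is a stand-in for the real libnetwork endpoint.
type endpoint struct {
	name           string
	serviceEnabled bool
}

// MarshalJSON persists serviceEnabled so that a copy read back from the
// datastore (e.g. by ep.Leave()) still knows the service was enabled.
func (ep *endpoint) MarshalJSON() ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"name":           ep.name,
		"serviceEnabled": ep.serviceEnabled,
	})
}

// UnmarshalJSON restores the flag when the endpoint is loaded from the store.
func (ep *endpoint) UnmarshalJSON(b []byte) error {
	var m map[string]interface{}
	if err := json.Unmarshal(b, &m); err != nil {
		return err
	}
	ep.name, _ = m["name"].(string)
	ep.serviceEnabled, _ = m["serviceEnabled"].(bool)
	return nil
}

func main() {
	orig := &endpoint{name: "web.1", serviceEnabled: true}
	data, _ := json.Marshal(orig)

	var fromStore endpoint
	_ = json.Unmarshal(data, &fromStore)
	fmt.Printf("after store round trip: serviceEnabled=%v\n", fromStore.serviceEnabled)
}
```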


@abhi force-pushed the rolling_update branch 4 times, most recently from 87a975e to af2c60c, on June 29, 2017 18:13
@abhi force-pushed the rolling_update branch from af2c60c to 209a12e on July 7, 2017 05:32
@mvdstam commented Jul 19, 2017

@thaJeztah Any chance to get this merged in so we can close #30321?

@thaJeztah (Member)
@mvdstam I'm not a maintainer in this repository, but I can try asking what the status is on this one

@abhi (Contributor Author) commented Jul 19, 2017

ping @mavenugo @fcrisciani @sanimej

@mvdstam commented Jul 19, 2017

@thaJeztah Thanks, sorry for pulling you in on this one. 😉

agent.go Outdated
}

func (ep *endpoint) addServiceInfoToCluster(sb *sandbox) error {


can you remove these 2 extra spaces?

sandbox.go Outdated
	store := n.getController().getStore(ep.DataScope())

	if store == nil {
		return fmt.Errorf("store not found for scope %s on disable service on endpoint:%s", ep.DataScope(), ep.Name())


Should say "on enable service" here.

sandbox.go Outdated
}

	if err := store.GetObject(datastore.Key(ep.Key()...), ep); err != nil {
		return fmt.Errorf("could not update the kvobject to latest on endpoint count update: %v", err)


endpoint count update?

sandbox.go Outdated
	enabledServices = true

	for {
		if err := n.getController().updateToStore(ep); err == nil || err != datastore.ErrKeyModified {


I wonder if retrying on ErrKeyModified can be a problem. Imagine the following scenario:
enableService
sbLeave
They both get executed in order, but then they start racing on the store: it is possible that sbLeave writes before enableService, so enableService will hit the conflict, retry, and leave the serviceEnabled flag set to true. Makes sense?
I think that when the key got modified we may just have to give up, or we need another flag on the endpoint to tell whether sbLeave already happened, so we know there is nothing left to do.
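A minimal sketch of that concern, under the assumption of a compare-and-set style store (all names below are illustrative, not libnetwork's API): instead of blindly retrying on a version conflict, the loop re-reads the latest record and bails out if a concurrent sbLeave already tore the endpoint down, so the retry cannot resurrect serviceEnabled=true.

```go
package main

import (
	"errors"
	"fmt"
)

var errKeyModified = errors.New("key was modified concurrently")

// endpointRecord is a stand-in for the endpoint state kept in the datastore.
type endpointRecord struct {
	version        int
	left           bool // set by sbLeave when the sandbox has left the network
	serviceEnabled bool
}

type store struct{ current endpointRecord }

func (s *store) get() endpointRecord { return s.current }

// update succeeds only if the caller saw the latest version (CAS semantics).
func (s *store) update(rec endpointRecord) error {
	if rec.version != s.current.version {
		return errKeyModified
	}
	rec.version++
	s.current = rec
	return nil
}

// enableService retries on conflicts, but re-reads the record each time and
// gives up if sbLeave already happened, instead of force-writing a stale copy.
func enableService(s *store) error {
	for attempt := 0; attempt < 3; attempt++ {
		rec := s.get()
		if rec.left {
			return nil // sandbox already left; re-enabling would be wrong
		}
		rec.serviceEnabled = true
		err := s.update(rec)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errKeyModified) {
			return err
		}
		// Conflict: loop and re-evaluate against the newer record.
	}
	return errors.New("giving up after repeated store conflicts")
}

func main() {
	s := &store{}
	fmt.Println(enableService(s), s.get().serviceEnabled)
}
```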

sandbox.go Outdated
	// If there is an error while enabling service on any endpoint
	// use defer to disable the service on the sandbox before returning
	// the error.
	defer func(enabledServices *bool) {


why not use isServiceEnabled?

@mvdstam commented Aug 1, 2017

@abhinandanpb @fcrisciani Hey guys, any chance you can take a look at this? Getting this merged soon would be really awesome; can't wait to have true zero-downtime deployments with Docker Swarm. 😃

Thanks!

sandbox.go Outdated

	n, err := ep.getNetworkFromStore()
	if err != nil {
		logrus.Warnf("could not enable service on sandbox:%s,endpoint:%s,err: %v", sb.ID(), ep.Name(), err)


This log seems a little misleading; in reality the service got properly enabled, but getNetworkFromStore is failing, correct?

sandbox.go Outdated
return fmt.Errorf("could not update state for endpoint %s into cluster: %v", ep.Name(), err)
}
// enable service on the endpoint copy in the sandbox
ep.enableService(true)


This also happens at the end; do we need it twice?

@abhi (Contributor Author)

Looks like I deleted a line. One call was to update the store and the other to update the in-memory copy. Let me rework this and test it out.

sandbox.go Outdated
	// Write the ep copy to the store
	ep.enableService(true)
	if err := n.getController().updateToStore(ep); err != nil {
		break


Should we add a warning here? Also, why do we break and not continue in this case?


Also, when we break here the service info has already been pushed to the cluster, but the flag will remain false, so no disable-service cleanup would happen, not even in the case of sbLeave.

sandbox.go Outdated
		continue
	}

	store := n.getController().getStore(ep.DataScope())


This looks unnecessary; the same logic is already replicated inside updateToStore():

cs := c.getStore(kvObject.DataScope())
	if cs == nil {
		return ErrDataStoreNotInitialized(kvObject.DataScope())
	}

sandbox.go Outdated

	ep.enableService(false)

	store := n.getController().getStore(ep.DataScope())


same here, this is done inside the updateToStore()

sandbox.go Outdated
	// enable service on the endpoint copy in the sandbox
	ep.enableService(true)

	n, err := ep.getNetworkFromStore()


I would move this to the top, before the addServiceInfoToCluster call, for the same reason: if we cannot save the flag then we cannot activate the service.
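A minimal sketch of the ordering suggested here, with illustrative names rather than libnetwork's actual API: persist the flag first, and only then announce the service to the cluster, rolling the flag back if the announcement fails.

```go
package main

import (
	"errors"
	"fmt"
)

// endpoint and controller are stand-ins for the real libnetwork types.
type endpoint struct {
	name           string
	serviceEnabled bool
}

func (ep *endpoint) enableService(v bool) { ep.serviceEnabled = v }

type controller struct {
	storeFails   bool
	clusterFails bool
}

func (c *controller) updateToStore(ep *endpoint) error {
	if c.storeFails {
		return errors.New("store unavailable")
	}
	return nil
}

func (c *controller) addServiceInfoToCluster(ep *endpoint) error {
	if c.clusterFails {
		return errors.New("cluster update failed")
	}
	return nil
}

func enableService(c *controller, ep *endpoint) error {
	// Save the flag first: if this fails, the service is never activated.
	ep.enableService(true)
	if err := c.updateToStore(ep); err != nil {
		ep.enableService(false)
		return fmt.Errorf("could not persist service state for %s: %v", ep.name, err)
	}
	// Only then advertise the service to the cluster; undo on failure so the
	// persisted flag never claims a service that was not announced.
	if err := c.addServiceInfoToCluster(ep); err != nil {
		ep.enableService(false)
		_ = c.updateToStore(ep) // best-effort rollback of the persisted flag
		return fmt.Errorf("could not update state for endpoint %s into cluster: %v", ep.name, err)
	}
	return nil
}

func main() {
	ep := &endpoint{name: "web.1"}
	fmt.Println(enableService(&controller{}, ep), ep.serviceEnabled)
}
```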

@mvdstam commented Aug 6, 2017

Hey @abhinandanpb, your patch has been confirmed to definitely fix moby/moby#30321. This means that true zero-downtime deployments with Docker Swarm Mode will finally be possible; this is really awesome! Thanks for your work and hope to see it released shortly. 👍

@hjdr4 commented Aug 7, 2017

Hello guys,

The patch idea is good, but the current implementation is flawed.

I tried to make a proper fix for this based on the discussions: hjdr4@234cdaa

I don't want to add noise to the process, but I would appreciate this bug being fixed, because it is a show stopper for production usage of Swarm. 17.06.1 is on the way; having the fix merged would be nice (I may be dreaming, I don't know what can or cannot go into minor updates).

I hope that helps.

@abhi (Contributor Author) commented Aug 7, 2017

@hjdr4 thanks for the patch. That is not the right one either; we need the copy from the store so that we work on the latest ep object. I have the fix, and I am going to update the PR with a few test cases to ensure corner cases are covered.
@mvdstam on it.

sirlatrom referenced this pull request in moby/swarmkit Aug 7, 2017
orchestrator/update: Only shut down old tasks on success
@hjdr4 commented Aug 7, 2017

addServiceInfoToCluster() calls ep.getNetwork(), so it seemed correct to use the same network object across the whole interface call. I have only just dug into the code, so I'm probably wrong (and I couldn't find any docs on the store design). The good news is that people are working on the subject. I'll wait for the proper patch release. Thank you!

@mavenugo (Contributor) commented Aug 9, 2017

@hjdr4 as @abhinandanpb suggested, we have a few issues with the store/caching layer and the way endpoints are handled in the layers above it. Hence it is safer not to rely on specific object references. We do have an open item to address this issue.

// hit for an endpoint while disabling the service.
func (sb *sandbox) DisableService() (err error) {
	logrus.Debugf("DisableService %s START", sb.containerID)
	failedEps := []string{}


This variable looks like it is not used in the for loop below.

@fcrisciani
Did we close the discussion about the store?

@mvdstam commented Aug 25, 2017

@abhinandanpb @fcrisciani @mavenugo @hjdr4

Hi guys, any news on this? Any chance we could get this merged soon? Thanks!

@abhi (Contributor Author) commented Aug 25, 2017

@mvdstam apologies for the delay. We are testing out a few scenarios; some race conditions have been fixed lately, so in that context we are making sure we don't introduce another issue while fixing this one.

@mvdstam commented Aug 25, 2017

@abhinandanpb Thanks for the quick response, I understand. Just checking if this is still on the radar. 😃

Thanks!

@fcrisciani
@abhinandanpb I was wondering if there is a way to change the container orchestrator to guarantee that the disable service call happens under any condition. If that is possible, we can remove the logic from sbLeave, keep it only in the disable service path, and also avoid using the database. WDYT?

@sirlatrom
Is there anything I as an outsider can do to help progress this PR? Since we're running Docker Enterprise Edition, we cannot 'simply' compile dockerd with this PR in it, as we will then not be able to get support. We've already tried to get more attention to this issue through our support plan.

@sirlatrom
PS: From what I can see, the merge conflict is really very simple to resolve, as it's just two lines with independent changes that just happen to be at the same lines.

@mvdstam commented Sep 15, 2017

I totally agree with @sirlatrom. Again: I'd be happy to provide more information with tests and examples as I've done throughout moby/moby#30321. Hope to see some movement in this PR soon; we really need to be able to perform rolling updates without service interruption.

Ping @abhinandanpb @fcrisciani

@fcrisciani
Hey guys, sorry for the delay. @abhinandanpb was OOO at the Moby Summit; I think he will take care of it next week when he is back.

@andrewhsu (Contributor)
@abhinandanpb any update on this?

@abhi (Contributor Author) commented Sep 25, 2017

@andrewhsu I will be working on this PR this week. We have decided to make changes both in moby/moby and in libnetwork.

@sirlatrom
@abhi @andrewhsu That sounds great! Is there an issue and/or a PR in moby/moby that I can track for progress and/or help out with reproduction and the like?

@mvdstam commented Oct 12, 2017

@andrewhsu @abhi Has this been worked on yet? Can we expect moby/moby#30321 to be fixed with the next release?

@YarekTyshchenko
Can confirm that this fixes downtime issues in swarm deployments for us too. Debian 8.9, patched Docker 17.06

@codecov-io commented Jan 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a1dfea3).
The diff coverage is 0%.


@@            Coverage Diff            @@
##             master    #1824   +/-   ##
=========================================
  Coverage          ?   40.02%           
=========================================
  Files             ?      138           
  Lines             ?    22146           
  Branches          ?        0           
=========================================
  Hits              ?     8863           
  Misses            ?    11986           
  Partials          ?     1297
Impacted Files Coverage Δ
sandbox.go 40.62% <0%> (ø)
endpoint.go 53.93% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a1dfea3...4aeb1fc.

sandbox.go Outdated
	ep.enableService(false)
	return fmt.Errorf("could not update state for endpoint %s into cluster: %v", ep.Name(), err)
	if !ep.isServiceEnabled() {
		n, err = ep.getNetworkFromStore()


What does the store add here? Can we keep the original logic that uses the *Endpoint coming from the map?

@abhi (Contributor Author)

The store doesn't add anything; I am following the convention of operating on the store object.

		return err
	}

	if e := ep.deleteServiceInfoFromCluster(sb, "sbLeave"); e != nil {


Without this one, will the network disconnect still work?

@abhi (Contributor Author) commented Jan 9, 2018

Check the corresponding moby/moby#35960 PR for this.

sandbox.go Outdated
	ep.enableService(false)
	if ep.isServiceEnabled() {
		n, err := ep.getNetworkFromStore()
		if err != nil {


same here, do we need the store data?

This PR contains a fix for moby/moby#30321. There was a moby/moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service.

Signed-off-by: abhi <[email protected]>
@fcrisciani left a comment

LGTM

@fcrisciani merged commit 315a076 into moby:master on Jan 9, 2018
abhi added a commit to abhi/docker that referenced this pull request Jan 17, 2018
This PR contains a fix for moby#30321. There was a moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service. In order
to support it and ensure consistency of the enable and disable service knob
from the daemon, we need to ensure we disable service when we release
the network from the container. This helps in making the enable and
disable service less racy. The corresponding part of libnetwork fix is
part of moby/libnetwork#1824

Signed-off-by: abhi <[email protected]>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this pull request Jan 18, 2018
This PR contains a fix for moby/moby#30321. There was a moby/moby#31142
PR intending to fix the issue by adding a delay between disabling the
service in the cluster and the shutdown of the tasks. However
disabling the service was not deleting the service info in the cluster.
Added a fix to delete service info from cluster and verified using siege
to ensure there is zero downtime on rolling update of a service. In order
to support it and ensure consistency of the enable and disable service knob
from the daemon, we need to ensure we disable service when we release
the network from the container. This helps in making the enable and
disable service less racy. The corresponding part of libnetwork fix is
part of moby/libnetwork#1824

Signed-off-by: abhi <[email protected]>
Upstream-commit: a042e5a
Component: engine