allocator: Fix panic when allocations happen at init time #1651

aaronlehmann · 2016-10-17T17:50:38Z

a.netCtx is initialized too late, so if allocations happen as part of
doNetworkInit, a nil pointer dereference will cause a panic.

Initialize a.netCtx earlier and use a.netCtx directly in member
functions instead of passing the network context separately, so there is
no confusion about which to use.

Also change allocator.go to have separate entries in the waitgroup for
initialization and actually running the allocator, and defer Done for
both. This should prevent a panic like this from leading to a deadlock,
since the deferred Done will be reached.
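
For illustration only, here is a minimal sketch of the waitgroup pattern described above. It is not the swarmkit code: `component`, `start`, `initFn`, and `runFn` are hypothetical names. It assumes initialization runs synchronously and the main loop runs in a goroutine, each under its own waitgroup entry with a deferred `Done`.

```go
package main

import "sync"

// component is a hypothetical stand-in for an allocator-like object
// whose shutdown path waits on wg.
type component struct {
	wg sync.WaitGroup
}

// start gives initialization and the run loop separate WaitGroup
// entries. Each Done is deferred, so the entry is released even if
// the function it guards panics.
func (c *component) start(initFn func() error, runFn func()) (err error) {
	c.wg.Add(1)
	func() {
		defer c.wg.Done() // reached on normal return and on panic
		err = initFn()
	}()
	if err != nil {
		return err
	}

	c.wg.Add(1)
	go func() {
		defer c.wg.Done() // reached when runFn returns or panics
		runFn()
	}()
	return nil
}
```

With this shape, a panic inside `initFn` still unwinds through the deferred `Done`, so a later `c.wg.Wait()` would not block forever on an entry that was never released.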

See moby/moby#25432

cc @mrjana @LK4D4 @tonistiigi

aaronlehmann · 2016-10-17T17:52:01Z

I think we should also make sure this scenario is covered in unit tests. I'm not very familiar with the tests for this part of the code, so it would be great if someone could point me in the right direction or submit a separate PR for test coverage.

LK4D4 · 2016-10-17T17:59:21Z

LGTM

mrjana · 2016-10-17T18:19:57Z

manager/allocator/network.go

@@ -81,6 +81,7 @@ func (a *Allocator) doNetworkInit(ctx context.Context) error {
 		unallocatedNetworks: make(map[string]*api.Network),
 		ingressNetwork:      newIngressNetwork(),
 	}
+	a.netCtx = nc


The reason a.netCtx is initialized at the end is to make sure that, in case of failure, we don't return from doNetworkInit with a netCtx that shouldn't be there in the Allocator. This matters because the Allocator has a longer lifetime than doNetworkInit itself.

So in all of doNetworkInit I've just passed the netCtx as an argument to the functions that need it. Maybe we should do the same for taskCreateNetworkAttachments, i.e., add netCtx as an argument?

aaronlehmann · 2016-10-17

Sounds kind of dangerous TBH. What about a deferred closure in doNetworkInit that clears a.netCtx if an error is being returned?

mrjana · 2016-10-17

OK, I think a deferred closure for error handling should satisfy that requirement as well. I am good with that.

codecov-io · 2016-10-17T18:37:21Z

Current coverage is 56.54% (diff: 69.04%)

Merging #1651 into master will decrease coverage by 0.13%

@@             master      #1651   diff @@
==========================================
  Files            90         90          
  Lines         14551      14552     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           8248       8229    -19   
- Misses         5214       5233    +19   
- Partials       1089       1090     +1

Powered by Codecov. Last update 6179fcf...a8066c1

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

aaronlehmann · 2016-10-17T20:45:47Z

@mrjana: Updated to add a closure that clears a.netCtx in the error case, and removed networkContext parameters from the functions so there's no confusion over whether to use the argument or a.netCtx.
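
As a rough sketch of the approach described in this comment (not the actual swarmkit code), the example below sets the context field early, lets member functions read it directly, and clears it in a deferred closure when initialization fails. The `allocator`, `networkContext`, `allocate`, and `doInit` names and the `allocated` map are simplified, hypothetical stand-ins.

```go
package main

import "errors"

// networkContext is a simplified, hypothetical stand-in for the
// allocator's network state.
type networkContext struct {
	allocated map[string]bool
}

// allocator is a hypothetical stand-in that outlives doInit.
type allocator struct {
	netCtx *networkContext
}

// allocate reads a.netCtx directly instead of taking the context as a
// parameter. It would dereference nil if a.netCtx were not set yet.
func (a *allocator) allocate(id string) error {
	if id == "" {
		return errors.New("empty network id")
	}
	a.netCtx.allocated[id] = true
	return nil
}

// doInit sets a.netCtx before any allocation happens. The named result
// err lets the deferred closure see the returned error and clear the
// field, so a failed init does not leave a stale context behind.
func (a *allocator) doInit(ids []string) (err error) {
	a.netCtx = &networkContext{allocated: make(map[string]bool)}
	defer func() {
		if err != nil {
			a.netCtx = nil
		}
	}()

	for _, id := range ids {
		if allocErr := a.allocate(id); allocErr != nil {
			return allocErr
		}
	}
	return nil
}
```

The named return value is what makes the deferred check work: `return allocErr` assigns to `err` before the deferred closure runs.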

mrjana

LGTM

mrjana · 2016-10-17T21:24:23Z

@aaronlehmann Adding a test case to cover this scenario would be nice. I will try to add one.

aluzzardi · 2016-10-18T21:13:43Z

@aaronlehmann @mrjana Please add a test case in a separate PR

GordonTheTurtle added the status/0-triage label Oct 17, 2016

aaronlehmann added status/2-code-review and removed status/0-triage labels Oct 17, 2016

aaronlehmann added this to the 1.12.3 milestone Oct 17, 2016

aaronlehmann added the priority/P0 label Oct 17, 2016

mrjana reviewed Oct 17, 2016

aaronlehmann force-pushed the allocator-crash branch from dab9090 to a8066c1 Compare October 17, 2016 20:44

mrjana approved these changes Oct 17, 2016

aaronlehmann mentioned this pull request Oct 18, 2016

Protect against wedged managers #1658

Closed

aluzzardi merged commit f8ec492 into moby:master Oct 18, 2016

aaronlehmann deleted the allocator-crash branch October 18, 2016 23:38

aaronlehmann added the process/cherry-picked label Oct 19, 2016

mavenugo added a commit to mavenugo/docker that referenced this pull request Oct 19, 2016

Cherry-pick moby/swarmkit#1651

699f3ca

To identify the issue in allocator. Signed-off-by: Madhu Venugopal <madhu@docker.com>

aaronlehmann mentioned this pull request Oct 19, 2016

Vendor swarmkit for 1.12.3 moby/moby#27554

Merged

mavenugo added a commit to mavenugo/docker that referenced this pull request Oct 22, 2016

Vendoring swarmkit for TCP hack

746ce62

Also Cherry-pick moby/swarmkit#1651 to identify the issue in allocator. Signed-off-by: Madhu Venugopal <madhu@docker.com>
