Add missing locks in agent and service code #1570

Merged: 1 commit merged into moby:master on Nov 30, 2016
Conversation

@aboch (Contributor) commented Nov 22, 2016

Related to moby/moby#28697, moby/moby#28845, and moby/moby#28712.

Signed-off-by: Alessandro Boch [email protected]

@mavenugo (Contributor) commented:

@aboch could you also take care of other missing locks for moby/moby#28712?

I think we need a sweep of the swarm-related changes in libnetwork to look for missing locks and other concurrency issues. cc @sanimej

@aboch (Contributor, Author) commented Nov 22, 2016

@mavenugo Sure thing, I was not aware of that issue.

@aboch changed the title from "Add missing locks in service code" to "Add missing locks in agent and service code" on Nov 22, 2016
@aboch (Contributor, Author) commented Nov 29, 2016

ping @mavenugo @sanimej: this is ready for review.

```diff
@@ -44,6 +44,8 @@ func (l *logWriter) Write(p []byte) (int, error) {
 
 // SetKey adds a new key to the key ring
 func (nDB *NetworkDB) SetKey(key []byte) {
+	nDB.Lock()
+	defer nDB.Unlock()
 	logrus.Debugf("Adding key %s", hex.EncodeToString(key)[0:5])
```
Reviewer: Debug logging should be outside of the lock. Same comment for SetPrimaryKey and RemoveKey as well.

@aboch (author): I intentionally put it inside the lock so that it gets printed when this function effectively does the job.

Reviewer: That is not the purpose of the lock. After printing the debug message, if the lock can't be acquired, the goroutine is going to be stuck. If it's a deadlock, the stack decode will capture the state when the user dumps it.
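For illustration, a minimal sketch of the ordering being asked for; the NetworkDB struct here is a simplified stand-in, not libnetwork's real type:

```go
package sketch

import (
	"encoding/hex"
	"sync"

	"github.com/sirupsen/logrus"
)

// NetworkDB is a toy stand-in for libnetwork's networkdb type; the
// real key ring is more involved than this slice.
type NetworkDB struct {
	sync.Mutex
	keys [][]byte
}

// SetKey adds a new key to the key ring.
func (nDB *NetworkDB) SetKey(key []byte) {
	// The debug line reads no shared state and does not need the
	// mutex; if Lock() below blocks, a goroutine stack dump still
	// shows SetKey waiting on it, as the reviewer notes above.
	logrus.Debugf("Adding key %s", hex.EncodeToString(key)[0:5])
	nDB.Lock()
	defer nDB.Unlock()
	nDB.keys = append(nDB.keys, key)
}
```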

```diff
@@ -136,6 +144,7 @@ func (c *controller) handleKeyChange(keys []*types.EncryptionKey) error {
 			}
 		}
 	}
+	c.Unlock()
```
Reviewer: Calling a.networkDB.SetKey(key.Key) inside the controller lock doesn't look correct; it's not necessary. Also, SetKey takes nDB.Lock, which can potentially lead to deadlocks. We should save the new key and call a.networkDB.SetKey(key.Key) outside of the controller lock.

@aboch (author): How can it deadlock?

Reviewer: Two concurrent goroutines: one takes c.Lock() and tries nDB.Lock(); the second takes nDB.Lock() and tries c.Lock(). We had such deadlocks before, between the network and endpoint locks, IIRC.
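As a self-contained toy of that scenario, with the two mutexes standing in for the controller and NetworkDB locks; under an unlucky interleaving the Go runtime aborts with "all goroutines are asleep - deadlock!":

```go
package main

import "sync"

var (
	c   sync.Mutex // stands in for the controller lock
	nDB sync.Mutex // stands in for the NetworkDB lock
)

func main() {
	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine 1: controller lock first, then NetworkDB lock.
	go func() {
		defer wg.Done()
		c.Lock()
		nDB.Lock() // blocks if goroutine 2 already holds nDB
		nDB.Unlock()
		c.Unlock()
	}()

	// Goroutine 2: the opposite order. If each goroutine grabs its
	// first lock before the other releases, neither can proceed:
	// the classic AB/BA deadlock.
	go func() {
		defer wg.Done()
		nDB.Lock()
		c.Lock() // blocks if goroutine 1 already holds c
		c.Unlock()
		nDB.Unlock()
	}()

	wg.Wait()
}
```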

@aboch (author): Understood, but the nDB object does not have any reference to the agent or controller; it's a leaf object. From what I see, the networkDB methods are properly coded so that they are the only ones to exercise the nDB lock over the nDB data. I don't think the situation you are describing can happen, but I can take care of moving the function call outside of the controller lock if that makes it less confusing.

@aboch (author): I took care of it. PTAL

Reviewer: Yeah, networkdb is currently a separate package without any dependency on the libnetwork core, but it's safer to avoid nested locks unless really required. In this case there is no need to call networkDB.SetKey under the controller lock.

Contributor: And yes, I agree: being conservative when it comes to holding locks is preferred. Especially when we have locks around a function call (with closely related code), it is much better to avoid it.
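Sketching the pattern the thread converges on, with toy stand-ins for the real types (the !same dedup check and error handling of the actual handleKeyChange are elided): record the rotated gossip key while holding the controller lock, and call into networkDB only after unlocking.

```go
package sketch

import "sync"

// Toy stand-ins for libnetwork's types.
type EncryptionKey struct {
	Subsystem string
	Key       []byte
}

type networkDB struct{ sync.Mutex }

// SetKey takes nDB's own lock internally, which is why it must not
// run under the controller lock.
func (nDB *networkDB) SetKey(key []byte) {
	nDB.Lock()
	defer nDB.Unlock()
	// ... update the key ring ...
}

const subsysGossip = "networking:gossip"

type controller struct {
	sync.Mutex
	keys []*EncryptionKey
	nDB  *networkDB
}

func (c *controller) handleKeyChange(keys []*EncryptionKey) {
	var added []byte
	c.Lock()
	for _, key := range keys {
		c.keys = append(c.keys, key)
		if key.Subsystem == subsysGossip {
			added = key.Key // remember it; apply after unlock
		}
	}
	c.Unlock()
	if added != nil {
		c.nDB.SetKey(added) // no controller lock held: no nesting
	}
}
```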

```diff
@@ -339,11 +363,13 @@ func (c *controller) agentClose() {
 		return
 	}
 
+	agent.Lock()
```
@sanimej commented Nov 29, 2016: Can we get the slice of agent.driverCancelFuncs under the lock and do the actual invocation outside of the lock? Only the access to agent.driverCancelFuncs is racy here, but we are locking around the execution of those functions.
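What that suggestion looks like in a hedged sketch, with the agent type reduced to the one field under discussion:

```go
package sketch

import "sync"

type agent struct {
	sync.Mutex
	driverCancelFuncs []func()
}

// close copies the cancel funcs under the lock and invokes them after
// releasing it: only the slice access needs protection, and running
// arbitrary callbacks under a mutex invites deadlocks if a callback
// ever reaches back for the same lock.
func (a *agent) close() {
	a.Lock()
	cancelList := make([]func(), len(a.driverCancelFuncs))
	copy(cancelList, a.driverCancelFuncs)
	a.Unlock()

	for _, cancel := range cancelList {
		cancel()
	}
}
```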

@aboch (author): I will take care of it. Thanks.

@aboch (author): I took care of it. PTAL

@sanimej commented Nov 29, 2016

Thanks @aboch. LGTM

```diff
@@ -127,7 +136,7 @@ func (c *controller) handleKeyChange(keys []*types.EncryptionKey) error {
 		if !same {
 			c.keys = append(c.keys, key)
 			if key.Subsystem == subsysGossip {
-				a.networkDB.SetKey(key.Key)
+				added = key.Key
```
Contributor: I am not too confident about this code path, so consider this a question... It looks like SetKey can potentially be called multiple times, based on how many unique keys of type subsysGossip are present? With this change it seems to be assumed that it will happen only once, and the key is cached in the added variable, which is later used to set the key again. Is that the expectation?

@aboch (author): Yeah, it follows the existing logic for the deleted key (look for the deleted variable). This is the code path for key rotation, where we know there will only be one deleted and one added key. But @sanimej can correct us here; he initially wrote this logic.

Reviewer: @mavenugo What aboch said is correct. The key rotation logic adds one new key to the set and removes one, so before this change as well, only one key would change on a rotation.

```diff
-		return n.ctrlr.agent.networkDB.Peers(n.id)
-	}
-	return []networkdb.PeerInfo{}
+	return agent.networkDB.Peers(n.ID())
```
Contributor: Is it safe to assume that networkDB will never be nil? The current code might behave this way, but why should we build in such assumptions (which might cause panics in the future)? I think it is safe to add the nil check.

@aboch commented Nov 29, 2016: Yes, it is safe, because the networkDB field is instantiated at agent creation and never reset. This is why there is no agent.networkDB == nil check throughout the agent code.
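The invariant aboch is relying on, sketched with hypothetical constructor names (newAgent and newNetworkDB are illustrative, not libnetwork's actual functions):

```go
package sketch

type NetworkDB struct{ /* ... */ }

// agent.networkDB is assigned exactly once, at construction, and is
// never reassigned afterwards; any caller holding a non-nil *agent
// can use agent.networkDB without a nil check.
type agent struct {
	networkDB *NetworkDB // invariant: non-nil for the agent's lifetime
}

func newNetworkDB() (*NetworkDB, error) { return &NetworkDB{}, nil }

// newAgent models the invariant: no agent ever exists with a nil
// networkDB, because construction fails before the agent is built.
func newAgent() (*agent, error) {
	nDB, err := newNetworkDB()
	if err != nil {
		return nil, err
	}
	return &agent{networkDB: nDB}, nil
}
```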

@mavenugo (Contributor) commented:

LGTM

@mavenugo merged commit 6cfa15e into moby:master on Nov 30, 2016
@bklau commented Dec 16, 2016

@mavenugo: Q: Are these locking changes effective at global/swarm scope or just local scope? Thanks

@mavenugo (Contributor) commented:

@bklau it should be effective mostly for global/swarm-scoped networks.

@bklau commented Dec 16, 2016

@mavenugo: Is my understanding correct that, if I have a globally scoped overlay network based on, say, Consul (not 1.12 Swarm mode), and I issue two overlay network create/delete operations for the SAME network "my_net" concurrently, a global lock would be contested and held first before doing any create/delete for "my_net"?
