Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential deadlock between Gossip.SetStorage and Node.gossipStores (inconsistent lock order) #7972

Closed
sasha-s opened this issue Jul 22, 2016 · 2 comments
Assignees
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Milestone

Comments

@sasha-s
Copy link

sasha-s commented Jul 22, 2016

It seems that Gossip.SetStorage and Node.gossipStores grab mutexes in different order.

My understanding is:

  • Gossip.SetStorage grabs gossip lock, calls ReadBootstrapInfo which grabs stores rlock.
  • Node.gossipStores calls VisitStores, which grabs stores rlock, in the closure: Gossip.AddInfoProto->AddInfo, which grabs gossip lock.

Steps to reporoduce (@5151da6461852f57785db18511efd8ccf918b39c), using go-deadlock:

go get github.com/sasha-s/go-deadlock/...
cd src/github.com/cockroachdb
find . -name "*.go" | xargs -n 1 sed -i '' 's/sync.RWMutex/deadlock.RWMutex/'
find . -name "*.go" | xargs -n 1 sed -i '' 's/sync.Mutex/deadlock.Mutex/'
goimports -w .
cd cockroach
make acceptance TESTS='' TESTFLAGS='-v -d 1200s -l .' TESTTIMEOUT=1210s

look into acceptance/TestDockerC/roach_/stderr._

Stack traces from go-deadlock:

POTENTIAL DEADLOCK: Inconsistent locking. saw this ordering in one goroutine:
happened before
src/github.com/cockroachdb/cockroach/gossip/gossip.go:261 gossip.(*Gossip).SetStorage { g.mu.Lock() } <<<<<
src/github.com/cockroachdb/cockroach/server/node.go:437 server.(*Node).initStores { if err := n.ctx.Gossip.SetStorage(n.stores); err != nil { }
src/github.com/cockroachdb/cockroach/server/node.go:332 server.(*Node).start { if err := n.initStores(engines, n.stopper); err != nil { }
src/github.com/cockroachdb/cockroach/server/server.go:402 server.(*Server).Start { if err := s.node.start(unresolvedAddr, s.ctx.Engines, s.ctx.NodeAttributes); err != nil { }
src/github.com/cockroachdb/cockroach/cli/start.go:341 cli.runStart { if err := s.Start(); err != nil { }
src/github.com/spf13/cobra/command.go:600 cobra.(*Command).execute ???
src/github.com/spf13/cobra/command.go:690 cobra.(*Command).ExecuteC ???
src/github.com/spf13/cobra/command.go:649 cobra.(*Command).Execute ???
src/github.com/cockroachdb/cockroach/cli/cli.go:95 cli.Run { return cockroachCmd.Execute() }
src/github.com/cockroachdb/cockroach/main.go:37 main.main { if err := cli.Run(os.Args[1:]); err != nil { }
/usr/local/go/src/runtime/proc.go:188 runtime.main { main_main() }

happened after
src/github.com/cockroachdb/cockroach/storage/stores.go:279 storage.(*Stores).ReadBootstrapInfo { ls.mu.RLock() } <<<<<
src/github.com/cockroachdb/cockroach/gossip/gossip.go:267 gossip.(*Gossip).SetStorage { err := storage.ReadBootstrapInfo(&storedBI) }
src/github.com/cockroachdb/cockroach/server/node.go:437 server.(*Node).initStores { if err := n.ctx.Gossip.SetStorage(n.stores); err != nil { }
src/github.com/cockroachdb/cockroach/server/node.go:332 server.(*Node).start { if err := n.initStores(engines, n.stopper); err != nil { }
src/github.com/cockroachdb/cockroach/server/server.go:402 server.(*Server).Start { if err := s.node.start(unresolvedAddr, s.ctx.Engines, s.ctx.NodeAttributes); err != nil { }
src/github.com/cockroachdb/cockroach/cli/start.go:341 cli.runStart { if err := s.Start(); err != nil { }
src/github.com/spf13/cobra/command.go:600 cobra.(*Command).execute ???
src/github.com/spf13/cobra/command.go:690 cobra.(*Command).ExecuteC ???
src/github.com/spf13/cobra/command.go:649 cobra.(*Command).Execute ???
src/github.com/cockroachdb/cockroach/cli/cli.go:95 cli.Run { return cockroachCmd.Execute() }
src/github.com/cockroachdb/cockroach/main.go:37 main.main { if err := cli.Run(os.Args[1:]); err != nil { }
/usr/local/go/src/runtime/proc.go:188 runtime.main { main_main() }

in another goroutine: happened before
src/github.com/cockroachdb/cockroach/storage/stores.go:118 storage.(*Stores).VisitStores { ls.mu.RLock() } <<<<<
src/github.com/cockroachdb/cockroach/server/node.go:588 server.(*Node).gossipStores { if err := n.stores.VisitStores(func(s *storage.Store) error { }
src/github.com/cockroachdb/cockroach/server/node.go:568 server.(*Node).startGossip.func1 { n.gossipStores() // one-off run before going to sleep }
src/github.com/cockroachdb/cockroach/util/stop/stopper.go:180 stop.(*Stopper).RunWorker.func1 { f() }

happend after
src/github.com/cockroachdb/cockroach/gossip/gossip.go:545 gossip.(*Gossip).AddInfo { g.mu.Lock() } <<<<<
src/github.com/cockroachdb/cockroach/gossip/gossip.go:561 gossip.(*Gossip).AddInfoProto { return g.AddInfo(key, bytes, ttl) }
src/github.com/cockroachdb/cockroach/storage/store.go:1341 storage.(*Store).GossipStore { if err := s.ctx.Gossip.AddInfoProto(gossipStoreKey, storeDesc, ttlStoreGossip); err != nil { }
src/github.com/cockroachdb/cockroach/server/node.go:589 server.(*Node).gossipStores.func1 { s.GossipStore() }
src/github.com/cockroachdb/cockroach/storage/stores.go:121 storage.(*Stores).VisitStores { if err := visitor(s); err != nil { }
src/github.com/cockroachdb/cockroach/server/node.go:588 server.(*Node).gossipStores { if err := n.stores.VisitStores(func(s *storage.Store) error { }
src/github.com/cockroachdb/cockroach/server/node.go:568 server.(*Node).startGossip.func1 { n.gossipStores() // one-off run before going to sleep }
src/github.com/cockroachdb/cockroach/util/stop/stopper.go:180 stop.(*Stopper).RunWorker.func1 { f() }
@bdarnell
Copy link
Contributor

go-deadlock looks pretty cool. @arjunravinarayan and I were just talking about replacing direct use of sync.Mutex with a wrapper that could add extra instrumentation (we were thinking of the AssertHeld method from google's C++ mutex implementation); maybe we could set up support for go-deadlock behind a build tag and incorporate this into our testing process.

@tbg
Copy link
Member

tbg commented Jul 24, 2016

+1
Alone seeing the goroutine which has everyone else sit on semacquire is
going to save us time.
We could also jitter (or better) lock durations, though I don't know
whether it would turn up much.

On Sun, Jul 24, 2016, 23:31 Ben Darnell [email protected] wrote:

go-deadlock looks pretty cool. @arjunravinarayan
https://github.com/arjunravinarayan and I were just talking about
replacing direct use of sync.Mutex with a wrapper that could add extra
instrumentation (we were thinking of the AssertHeld method from google's
C++ mutex implementation); maybe we could set up support for go-deadlock
behind a build tag and incorporate this into our testing process.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#7972 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AE135GSSD3MNNfvzJ5Y_U0jVxmoFl-PXks5qY9mdgaJpZM4JSV0j
.

-- Tobias

@petermattis petermattis added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Jul 25, 2016
@petermattis petermattis added this to the Q3 milestone Jul 25, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Projects
None yet
Development

No branches or pull requests

5 participants