Conversation
Codecov Report
@@ Coverage Diff @@
## master #1705 +/- ##
==========================================
+ Coverage 56.03% 56.08% +0.05%
==========================================
Files 606 606
Lines 80293 80333 +40
==========================================
+ Hits 44989 45057 +68
+ Misses 31769 31737 -32
- Partials 3535 3539 +4
Continue to review full report at Codecov.
aWallets atomic.Value
...
wg sync.WaitGroup
Given the number of properties backend already has, I'd consider a much more descriptive variable name for this wait group (and a small comment explaining why it may be needed): closeReadyWG, or something along those lines.
So it looks like we may just be able to remove this wg, since the go-routine it was waiting for doesn't appear to be necessary.
quit chan struct{}   // blockchain quit channel
wg   sync.WaitGroup  // chain processing wait group for shutting down
wgMu sync.Mutex      // chain processing wait group for shutting down
Is this mutex intended to make sure the Wait happens after the first Add? AFAIK sync.WaitGroup is thread-safe in its use and there shouldn't be any need to synchronize access to it, but I may be missing some border case.
This mutex ensures that Wait and Add cannot be called concurrently; see golang/go#23842.
Ah, right, it's thread-safe if you call Add when the count is already > 0, but if it's exactly 0 then it's not thread-safe. Either a broken API or a weird use on our part.
Given that in this particular use case Add should never be called from a different goroutine without the other lock, I think we'd better encapsulate this in a new type (DynamicWaitGroup, or whatever name you can think of) with its own internal mutex, to avoid adding one more reference to the blockchain type, and also to make sure that any new code written here cannot accidentally call Add without the lock.
Good suggestion, but I've actually cherry-picked a fix from upstream that was merged during the lifetime of this PR, because I think that will cause fewer problems when merging.
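For reference, the type suggested above could look something like this. This is only a sketch of the reviewer's idea, not the cherry-picked upstream fix; the name DynamicWaitGroup and its API are hypothetical, taken from the review comment.

```go
package main

import "sync"

// DynamicWaitGroup wraps sync.WaitGroup so that Add and Wait are
// serialized by an internal mutex, avoiding the race when Add is
// called while the counter is at zero (golang/go#23842).
type DynamicWaitGroup struct {
	mu     sync.Mutex
	closed bool
	wg     sync.WaitGroup
}

// Add registers n new tasks. It returns false once Wait has been
// called, so callers cannot re-arm the group after shutdown began.
func (d *DynamicWaitGroup) Add(n int) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.closed {
		return false
	}
	d.wg.Add(n)
	return true
}

// Done marks one task as complete.
func (d *DynamicWaitGroup) Done() { d.wg.Done() }

// Wait rejects further Adds, then blocks until all tasks are done.
func (d *DynamicWaitGroup) Wait() {
	d.mu.Lock()
	d.closed = true
	d.mu.Unlock()
	d.wg.Wait()
}

func main() {
	var d DynamicWaitGroup
	if d.Add(1) {
		go d.Done()
	}
	d.Wait()
	if d.Add(1) {
		panic("Add after Wait should be rejected")
	}
}
```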
db.lock.RLock()
node, ok := db.dirties[hash]
if !ok {
Isn't this the same as moving the RUnlock call right after the db.dirties[hash] read?
Yes it is :) thank you!
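The simplification agreed above, sketched in isolation: release the read lock immediately after the map read instead of holding it across the whole branch. The names db and dirties mirror the diff context; the surrounding types are stubbed for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

type node struct{ data string }

type Database struct {
	lock    sync.RWMutex
	dirties map[string]*node
}

// dirty looks up a dirty node, holding the read lock only for the
// duration of the map access, as suggested in the review.
func (db *Database) dirty(hash string) (*node, bool) {
	db.lock.RLock()
	n, ok := db.dirties[hash]
	db.lock.RUnlock() // released right after the read
	return n, ok
}

func main() {
	db := &Database{dirties: map[string]*node{"h1": {data: "x"}}}
	if n, ok := db.dirty("h1"); ok {
		fmt.Println(n.data)
	}
}
```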
I'm confused about some changes. Are the …
Hey @hbandura, they are meant to be part of this PR. I think the commit messages make this clear.
I would suggest just checking this PR commit by commit for a better understanding of all the individual changes that it contains. We could still split these changes out, if you think that would be better. I generally try to keep changes with the thing that precipitated them, since that makes it clear why the changes are needed; otherwise you end up with PRs having changes that don't seem to serve any purpose. That's why I kept all these changes together: they are all needed for this test to work.
Coverage from tests in …
coverage: 51.4% of statements across all listed packages
coverage: 64.1% of statements in consensus/istanbul
coverage: 51.7% of statements in consensus/istanbul/backend
coverage: 0.0% of statements in consensus/istanbul/backend/backendtest
coverage: 50.0% of statements in consensus/istanbul/backend/internal/db
coverage: 46.0% of statements in consensus/istanbul/backend/internal/enodes
coverage: 24.3% of statements in consensus/istanbul/backend/internal/replica
coverage: 70.8% of statements in consensus/istanbul/core
coverage: 0.0% of statements in consensus/istanbul/proxy
coverage: 75.3% of statements in consensus/istanbul/uptime
coverage: 100.0% of statements in consensus/istanbul/uptime/store
coverage: 51.8% of statements in consensus/istanbul/validator
coverage: 79.2% of statements in consensus/istanbul/validator/random
}
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
sb.wg.Add(1)
Doesn't this wait group face the same problem with the concurrent Add? Or are Commit and Close both called from the same thread?
Well spotted, this is not safe. I'm thinking of just removing the use of the go-routine. It doesn't seem to affect the tests and it makes the code simpler.
hbandura left a comment
Approving it, but with some comments; feel free to address them.
}

// Close locks the mutex, then closes it.
func (cm *ClosableMutex) Close() {
Should Close work if the mutex is locked?
Hey @hernan, this mutex was added by the go-ethereum team; it's part of the fix I cherry-picked from upstream.
So they would know best, but my understanding is that it should wait to acquire the lock before closing the channel, since closing the channel prior to obtaining the lock could result in calls to Unlock panicking.
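A sketch of the channel-based ClosableMutex being discussed, paraphrasing the cherry-picked upstream fix (exact upstream details may differ slightly): a buffered channel of capacity one holds a token when the mutex is unlocked, and Close drains the token before closing the channel, so an in-flight critical section can never see the channel close under it.

```go
package main

// ClosableMutex is a mutex that can be permanently shut down.
// A token in ch means "unlocked"; a closed ch means "closed".
type ClosableMutex struct {
	ch chan struct{}
}

func NewClosableMutex() *ClosableMutex {
	ch := make(chan struct{}, 1)
	ch <- struct{}{} // token present => unlocked
	return &ClosableMutex{ch}
}

// TryLock blocks until the lock is available or the mutex is closed.
// It returns false if the mutex is closed.
func (cm *ClosableMutex) TryLock() bool {
	_, ok := <-cm.ch
	return ok
}

// Unlock returns the token. Call only after a successful TryLock.
func (cm *ClosableMutex) Unlock() {
	select {
	case cm.ch <- struct{}{}:
	default:
		panic("Unlock of already-unlocked ClosableMutex")
	}
}

// Close waits to acquire the lock, then closes the channel, so all
// future TryLock calls fail.
func (cm *ClosableMutex) Close() {
	_, ok := <-cm.ch // wait for any current holder to Unlock
	if !ok {
		panic("Close of already-closed ClosableMutex")
	}
	close(cm.ch)
}

func main() {
	cm := NewClosableMutex()
	if !cm.TryLock() {
		panic("expected lock to succeed")
	}
	cm.Unlock()
	cm.Close()
	if cm.TryLock() {
		panic("TryLock after Close must fail")
	}
}
```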
@@ -0,0 +1,64 @@
// Copyright 2021 The go-ethereum Authors
This is the fix cherry-picked from go-ethereum.
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)

sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
Is running this in a goroutine no longer necessary due to the other synchronization added in worker?
I don't know if it ever was necessary. All I did was remove the go-routine and nothing broke.
var cancel context.CancelFunc
var wg sync.WaitGroup

// Ensure that block construction is complete before exiting this function.
Expand on why waiting for block construction is needed before exiting, if possible.
I think what I meant there was just that we wait for all started go-routines to finish before exiting; it's the process of block construction that starts those go-routines, but it's not actually important that block construction has finished, just that all started go-routines have.
So what about:
// Ensure that all started go-routines finish before exiting this function.
In order to run small scale e2e tests reliably we need to be able to let nodes sync from just one other node. This commit allows the minimum sync peers to be configured, where previously it was hardcoded to 5.
Now dialHistoryExpiration and inboundThrottleTime can be configured via config. This allows for the writing of fast tests.
Ensure that Add is called before Wait.
backend.Backend.Commit was starting a goroutine but not waiting for it to complete. This commit adds a wait group to wait for the goroutine to complete when backend.Backend is closed.
Also make miner wait for goroutines to end at Shutdown.
Ensure Add is called before Wait
core.current was being read via calls to core.ParentCommits from the
worker's main loop, and written via calls to core.Stop coming from the
miner's update loop.
The fix is to add a mutex to synchronise access to current.
==================
WARNING: DATA RACE
Write at 0x00c00114dd98 by goroutine 329:
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
github.com/celo-org/celo-blockchain/miner.(*worker).stop()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
github.com/celo-org/celo-blockchain/miner.(*Miner).update()
/home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb
Previous read at 0x00c00114dd98 by goroutine 471:
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).ParentCommits()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:204 +0x4a
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:989 +0x38a
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
github.com/celo-org/celo-blockchain/miner.prepareBlock()
/home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c
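The fix described in the commit message above can be sketched as follows. The types are heavily simplified stand-ins for the istanbul core (the real round state carries far more than parent commits); only the locking pattern is the point.

```go
package main

import (
	"fmt"
	"sync"
)

type roundState struct{ parentCommits []string }

type core struct {
	mu      sync.RWMutex
	current *roundState
}

// ParentCommits now takes the read lock before touching current.
func (c *core) ParentCommits() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.current == nil {
		return nil
	}
	return c.current.parentCommits
}

// Stop performs the write that previously raced with ParentCommits.
func (c *core) Stop() {
	c.mu.Lock()
	c.current = nil
	c.mu.Unlock()
}

func main() {
	c := &core{current: &roundState{parentCommits: []string{"c1"}}}
	fmt.Println(c.ParentCommits())
	c.Stop()
	fmt.Println(c.ParentCommits() == nil)
}
```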
core.resetResendRoundChangeTimer was being called from the main engine
go-routine and core.stopResendRoundChangeTimer was being called via
core.Stop from the miner's update go-routine. Both accessed
resendRoundChangeMessageTimer without synchronisation.
This commit adds a mutex to ensure synchronised access to
resendRoundChangeMessageTimer.
==================
WARNING: DATA RACE
Read at 0x00c01377f568 by goroutine 72:
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopResendRoundChangeTimer()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:710 +0x47
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopAllTimers()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:719 +0x54
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:61 +0x44
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
github.com/celo-org/celo-blockchain/miner.(*worker).stop()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
github.com/celo-org/celo-blockchain/miner.(*Miner).update()
/home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb
Previous write at 0x00c01377f568 by goroutine 312:
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resetResendRoundChangeTimer()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:780 +0x368
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resendRoundChangeMessage()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:795 +0x7b
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleResendRoundChangeEvent()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:247 +0x353
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleEvents()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:143 +0x77e
The current round state of the engine was being read without
synchronisation via core.CurrentView by the miner; the current round
state was also being written in the engine's Stop function.
The fix is to use the pre-existing mutex to synchronise accesses through
CurrentView.
==================
WARNING: DATA RACE
Read at 0x00c000bb20b8 by goroutine 506:
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).CurrentView()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:195 +0x4a
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.waitCoreToReachSequence()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1203 +0x24c
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:982 +0x130
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
github.com/celo-org/celo-blockchain/miner.prepareBlock()
/home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c
Previous write at 0x00c000bb20b8 by goroutine 93:
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
github.com/celo-org/celo-blockchain/miner.(*worker).stop()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
github.com/celo-org/celo-blockchain/miner.(*Miner).update()
/home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb
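The CurrentView fix follows the same pattern, sketched below with simplified stand-in types: reads of the engine's current round state take the read side of the pre-existing mutex, matching the write lock held by Stop.

```go
package main

import "sync"

type view struct{ sequence, round uint64 }

type core struct {
	currentMu sync.RWMutex
	current   *view
}

// CurrentView now reads the round state under the read lock.
func (c *core) CurrentView() *view {
	c.currentMu.RLock()
	defer c.currentMu.RUnlock()
	return c.current
}

// Stop performs the write that raced with the miner's read.
func (c *core) Stop() {
	c.currentMu.Lock()
	c.current = nil
	c.currentMu.Unlock()
}

func main() {
	c := &core{current: &view{sequence: 10, round: 0}}
	if c.CurrentView().sequence != 10 {
		panic("unexpected sequence")
	}
	c.Stop()
	if c.CurrentView() != nil {
		panic("expected nil view after Stop")
	}
}
```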
The eth.chainSyncer main loop periodically calls
eth.chainSyncer.startSync which fires off a sync operation in its own
goroutine which will eventually result in Downloader.Cancel being
called.
The eth.chainSyncer also calls Downloader.Cancel when exiting its main
loop.
This resulted in a data race on Downloader.ancientLimit; the solution
was to add a mutex to lock over Downloader.ancientLimit in
Downloader.Cancel.
==================
WARNING: DATA RACE
Write at 0x00c01a0e3ce0 by goroutine 466:
github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
/home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).synchronise()
/home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:444 +0x601
github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Synchronise()
/home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:354 +0xb5
github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).doSync()
/home/pierspowlesland/projects/celo-blockchain/eth/sync.go:327 +0x1ad
github.com/celo-org/celo-blockchain/eth.(*chainSyncer).startSync.func1()
/home/pierspowlesland/projects/celo-blockchain/eth/sync.go:303 +0x59
Previous write at 0x00c01a0e3ce0 by goroutine 182:
github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
/home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Terminate()
/home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:635 +0xcc
github.com/celo-org/celo-blockchain/eth.(*chainSyncer).loop()
/home/pierspowlesland/projects/celo-blockchain/eth/sync.go:232 +0x65b
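The downloader fix, reduced to its essence (the real Downloader has many more fields; the mutex name is illustrative): Cancel can be reached concurrently from the per-sync goroutine and from Terminate in chainSyncer's loop, so the write to ancientLimit is done under a mutex.

```go
package main

import "sync"

type Downloader struct {
	ancientMu    sync.Mutex
	ancientLimit uint64
}

// Cancel resets ancientLimit under a lock; previously this was an
// unsynchronised write reachable from two goroutines.
func (d *Downloader) Cancel() {
	d.ancientMu.Lock()
	d.ancientLimit = 0
	d.ancientMu.Unlock()
}

func main() {
	d := &Downloader{ancientLimit: 90000}
	var wg sync.WaitGroup
	// Model the two race participants: the sync-operation goroutine
	// and the chainSyncer loop's Terminate path, both reaching Cancel.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.Cancel()
		}()
	}
	wg.Wait()
	if d.ancientLimit != 0 {
		panic("ancientLimit should be reset")
	}
}
```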
p2p.PurposeFlag is not thread-safe but was being accessed from many
go-routines without synchronisation via p2p.Peer.HasPurpose.
The solution was to lock over the access in HasPurpose using the mutex
that already existed to synchronise access to a peer's purposes flag.
==================
WARNING: DATA RACE
Read at 0x00c001520680 by goroutine 291:
github.com/celo-org/celo-blockchain/p2p.(*Peer).HasPurpose()
/home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:167 +0x5fd
github.com/celo-org/celo-blockchain/eth.(*peer).PurposeIsSet()
/home/pierspowlesland/projects/celo-blockchain/eth/peer.go:696 +0x516
github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).FindPeers()
/home/pierspowlesland/projects/celo-blockchain/eth/handler.go:1055 +0x612
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*validatorPeerHandler).ReplaceValidatorPeers()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/peer_handler.go:147 +0x201
github.com/celo-org/celo-blockchain/consensus/istanbul/backend/internal/enodes.(*ValidatorEnodeDB).RefreshValPeers()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/internal/enodes/val_enode_db.go:480 +0x641
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).RefreshValPeers()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/backend.go:846 +0x264
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StartValidating()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:691 +0x3a5
github.com/celo-org/celo-blockchain/miner.(*worker).start()
/home/pierspowlesland/projects/celo-blockchain/miner/worker.go:210 +0x1a2
github.com/celo-org/celo-blockchain/miner.(*Miner).update()
/home/pierspowlesland/projects/celo-blockchain/miner/miner.go:161 +0x6a4
Previous write at 0x00c001520680 by goroutine 343:
github.com/celo-org/celo-blockchain/p2p.(*Peer).AddPurpose()
/home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:148 +0xd5
github.com/celo-org/celo-blockchain/p2p.(*Server).run.func3()
/home/pierspowlesland/projects/celo-blockchain/p2p/server.go:866 +0x2d1
github.com/celo-org/celo-blockchain/p2p.(*Server).run()
/home/pierspowlesland/projects/celo-blockchain/p2p/server.go:916 +0x241a
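Sketched with stand-in names (the flag value and mutex name are illustrative): the peer's purposes flag was already written under a mutex in AddPurpose, so HasPurpose simply takes the same mutex for its read.

```go
package main

import (
	"fmt"
	"sync"
)

type PurposeFlag uint32

const ValidatorPurpose PurposeFlag = 1 << 0

type Peer struct {
	purposesMu sync.Mutex
	purposes   PurposeFlag
}

// AddPurpose was already synchronised on the purposes mutex.
func (p *Peer) AddPurpose(f PurposeFlag) {
	p.purposesMu.Lock()
	p.purposes |= f
	p.purposesMu.Unlock()
}

// HasPurpose previously read purposes without the lock; it now uses
// the same mutex as the writers.
func (p *Peer) HasPurpose(f PurposeFlag) bool {
	p.purposesMu.Lock()
	defer p.purposesMu.Unlock()
	return p.purposes&f != 0
}

func main() {
	p := &Peer{}
	p.AddPurpose(ValidatorPurpose)
	fmt.Println(p.HasPurpose(ValidatorPurpose))
}
```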
announceRunning is set in istanbul.Backend.StartAnnouncing and
istanbul.Backend.StopAnnouncing and also was being accessed without
synchronisation from within istanbul.Backend.newChainHead, which is
executed repeatedly from within its own go-routine.
This commit read-locks over announceRunning in
istanbul.Backend.newChainHead to avoid a data race.
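The guard can be sketched as follows; the mutex name is illustrative, and newChainHead is reduced here to just the guarded read (returning whether the announce logic would run, purely so the sketch is testable).

```go
package main

import "sync"

type Backend struct {
	announceMu      sync.RWMutex
	announceRunning bool
}

func (b *Backend) StartAnnouncing() {
	b.announceMu.Lock()
	b.announceRunning = true
	b.announceMu.Unlock()
}

func (b *Backend) StopAnnouncing() {
	b.announceMu.Lock()
	b.announceRunning = false
	b.announceMu.Unlock()
}

// newChainHead takes the read lock for what was previously an
// unsynchronised read of announceRunning.
func (b *Backend) newChainHead() bool {
	b.announceMu.RLock()
	defer b.announceMu.RUnlock()
	return b.announceRunning
}

func main() {
	b := &Backend{}
	b.StartAnnouncing()
	if !b.newChainHead() {
		panic("expected announce to be running")
	}
	b.StopAnnouncing()
	if b.newChainHead() {
		panic("expected announce to be stopped")
	}
}
```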
Unfortunately we can't see both sides of this data race; it seems the
stack was corrupted somehow, since the first stack trace cannot be
correct (how can fmt.Fscanf call TestStartStopValidators?). I've
shortened the stack with an ellipsis because it was also very long.
==================
WARNING: DATA RACE
Write at 0x00c001448520 by goroutine 354:
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopAnnouncing()
/home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:749 +0x12b
github.com/celo-org/celo-blockchain/eth.(*Ethereum).stopAnnounce()
/home/pierspowlesland/projects/celo-blockchain/eth/backend.go:507 +0x2ad
github.com/celo-org/celo-blockchain/eth.(*Ethereum).Stop()
/home/pierspowlesland/projects/celo-blockchain/eth/backend.go:575 +0x2ae
github.com/celo-org/celo-blockchain/node.(*Node).stopServices()
/home/pierspowlesland/projects/celo-blockchain/node/node.go:309 +0x150
github.com/celo-org/celo-blockchain/node.(*Node).Close()
/home/pierspowlesland/projects/celo-blockchain/node/node.go:221 +0x1bb
github.com/celo-org/celo-blockchain/test.(*Node).Close()
/home/pierspowlesland/projects/celo-blockchain/test/node.go:312 +0x3b8
github.com/celo-org/celo-blockchain/test.Network.Shutdown()
/home/pierspowlesland/projects/celo-blockchain/test/node.go:498 +0x9a
runtime.call32()
/usr/local/go/src/runtime/asm_amd64.s:551 +0x3d
testing.(*T).FailNow()
<autogenerated>:1 +0x44
github.com/stretchr/testify/require.NoError()
/home/pierspowlesland/go/pkg/mod/github.com/stretchr/testify@v1.4.0/require/require.go:974 +0x104
github.com/celo-org/celo-blockchain/e2e_test_test.TestStartStopValidators()
/home/pierspowlesland/projects/celo-blockchain/e2e_test/e2e_test.go:168 +0x15ce
fmt.Fscanf()
/usr/local/go/src/fmt/scan.go:143 +0xee
fmt.Sscanf()
/usr/local/go/src/fmt/scan.go:114 +0x191
github.com/syndtr/goleveldb/leveldb/storage.fsParseName()
/home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/storage/file_storage.go:643 +0xa6
github.com/syndtr/goleveldb/leveldb/storage.(*fileStorage).List()
/home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/storage/file_storage.go:458 +0x309
github.com/syndtr/goleveldb/leveldb.(*DB).checkAndCleanFiles()
/home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/db_util.go:52 +0x301
github.com/syndtr/goleveldb/leveldb.openDB()
/home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/db.go:136 +0x9bb
fmt.(*ss).doScanf()
/usr/local/go/src/fmt/scan.go:1230 +0x411
fmt.Fscanf()
...
Previous read at 0x00c001448520 by goroutine 203:
[failed to restore the stack]
Removing the go-routine does not seem to have any effect on the tests.
It seems that very occasionally we are hitting this error condition and it's not clear why.
…2853) This change removes misuses of sync.WaitGroup in BlockChain. Before this change, block insertion modified the WaitGroup counter in order to ensure that Stop would wait for pending operations to complete. This was racy and could even lead to crashes if Stop was called at an unfortunate time. The issue is resolved by adding a specialized 'closable' mutex, which prevents chain modifications after stopping while also synchronizing writers with each other.
Co-authored-by: Felix Lange <fjl@twurst.com>
The start stop test would sometimes time out while nodes were stopped and the Network.Shutdown method would print errors relating to trying to stop a node twice. The shutdown mechanism has been updated to allow for more flexibility in deciding what errors are printed, allowing us to avoid printing double shutdown errors when we have manually stopped nodes.
Description
This PR adds a test that starts and stops nodes in a small network and verifies that the network can continue to function correctly throughout the process.
A quick overview of what the test does:
Other changes
This PR makes the minSyncPeers, dialHistoryExpiration and InboundThrottleTime values configurable; in order for this test to run (and run quickly) these values needed to be set, and they were previously hardcoded.
This PR also ensures correct shutdown of the miner, worker and backend.Backend, and fixes a number of data races and race conditions. Each fix is described in the associated commit message; it's probably easier to review this PR commit by commit, since these changes otherwise look unrelated.
I think this PR would be best merged rather than 'squashed and merged' due to the large number of commits containing unrelated fixes.
Upstream PRs that came out of this work
Tested
I managed to run this test 200 times on my local machine with no failures.
Related issues
#1642
Backwards compatibility
Yes