node: close ledger and part keys on node shutdown#6039
Merged
algorandskiy merged 3 commits intoalgorand:masterfrom Jun 25, 2024
Merged
node: close ledger and part keys on node shutdown#6039algorandskiy merged 3 commits intoalgorand:masterfrom
algorandskiy merged 3 commits intoalgorand:masterfrom
Conversation
* node_test runs multiple nodes in a single process that leads to file descriptors leak (see algorand#5057 for more details). * just closing ledger is not enough because of concurrent operations evaluation operations done by transaction pool. * made transaction pool shutdown-able and stop it before ledger termination.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6039 +/- ##
==========================================
- Coverage 55.86% 54.84% -1.03%
==========================================
Files 482 482
Lines 68576 68593 +17
==========================================
- Hits 38311 37617 -694
- Misses 27659 28325 +666
- Partials 2606 2651 +45 ☔ View full report in Codecov by Sentry. |
gmalouf
reviewed
Jun 25, 2024
Contributor
gmalouf
left a comment
There was a problem hiding this comment.
Asked two questions, other than that makes sense.
cce
reviewed
Jun 25, 2024
cce
previously approved these changes
Jun 25, 2024
cce
approved these changes
Jun 25, 2024
gmalouf
approved these changes
Jun 25, 2024
cce
reviewed
Jun 25, 2024
algorandskiy
added a commit
to algorandskiy/go-algorand
that referenced
this pull request
Jun 26, 2024
* algorand#6039 discovered blockNotifier preserves state between ledger reloads that breaks assumptions on how trackers work * Made node to clear and re-register block listeners explicitly on fast catchup. * Also removed unused blockListeners argument from data.Ledger
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Deadlock example as seen in this p2p build
POTENTIAL DEADLOCK: Previous place where the lock was grabbed goroutine 412 lock 0xc0006fe300 node.go:446 node.(*AlgorandFullNode).Stop { node.mu.Lock() } <<<<< node_test.go:938 node.TestNodeHybridTopology { } }Have been trying to lock it again for more than 30s
goroutine 17114 lock 0xc0006fe300
node.go:1125 node.(*AlgorandFullNode).oldKeyDeletionThread { node.mu.Lock() } <<<<<
Here is what goroutine 412 doing now
goroutine 412 [semacquire]:
sync.runtime_Semacquire(0xc0094269b0?)
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0094269a8)
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/sync/waitgroup.go:116 +0xa5
github.com/algorand/go-algorand/ledger.(*blockNotifier).close(0xc009426958)
/opt/cibuild/project/ledger/notifier.go:82 +0x97
github.com/algorand/go-algorand/ledger.(*trackerRegistry).close(0xc009426a80)
/opt/cibuild/project/ledger/tracker.go:506 +0x104
github.com/algorand/go-algorand/ledger.(*Ledger).Close(0xc009426000)
/opt/cibuild/project/ledger/ledger.go:416 +0x111
github.com/algorand/go-algorand/node.(*AlgorandFullNode).Stop(0xc0006fe300)
/opt/cibuild/project/node/node.go:473 +0x717
github.com/algorand/go-algorand/node.TestNodeHybridTopology(0xc006d53860)
/opt/cibuild/project/node/node_test.go:938 +0xe2a
testing.tRunner(0xc006d53860, 0x30e34c8)
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/testing/testing.go:1595 +0x262
created by testing.(*T).Run in goroutine 1
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/testing/testing.go:1648 +0x846
Other goroutines holding locks:
goroutine 617 lock 5a4
../data/pools/transactionPool.go:530 pools.(*TransactionPool).OnNewBlock { pool.mu.Lock() } <<<<<
../ledger/notifier.go:67 ledger.(*blockNotifier).worker { listener.OnNewBlock(blk.block, blk.delta) }
goroutine 617 lock 3f
../data/pools/transactionPool.go:781 pools.(*TransactionPool).recomputeBlockEvaluator { pool.assemblyMu.Lock() } <<<<<
../data/pools/transactionPool.go:560 pools.(*TransactionPool).OnNewBlock { stats = pool.recomputeBlockEvaluator(committedTxids, knownCommitted) }
../ledger/notifier.go:67 ledger.(*blockNotifier).worker { listener.OnNewBlock(blk.block, blk.delta) }
goroutine 412 lock 32
../ledger/ledger.go:412 ledger.(*Ledger).Close { l.trackerMu.Lock() } <<<<<
node.go:473 node.(*AlgorandFullNode).Stop { node.ledger.Close() }
node_test.go:938 node.TestNodeHybridTopology { } }
POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 412 lock 0xc009426fe8
../ledger/ledger.go:412 ledger.(*Ledger).Close { l.trackerMu.Lock() } <<<<<
node.go:473 node.(*AlgorandFullNode).Stop { node.ledger.Close() }
node_test.go:938 node.TestNodeHybridTopology { } }
Have been trying to lock it again for more than 30s
goroutine 617 lock 0xc009426fe8
../ledger/ledger.go:636 ledger.(*Ledger).LookupWithoutRewards { l.trackerMu.RLock() } <<<<<
../ledger/eval/eval.go:200 eval.(*roundCowBase).lookup { ad, _, err := x.l.LookupWithoutRewards(x.rnd, addr) }
../ledger/eval/cow.go:186 eval.(*roundCowState).lookup { return cb.lookupParent.lookup(addr) }
../ledger/eval/eval.go:1910 eval.(*BlockEvaluator).GenerateBlock { acct, err := eval.state.lookup(addrs[i]) }
../data/pools/transactionPool.go:794 pools.(*TransactionPool).recomputeBlockEvaluator { lvb, err := pool.pendingBlockEvaluator.GenerateBlock(pool.getVotingAccountsForRound(evalRnd)) }
../data/pools/transactionPool.go:560 pools.(*TransactionPool).OnNewBlock { stats = pool.recomputeBlockEvaluator(committedTxids, knownCommitted) }
../ledger/notifier.go:67 ledger.(*blockNotifier).worker { listener.OnNewBlock(blk.block, blk.delta) }
Additionally added part keys closing.
Monitored with
while pgrep node.test > /dev/null; do lsof -p $(pgrep node.test) | wc -l; sleep 1; doneon p2p branch:Test Plan
Existing tests should pass