
fix: discovery AsyncFilter deadlock on shutdown #32572

Merged
cskiraly merged 1 commit into ethereum:master from zzzckck:fix_shutdown_hang on Oct 2, 2025

Conversation

@zzzckck (Contributor) commented Sep 10, 2025

Description

Hi, we found an occasional node hang issue on BSC. I think Geth may also have this issue, so I am porting the fix here.
The fix in the BSC repo: bnb-chain/bsc#3347

When the hang occurs, two goroutines are stuck.

  • Routine 1: AsyncFilter(...)
    On node start, it runs part of the DiscoveryV4 protocol, which can take considerable time. Here is its call stack when hung:
```
goroutine 9711 [chan receive]:  // this routine was stuck on read channel: `<-f.slots`
github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:206 +0x125
created by github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter in goroutine 1
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:192 +0x205
```

  • Routine 2: Node Stop
    This is the main routine shutting down the process. It gets stuck while shutting down the discovery components: it tries to drain the f.slots channel, but the extra slot is never returned.
```
goroutine 11796 [chan receive]:
github.com/ethereum/go-ethereum/p2p/enode.(*asyncFilterIter).Close.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:248 +0x5c
sync.(*Once).doSlow(0xc032a97cb8?, 0xc032a97d18?)
	sync/once.go:78 +0xab
sync.(*Once).Do(...)
	sync/once.go:69
github.com/ethereum/go-ethereum/p2p/enode.(*asyncFilterIter).Close(0xc092ff8d00?)
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:244 +0x36
github.com/ethereum/go-ethereum/p2p/enode.(*bufferIter).Close.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:299 +0x24
sync.(*Once).doSlow(0x11a175f?, 0x2bfe63e?)
	sync/once.go:78 +0xab
sync.(*Once).Do(...)
	sync/once.go:69
github.com/ethereum/go-ethereum/p2p/enode.(*bufferIter).Close(0x30?)
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:298 +0x36
github.com/ethereum/go-ethereum/p2p/enode.(*FairMix).Close(0xc0004bfea0)
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:379 +0xb7
github.com/ethereum/go-ethereum/eth.(*Ethereum).Stop(0xc000997b00)
	github.com/ethereum/go-ethereum/eth/backend.go:960 +0x4a
github.com/ethereum/go-ethereum/node.(*Node).stopServices(0xc0001362a0, {0xc012e16330, 0x1, 0xc000111410?})
	github.com/ethereum/go-ethereum/node/node.go:333 +0xb3
github.com/ethereum/go-ethereum/node.(*Node).Close(0xc0001362a0)
	github.com/ethereum/go-ethereum/node/node.go:263 +0x167
created by github.com/ethereum/go-ethereum/cmd/utils.StartNode.func1.1 in goroutine 9729
	github.com/ethereum/go-ethereum/cmd/utils/cmd.go:101 +0x78
```

The root cause of the hang is the extra slot, which was designed to ensure the goroutines started by AsyncFilter(...) can finish. This PR fixes it by making sure that extra slot can always be released when the node shuts down.
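
To make the shutdown pattern concrete, here is a minimal, self-contained Go sketch of the general idea; the names (filterIter, slots, closeCh, worker) are invented for illustration and this is not the actual p2p/enode code. A buffered channel of slots acts as a semaphore, Close drains every slot so it only returns after all in-flight checks are done, and a dedicated shutdown channel, closed before the drain, guarantees that no worker can stay blocked waiting for a slot while Close is waiting for that slot to come back.

```go
// Illustrative sketch only: names and structure are assumptions,
// not the real go-ethereum asyncFilterIter implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

type filterIter struct {
	slots   chan struct{} // semaphore; Close drains cap(slots) tokens
	closeCh chan struct{} // closed on shutdown to unblock waiting workers
	once    sync.Once
}

func newFilterIter(workers int) *filterIter {
	f := &filterIter{
		slots:   make(chan struct{}, workers+1), // the "extra slot" mentioned above
		closeCh: make(chan struct{}),
	}
	for i := 0; i < cap(f.slots); i++ {
		f.slots <- struct{}{} // fill the semaphore
	}
	return f
}

// worker stands in for an AsyncFilter goroutine: it must hold a slot
// while it runs a potentially slow check (e.g. a discovery round trip).
func (f *filterIter) worker() {
	for {
		select {
		case <-f.slots: // acquire a slot
		case <-f.closeCh: // shutting down: stop waiting for a slot
			return
		}
		time.Sleep(10 * time.Millisecond) // stand-in for the slow filter check
		f.slots <- struct{}{}             // always return the slot
		select {
		case <-f.closeCh:
			return
		default:
		}
	}
}

// Close signals shutdown first, then drains all slots, so it only returns
// once every held slot has been given back. Closing closeCh is what keeps
// workers from staying parked on <-f.slots during the drain.
func (f *filterIter) Close() {
	f.once.Do(func() {
		close(f.closeCh)
		for i := 0; i < cap(f.slots); i++ {
			<-f.slots
		}
	})
}

func main() {
	f := newFilterIter(4)
	for i := 0; i < 4; i++ {
		go f.worker()
	}
	time.Sleep(50 * time.Millisecond)
	f.Close() // returns promptly instead of hanging
	fmt.Println("shut down cleanly")
}
```

In the real iterator the slow check and the source iterator can block far longer than a short sleep, which is what turns a missing wake-up path at shutdown into the permanent hang shown in the goroutine dumps above.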

@zzzckck changed the title from "fix: discovery AyncFilter deadlock on shutdown (#3347)" to "fix: discovery AyncFilter deadlock on shutdown" on Sep 10, 2025
@cskiraly self-requested a review on September 29, 2025 08:06
@cskiraly (Contributor) left a comment

> The root cause of the hang is the extra slot, which was designed to ensure the goroutines started by AsyncFilter(...) can finish. This PR fixes it by making sure that extra slot can always be released when the node shuts down.

I don't see how the extra slot could cause the hang here. I think it should be either the check or the internal iterator. Still, the change is good and it seems to solve the issue.
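
A hedged thought experiment in the same spirit (again with invented names, not the real code) illustrates this point: if the per-node check itself blocks forever, the goroutine that holds a slot never gives it back, so a drain-only Close cannot finish no matter how many spare slots exist.

```go
// Hypothetical demonstration, not go-ethereum code: a check that never
// returns keeps its slot, so draining all slots in Close cannot complete.
package main

import (
	"fmt"
	"time"
)

func main() {
	slots := make(chan struct{}, 2) // one worker plus one extra slot
	for i := 0; i < cap(slots); i++ {
		slots <- struct{}{}
	}

	// Stand-in for a check that never completes once the node is shutting down.
	blockedCheck := func() { select {} }

	go func() {
		<-slots        // the worker takes a slot...
		blockedCheck() // ...and never returns, so the slot is never released
		slots <- struct{}{}
	}()

	time.Sleep(10 * time.Millisecond)

	// Drain-only "Close": wants every slot back before returning.
	done := make(chan struct{})
	go func() {
		for i := 0; i < cap(slots); i++ {
			<-slots
		}
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("closed cleanly")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("Close is stuck: a held slot was never returned")
	}
}
```
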

@cskiraly added this to the 1.16.5 milestone on Oct 2, 2025
@cskiraly merged commit f0dc47a into ethereum:master on Oct 2, 2025
11 of 13 checks passed
Sahil-4555 pushed a commit to Sahil-4555/go-ethereum that referenced this pull request on Oct 12, 2025
atkinsonholly pushed a commit to atkinsonholly/ephemery-geth that referenced this pull request on Nov 24, 2025
prestoalvarez pushed a commit to prestoalvarez/go-ethereum that referenced this pull request on Nov 27, 2025