
add stream metrics #136

Merged
merged 2 commits into from
Apr 14, 2020

Conversation

stefantalpalaru
Contributor

  • just BufferStream and Connection are tracked, for now
  • flag checking is enforced more strictly in close(), since it became clear that instances are closed multiple times
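The tracking described above can be sketched roughly as follows (illustrative Python, not the actual Nim implementation; the gauge name `tracked_instances` is hypothetical):

```python
# Hypothetical sketch: a per-type gauge of live instances, paired with
# an idempotent close() so double closes cannot decrement the gauge twice.
tracked_instances = {}  # type name -> live instance count

class TrackedStream:
    def __init__(self):
        self.closed = False
        name = type(self).__name__
        tracked_instances[name] = tracked_instances.get(name, 0) + 1

    def close(self):
        # Strict flag check: a second close() is a no-op instead of
        # corrupting the metric.
        if self.closed:
            return
        self.closed = True
        tracked_instances[type(self).__name__] -= 1

class BufferStream(TrackedStream):
    pass

class Connection(TrackedStream):
    pass
```

A leak then shows up directly in Grafana as a gauge that rises on activity but never returns to its baseline.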

The two metrics in Grafana, for the bootstrap node, while running eth2_network_simulation and starting/stopping an external node a couple of times:

[Grafana screenshot]

Weird behaviour:

  • initial mismatch: we're starting with 30 Connection instances, but only 25 BufferStream ones. Early leak?
  • strange inversion after starting the external node the first time: why are there more BufferStream instances than Connection ones?
  • Connection leak after killing the external node: 10 Connection instances were not closed
  • large number of connections created the second time the external node was started: why are we opening 10 Connection and BufferStream instances for just one node? Is the number of new connections directly related to the number of blocks synced (the first time, 16 new ones were opened)? Aren't syncing connections supposed to be closed before the peer disconnects?
  • 4 out of 10 new Connection objects are not closed after the external node is stopped

@sinkingsugar
Contributor

sinkingsugar commented Apr 10, 2020

Are we sure that nbc is closing resources properly btw?

Also, while calling close multiple times on already-closed resources should be fine from the outside in production, I would refrain from writing code that relies on this assumption and instead try to fix the multiple close calls in our codebase.
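One way to stay safe in production while still surfacing the offending call sites is to make close() an idempotent no-op that records where the redundant call came from (an illustrative sketch, not project code):

```python
import traceback

class Resource:
    """Hypothetical resource whose close() tolerates, but reports,
    redundant calls so they can be hunted down and fixed."""

    def __init__(self):
        self.closed = False

    def close(self):
        if self.closed:
            # Safe no-op in production, but log the caller's stack so
            # the double-close site can be removed from the codebase.
            print("redundant close() call from:")
            traceback.print_stack()
            return
        self.closed = True
```

Running the test suite with this in place points straight at every caller that closes an already-closed resource.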

@dryajov
Contributor

dryajov commented Apr 10, 2020

initial mismatch: we're starting with 30 Connection instances, but only 25 BufferStream ones. Early leak?

That's normal: there shouldn't be a direct correspondence between Connection and BufferStream instances; the latter are mostly used as virtual streams for mplex channels.

strange inversion after starting the external node the first time: why are there more BufferStream
instances than Connection ones?

This might be a bug.

Connection leak after killing the external node: 10 Connection instances were not closed

NBC needs graceful shutdown; this might very well be due to weird TCP states. Use netstat or an equivalent tool to check for lingering connections, possibly stuck in CLOSE_WAIT or some other state after shutdown.

large number of connections created the second time the external node was started: why are we opening 10 Connection and BufferStream instances for just one node? Is the amount of new connections in direct relation with the number of blocks synced (the first time, 16 new ones were opened)? Aren't syncing connections supposed to be closed before the peer disconnects?

Yes, the Eth2 networking protocol is essentially an RPC protocol: a new stream is opened for each request.
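Under that model, each request maps to a short-lived stream over a long-lived connection, so a burst of sync requests naturally produces a burst of stream objects. A rough Python analogue (all names here are hypothetical, not the nim-libp2p API):

```python
class Stream:
    """One short-lived stream carrying a single request/response."""
    def __init__(self):
        self.closed = False

    def write(self, data):
        pass  # stand-in for sending the request bytes

    def read(self):
        return b"response"  # stand-in for the peer's reply

    def close(self):
        self.closed = True

class Connection:
    """One long-lived connection; each RPC opens a fresh stream."""
    def open_stream(self):
        return Stream()

def rpc_request(conn, payload):
    stream = conn.open_stream()  # new stream per request
    try:
        stream.write(payload)
        return stream.read()
    finally:
        stream.close()  # must close on success, error, and timeout alike
```

So 10 requests against one peer would briefly show 10 stream instances even though there is only a single underlying connection.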

4 out of 10 new Connection objects are not closed after the external node is stopped

Not sure what you mean here: is it after the switch shuts down, or something else?

@zah
Contributor

zah commented Apr 10, 2020

NBC uses the following pattern for managing streams when making a request:

https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/eth2_network.nim#L436-L446

I think the code needs to be tweaked a bit to make it safe in the face of timeouts, and it's affected by the defer issue discovered by @sinkingsugar, but that issue doesn't prevent closing; it just executes the close after makeEth2Request completes.

I'll rework the code a little today while I'm integrating the snappy compression. (I'll improve the timeout-handling logic and will temporarily switch to finally there.)
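The finally-based version of that pattern guarantees the stream is closed before the request helper returns, even when the request itself raises (e.g. on a timeout). A rough Python analogue of the idea (the NBC code is Nim; names here are hypothetical):

```python
def make_request(open_stream, send_and_receive):
    """Open a stream for one request and guarantee it is closed,
    even if the request body raises (e.g. a timeout error)."""
    stream = open_stream()
    try:
        return send_and_receive(stream)
    finally:
        # Unlike a cleanup that only runs after the whole call chain
        # unwinds, this close executes before the caller resumes, so
        # a timed-out request cannot leak its stream.
        stream.close()
```

The key difference from the defer-style issue mentioned above is *when* the close runs: here it is tied to this function's exit, not deferred past it.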

- just BufferStream and Connection are tracked, for now
- flag checking is enforced more strictly in close(), since it became
  clear that instances are closed multiple times
and sort the list