Skip to content

p2p/simulations: Various fixes#15198

Merged
fjl merged 10 commits into
ethereum:masterfrom
ethersphere:p2p-simulations-fixes
Dec 1, 2017
Merged

p2p/simulations: Various fixes#15198
fjl merged 10 commits into
ethereum:masterfrom
ethersphere:p2p-simulations-fixes

Conversation

@lmars
Copy link
Copy Markdown
Contributor

@lmars lmars commented Sep 25, 2017

Taken from ethersphere/swarm#119.

This PR includes a hack to force a re-dial when adding static peers (9184728), as sometimes re-dials don't happen in large simulations.

This was discussed between me, @zelig and @fjl where I proposed adding a separate DialPeer function to the p2p.Server but after attempting to do so it seems to be more work than originally thought. We need to make a decision on whether this hack can be merged as is, and if not then decide how to proceed.

@zelig
Copy link
Copy Markdown
Contributor

zelig commented Sep 29, 2017

@fjl please advise on this. what should happen with the HACK commit?

@fjl
Copy link
Copy Markdown
Contributor

fjl commented Nov 9, 2017

Sorry, I was super busy. I don't like the HACK commit.

@lmars
Copy link
Copy Markdown
Contributor Author

lmars commented Nov 10, 2017

Some thoughts from a discussion in the orange-lounge gitter channel about replacing the "hack" commit with something better:


The hive code currently tries to dial peers which it should be connected to but isn't currently, and it does this by calling p2p.Server.AddPeer, but AddPeer doesn't always dial a peer, it just adds the peer to a static list of peers that it believes it should redial at some point in the future, and under some conditions the peer never gets dialled.

What we want is a DialPeer method which just attempts to do a dial every time it is called, and then the hive can manage doing the retries (this is what we discussed with @fjl and he was happy with the plan, I just didn't have the time to implement it).

More longer term, we would really like to have what is outlined in #2254.


zelig and others added 7 commits November 13, 2017 12:26
 * refactor simulations/network connection getters
   to support avoiding simultaneous dials between two peers
   If two peers dial simultaneously, the connection will be dropped
   to help avoid that, we essentially lock the connection object
   with a timestamp which serves as a ban on dialing for
   a period of time (dialBanTimeout).

 * The connection getter InitConn can be wrapped and passed to
   the nodes via adapters.NodeConfig#Reachable field and then
   used by the respective services when they initiate connections.

   This massively stablise the emerging connectivity when running
   with hundreds of nodes bootstrapping a network

 * introduce EnableMsgEvents boolean field in NodeConfig
 rlpx tries to send discreason to disconnected peer
 if the connection is net.Pipe (in-memory simulation)
 it hangs forever, since net.Pipe does not implement
 a write deadline. This commit adds error checking
 on the SetWriteDeadline call and only tries to send
 the disconnect reason message if there is no error
To support debugging in-memory network simulations when
multiple peers are logging
* SetupConn now returns error
* dial checks the error and calls resolve on failed dial only
  the server refuses to redial static peers if using
  dialstate.addstatic call

  instead the dialtask is directly appended to the queue
  this fixes the no redial problem but is clearly a hack
@nonsense
Copy link
Copy Markdown
Member

nonsense commented Nov 13, 2017

There are a few todos on this PR:

  • srv.log is nil in tests, causing them to crash
  • net.Pipe does not support write deadline, causing rlpx_test.go to fail (because of 3348b90). Switching to tcpPipe appears to be flaky - working most of the time, but also crashing on occasion.
--- FAIL: TestProtocolHandshake (0.01s)
        rlpx_test.go:216: error receiving disconnect: read tcp 127.0.0.1:53210->127.0.0.1:53209: use of closed network connection
FAIL
FAIL    github.com/ethereum/go-ethereum/p2p     1.723s
  • address HACK commit

I will look into addressing these.

@nonsense nonsense force-pushed the p2p-simulations-fixes branch from 2ff442e to 331e154 Compare November 15, 2017 13:44
@nonsense
Copy link
Copy Markdown
Member

Rebased on go-ethereum/master and reverted the HACK commit.

@nonsense
Copy link
Copy Markdown
Member

I have addressed the TODOs mentioned above in the following way:

  1. HACK commit - I don't know why this was necessary in the first place. I believe it was added due to TestDiscoverySimulationExecAdapter (which is not part of this PR) failing, but ultimately the commit is not needed in my opinion.
  2. srv.log is nil in tests, causing them to crash - I initialised the logger for each Server as part of the tests. srv.log was added so that we have a unique server ID per server instance, when running multiple instances in simulation tests.
  3. SetWriteDeadline error handling has been reverted as well, because we use SetWriteDeadline at multiple places, but this is the only place where we check for error. If we want to handle it, we should handle it everywhere (which appears to be out-of-scope for this PR). Furthermore net.Pipe does not support it, so we'd have to amend all tests to use another pipe altogether. I believe the error check was added because of deadlocks in simulation tests, but I think this is independent of the SetWriteDeadline error check.

@nonsense
Copy link
Copy Markdown
Member

@lmars @fjl this is ready for review. The failures in CI appear to be unrelated to this PR.

@fjl fjl merged commit 54aeb8e into ethereum:master Dec 1, 2017
@lmars lmars deleted the p2p-simulations-fixes branch December 1, 2017 14:33
@karalabe karalabe added this to the 1.8.0 milestone Dec 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants