shutdown devp2p connections #9711
Conversation
debris
left a comment
Changes made in this pull request should solve the issue described in #9656 and possibly several other issues related to parity sync. Unfortunately there are no unit tests for host.rs, and I believe that in its current state it is almost untestable. Please test it manually before merging; in the meantime I'll create an issue to refactor this code.
```diff
 Err(()) => {
     trace!(target: "sync", "{}: Got bad snapshot chunk", peer_id);
-    io.disconnect_peer(peer_id);
+    io.disable_peer(peer_id);
```
Unrelated issue: if the chunk is bad, we should disable the peer (disconnect + mark as bad).
Are we sure this is OK? How are the chunks requested? What if the peer just sent a chunk from a different snapshot and we don't differentiate that at the request level?
We can't request 2 different snapshots anyway, so I guess this should be OK.
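The distinction drawn above (`disable` = disconnect + mark as bad) can be sketched with stand-in types. All names below are hypothetical simplifications, not the actual ethcore-sync API:

```rust
use std::collections::HashSet;

type PeerId = usize;

// Hypothetical stand-in for the sync IO context.
#[derive(Default)]
struct PeerSet {
    connected: HashSet<PeerId>,
    bad: HashSet<PeerId>,
}

impl PeerSet {
    // Drop the connection, but allow the peer to reconnect later.
    fn disconnect_peer(&mut self, peer: PeerId) {
        self.connected.remove(&peer);
    }

    // Drop the connection *and* remember the peer as bad, e.g. after
    // it served a bad snapshot chunk.
    fn disable_peer(&mut self, peer: PeerId) {
        self.disconnect_peer(peer);
        self.bad.insert(peer);
    }
}

fn main() {
    let mut peers = PeerSet::default();
    peers.connected.insert(42);
    peers.disable_peer(42);
    assert!(!peers.connected.contains(&42));
    assert!(peers.bad.contains(&42)); // won't be dialed again
    println!("ok");
}
```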
```diff
 impl GenericSocket for TcpStream {
     fn shutdown(&self) -> io::Result<()> {
         self.shutdown(Shutdown::Both)
     }
```

We never shut down the TCP stream.
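For reference, the standard library's `TcpStream::shutdown(Shutdown::Both)` closes both halves of the connection, so the peer sees an orderly EOF and our side can leave CLOSE_WAIT rather than holding the socket open. A self-contained loopback demo (not the devp2p code):

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};

// Connects a loopback pair, shuts down the client side, and returns
// (bytes_read_by_peer_at_eof, whether_write_after_shutdown_failed).
fn shutdown_demo() -> std::io::Result<(usize, bool)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let mut client = TcpStream::connect(addr)?;
    let (mut server, _) = listener.accept()?;

    // Close both directions: the peer observes EOF, and further
    // writes on our side fail instead of silently queueing.
    client.shutdown(Shutdown::Both)?;

    let mut buf = [0u8; 16];
    let n = server.read(&mut buf)?; // 0 bytes: orderly EOF
    let write_failed = client.write(b"ping").is_err();
    Ok((n, write_failed))
}

fn main() -> std::io::Result<()> {
    let (n, write_failed) = shutdown_demo()?;
    assert_eq!(n, 0);
    assert!(write_failed);
    println!("ok");
    Ok(())
}
```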
```diff
-fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>, remote: bool) {
+fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
```
I removed the `remote` flag, as I believe it was confusing and often used incorrectly.
E.g. `fn stop` was calling it for all nodes with `remote` set to `true`, and because of that we were calling `note_failure` for all of them.
So now we are deregistering the stream every time we call `kill_connection`. Previously it was done only if the session was done. It seems that deregister will remove the entry from `self.sessions`, but only when the session is expired - otherwise we will leak the value in the HashMap forever.
The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the WAIT state?).
Second issue: now we don't `note_failure` on any of the previous `kill_connection(_, _, true)` calls - to avoid breaking stuff we should still review all call sites of that function to check whether `note_failure` should be there.
Yeah, it seems strange to remove `remote` just because it was used wrongly. Maybe just rename the variable? Or have a `deregister` bool and a `note_failure` one.
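That suggestion can be sketched with stand-in types (everything below is hypothetical, not the PR's host.rs): explicit `note_failure` and `deregister` booleans make each call site state its intent instead of overloading a single `remote` flag.

```rust
use std::collections::HashSet;

type StreamToken = usize;

// Hypothetical stand-in: the real host holds sessions, handlers and
// an IO context.
#[derive(Default)]
struct Host {
    failed_nodes: HashSet<StreamToken>, // nodes we noted a failure for
    registered: HashSet<StreamToken>,   // streams known to the event loop
}

impl Host {
    fn note_failure(&mut self, token: StreamToken) {
        self.failed_nodes.insert(token);
    }

    // Two explicit flags instead of an ambiguous `remote: bool`.
    fn kill_connection(&mut self, token: StreamToken, note_failure: bool, deregister: bool) {
        if note_failure {
            self.note_failure(token);
        }
        if deregister {
            self.registered.remove(&token);
        }
    }
}

fn main() {
    let mut host = Host::default();
    host.registered.insert(7);
    // E.g. a connection timeout: the peer failed us, and we drop the stream.
    host.kill_connection(7, true, true);
    assert!(host.failed_nodes.contains(&7));
    assert!(!host.registered.contains(&7));
    println!("ok");
}
```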
> So now we are deregistering the stream every time we call `kill_connection`. Previously it was done only if the session was done. It seems that deregister will remove the entry from `self.sessions`, but only when the session is expired - otherwise we will leak the value in the HashMap forever.
@tomusdrw the leak is happening now. The session is always set to expired and removed from the handlers, but the value is never deregistered if `remote = false`, because `s.done()` returns true as long as the socket is writeable.
In this PR the leak is not happening, because after setting expired to true and unregistering the handlers, we always call deregister.
> The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the WAIT state?).
I believe that the CLOSE_WAIT state is just a symptom of a bigger problem. After we call `kill_connection`, sessions leak in memory and they are already disconnected from the handlers. Remote nodes close the connection to our unresponsive socket, but because there are no handlers connected, we are stuck in the CLOSE_WAIT state.
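A toy illustration of the leak described above, with hypothetical names standing in for the host.rs internals. The exact `done()` semantics are an assumption for this sketch: it comes out false for a session whose socket is still writable, so the old path never deregisters a local (`remote = false`) kill.

```rust
use std::collections::HashMap;

type StreamToken = usize;

// Hypothetical stand-in for a devp2p session slot.
struct Session {
    expired: bool,
    socket_writable: bool,
}

impl Session {
    // Assumption for this sketch: the session only counts as done
    // once its socket is no longer writable.
    fn done(&self) -> bool {
        !self.socket_writable
    }
}

// Old behaviour: deregister (and drop the map entry) only when the
// caller passed `remote` or the session reported itself done.
fn kill_connection_old(sessions: &mut HashMap<StreamToken, Session>, token: StreamToken, remote: bool) {
    let deregister = match sessions.get_mut(&token) {
        Some(s) => {
            s.expired = true;       // disconnected from handlers...
            remote || s.done()      // ...but only sometimes deregistered
        }
        None => false,
    };
    if deregister {
        sessions.remove(&token);
    }
}

// This PR: always deregister, so the entry cannot leak.
fn kill_connection_new(sessions: &mut HashMap<StreamToken, Session>, token: StreamToken) {
    if let Some(s) = sessions.get_mut(&token) {
        s.expired = true;
    }
    sessions.remove(&token);
}

fn main() {
    let mut sessions = HashMap::new();
    sessions.insert(7, Session { expired: false, socket_writable: true });

    // Local kill of a still-writable socket under the old logic:
    // the entry stays in the map forever.
    kill_connection_old(&mut sessions, 7, false);
    assert!(sessions.contains_key(&7)); // leaked

    kill_connection_new(&mut sessions, 7);
    assert!(!sessions.contains_key(&7));
    println!("ok");
}
```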
```rust
    }
}
if deregister {
    io.deregister_stream(token).unwrap_or_else(|e| debug!("Error deregistering stream: {:?}", e));
```
`deregister` set to `false` led to an incorrect state of the application: a connection that was supposed to be killed was never killed and occupied an entry in the sessions hashmap.
```diff
 trace!(target: "network", "Hup: {}", stream);
 match stream {
-    FIRST_SESSION ... LAST_SESSION => self.connection_closed(stream, io),
+    FIRST_SESSION ... LAST_SESSION => self.kill_connection(stream, io),
```
It would make sense to `note_failure` here IMHO.
```diff
 if kill {
-    self.kill_connection(token, io, true);
+    self.kill_connection(token, io);
```
In some cases where `kill = true` we had already noted a failure, while in others we had not. I'll fix it.
```diff
 fn connection_timeout(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
     trace!(target: "network", "Connection timeout: {}", token);
-    self.kill_connection(token, io, true)
+    self.kill_connection(token, io)
```
IMHO we should note_failure
debris
left a comment
I agree that the logic here is extremely fragile. Apart from the leaked sessions, I believe that we often call `note_failure` when we should not. Because the changes in this pull request are quite complex and controversial, I'll try to prepare tests proving that they work.
andresilva
left a comment
Overall LGTM, but there are a couple of places where we need to `note_failure`.
```diff
 for p in to_kill {
     trace!(target: "network", "Ping timeout: {}", p);
-    self.kill_connection(p, io, true);
+    self.kill_connection(p, io);
```
Should we note_failure here as well?
Are you still working on this @debris?
fixes #9656