shutdown devp2p connections #9711
Conversation
debris
left a comment
Changes made in this pull request should solve the issue described in #9656 and possibly several other issues related to parity sync. Unfortunately there are no unit tests for host.rs, and I believe that in its current state it is almost untestable. Please test it manually before merging; in the meantime I'll create an issue to refactor this code.
```diff
 Err(()) => {
     trace!(target: "sync", "{}: Got bad snapshot chunk", peer_id);
-    io.disconnect_peer(peer_id);
+    io.disable_peer(peer_id);
```
Unrelated issue: if the chunk is bad, we should disable the peer (disconnect + mark as bad).
Are we sure this is OK? How are the chunks requested? What if the peer just sent a chunk from a different snapshot and we don't differentiate that at the request level?
We can't request 2 different snapshots anyway, so I guess this should be OK.
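The distinction drawn above (`disable` = disconnect + mark as bad) can be sketched with stand-in types. All names below are hypothetical simplifications, not the actual ethcore-sync API:

```rust
use std::collections::HashSet;

type PeerId = usize;

// Hypothetical stand-in for the sync IO context.
#[derive(Default)]
struct PeerSet {
    connected: HashSet<PeerId>,
    bad: HashSet<PeerId>,
}

impl PeerSet {
    // Drop the connection, but allow the peer to reconnect later.
    fn disconnect_peer(&mut self, peer: PeerId) {
        self.connected.remove(&peer);
    }

    // Drop the connection *and* remember the peer as bad, e.g. after
    // it served a bad snapshot chunk.
    fn disable_peer(&mut self, peer: PeerId) {
        self.disconnect_peer(peer);
        self.bad.insert(peer);
    }
}

fn main() {
    let mut peers = PeerSet::default();
    peers.connected.insert(42);
    peers.disable_peer(42);
    assert!(!peers.connected.contains(&42));
    assert!(peers.bad.contains(&42)); // won't be dialed again
    println!("ok");
}
```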
```diff
 impl GenericSocket for TcpStream {
     fn shutdown(&self) -> io::Result<()> {
         self.shutdown(Shutdown::Both)
     }
```

We never shut down the TCP stream.
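For reference, the standard library's `TcpStream::shutdown(Shutdown::Both)` closes both halves of the connection, so the peer sees an orderly EOF and our side can leave CLOSE_WAIT rather than holding the socket open. A self-contained loopback demo (not the devp2p code):

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};

// Connects a loopback pair, shuts down the client side, and returns
// (bytes_read_by_peer_at_eof, whether_write_after_shutdown_failed).
fn shutdown_demo() -> std::io::Result<(usize, bool)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let mut client = TcpStream::connect(addr)?;
    let (mut server, _) = listener.accept()?;

    // Close both directions: the peer observes EOF, and further
    // writes on our side fail instead of silently queueing.
    client.shutdown(Shutdown::Both)?;

    let mut buf = [0u8; 16];
    let n = server.read(&mut buf)?; // 0 bytes: orderly EOF
    let write_failed = client.write(b"ping").is_err();
    Ok((n, write_failed))
}

fn main() -> std::io::Result<()> {
    let (n, write_failed) = shutdown_demo()?;
    assert_eq!(n, 0);
    assert!(write_failed);
    println!("ok");
    Ok(())
}
```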
```diff
-fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>, remote: bool) {
+fn kill_connection(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
```
I removed the `remote` flag, as I believe it was confusing and often used incorrectly.
E.g. `fn stop` was calling it for all nodes with `remote` set to `true`, and because of that we were calling `note_failure` for all of them.
So now we are deregistering the stream every time we call `kill_connection`. Previously it was done only if the session was done. It seems that deregister will remove the entry from `self.sessions`, but only when the session is expired - otherwise we will leak the value in the HashMap forever.
The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the WAIT state?).
Second issue: now we don't `note_failure` on any of the previous `kill_connection(_, _, true)` calls - to avoid breaking stuff we should still review all call sites of that function to check whether `note_failure` should be there.
Yeah, it seems strange to remove `remote` just because it was used wrongly. Maybe just rename the variable? Or have a `deregister` bool and a `note_failure` one.
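That suggestion can be sketched with stand-in types (everything below is hypothetical, not the PR's host.rs): explicit `note_failure` and `deregister` booleans make each call site state its intent instead of overloading a single `remote` flag.

```rust
use std::collections::HashSet;

type StreamToken = usize;

// Hypothetical stand-in: the real host holds sessions, handlers and
// an IO context.
#[derive(Default)]
struct Host {
    failed_nodes: HashSet<StreamToken>, // nodes we noted a failure for
    registered: HashSet<StreamToken>,   // streams known to the event loop
}

impl Host {
    fn note_failure(&mut self, token: StreamToken) {
        self.failed_nodes.insert(token);
    }

    // Two explicit flags instead of an ambiguous `remote: bool`.
    fn kill_connection(&mut self, token: StreamToken, note_failure: bool, deregister: bool) {
        if note_failure {
            self.note_failure(token);
        }
        if deregister {
            self.registered.remove(&token);
        }
    }
}

fn main() {
    let mut host = Host::default();
    host.registered.insert(7);
    // E.g. a connection timeout: the peer failed us, and we drop the stream.
    host.kill_connection(7, true, true);
    assert!(host.failed_nodes.contains(&7));
    assert!(!host.registered.contains(&7));
    println!("ok");
}
```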
> So now we are deregistering the stream every time we call `kill_connection`. Previously it was done only if the session was done. It seems that deregister will remove the entry from `self.sessions`, but only when the session is expired - otherwise we will leak the value in the HashMap forever.
@tomusdrw the leak is happening now. The session is always set to expired and removed from the handlers, but the value is never deregistered if `remote = false`, because `s.done()` returns true as long as the socket is writeable.
In this PR the leak is not happening, because after setting expired to true and unregistering the handlers, we always call deregister.
> The code is really fragile here, and I think we should be careful with removing stuff that just "seems used incorrectly", especially when it has been running for a couple of years and we are not really sure if there are any bugs in that code (i.e. is this strictly related to TCP streams hanging in the WAIT state?).
I believe that the CLOSE_WAIT state is just a symptom of a bigger problem. After we call `kill_connection`, sessions leak in memory and they are already disconnected from the handlers. Remote nodes close the connection to our unresponsive socket, but because there are no handlers connected, we are stuck in the CLOSE_WAIT state.
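A toy illustration of the leak described above, with hypothetical names standing in for the host.rs internals. The exact `done()` semantics are an assumption for this sketch: it comes out false for a session whose socket is still writable, so the old path never deregisters a local (`remote = false`) kill.

```rust
use std::collections::HashMap;

type StreamToken = usize;

// Hypothetical stand-in for a devp2p session slot.
struct Session {
    expired: bool,
    socket_writable: bool,
}

impl Session {
    // Assumption for this sketch: the session only counts as done
    // once its socket is no longer writable.
    fn done(&self) -> bool {
        !self.socket_writable
    }
}

// Old behaviour: deregister (and drop the map entry) only when the
// caller passed `remote` or the session reported itself done.
fn kill_connection_old(sessions: &mut HashMap<StreamToken, Session>, token: StreamToken, remote: bool) {
    let deregister = match sessions.get_mut(&token) {
        Some(s) => {
            s.expired = true;       // disconnected from handlers...
            remote || s.done()      // ...but only sometimes deregistered
        }
        None => false,
    };
    if deregister {
        sessions.remove(&token);
    }
}

// This PR: always deregister, so the entry cannot leak.
fn kill_connection_new(sessions: &mut HashMap<StreamToken, Session>, token: StreamToken) {
    if let Some(s) = sessions.get_mut(&token) {
        s.expired = true;
    }
    sessions.remove(&token);
}

fn main() {
    let mut sessions = HashMap::new();
    sessions.insert(7, Session { expired: false, socket_writable: true });

    // Local kill of a still-writable socket under the old logic:
    // the entry stays in the map forever.
    kill_connection_old(&mut sessions, 7, false);
    assert!(sessions.contains_key(&7)); // leaked

    kill_connection_new(&mut sessions, 7);
    assert!(!sessions.contains_key(&7));
    println!("ok");
}
```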
```rust
    }
}
if deregister {
    io.deregister_stream(token).unwrap_or_else(|e| debug!("Error deregistering stream: {:?}", e));
```
`deregister` set to `false` led to an incorrect state of the application: a connection that was supposed to be killed was never killed and occupied an entry in the sessions hashmap.
```diff
 trace!(target: "network", "Hup: {}", stream);
 match stream {
-    FIRST_SESSION ... LAST_SESSION => self.connection_closed(stream, io),
+    FIRST_SESSION ... LAST_SESSION => self.kill_connection(stream, io),
```
It would make sense to `note_failure` here IMHO.
```diff
 if kill {
-    self.kill_connection(token, io, true);
+    self.kill_connection(token, io);
```
In some cases where `kill = true` we had already noted a failure, while in others we had not. I'll fix it.
```diff
 fn connection_timeout(&self, token: StreamToken, io: &IoContext<NetworkIoMessage>) {
     trace!(target: "network", "Connection timeout: {}", token);
-    self.kill_connection(token, io, true)
+    self.kill_connection(token, io)
```
IMHO we should note_failure
debris
left a comment
I agree that the logic here is extremely fragile. Apart from the leaked sessions, I believe that we often call `note_failure` when we should not. Because the changes in this pull request are quite complex and controversial, I'll try to prepare tests proving that they work.
andresilva
left a comment
Overall LGTM, but there are a couple of places where we need to `note_failure`.
```diff
 for p in to_kill {
     trace!(target: "network", "Ping timeout: {}", p);
-    self.kill_connection(p, io, true);
+    self.kill_connection(p, io);
```
Should we note_failure here as well?
Are you still working on this @debris?
fixes #9656