txhashset archive save to file fail (connection closed) #2929

Closed
antiochp opened this issue Jun 28, 2019 · 8 comments

@antiochp
Member

Added some logging to the txhashset download so we can track progress via the log file (on the 2.x.x branch) -

20190628 09:40:09.321 DEBUG grin_p2p::peer - Asking 67.207.70.122:3414 for txhashset archive at 231880 000018532638.
20190628 09:40:14.687 DEBUG grin_p2p::protocol - handle_payload: txhashset archive for 000018532638 at 231880. size=253592404
20190628 09:40:25.693 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 2976000/253592404
20190628 09:40:36.748 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 9072000/253592404
20190628 09:40:47.769 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 16944000/253592404
20190628 09:40:58.829 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 24384000/253592404
...
20190628 09:49:02.918 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 238656000/253592404
20190628 09:49:14.008 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 245472000/253592404
20190628 09:49:25.148 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 250128000/253592404
20190628 09:49:36.447 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 251232000/253592404
20190628 09:49:38.456 ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" })

I have seen the above scenario a few times now. We download what appears to be almost all the bytes and then the connection is dropped.

It seems like too much of a coincidence that we get so close to receiving the whole file (roughly 99% of the way there, on multiple occasions) and then fail to save it with a "connection closed".

I suspect we actually have all the bytes and that the connection is getting closed on the sending side as soon as it has sent everything over (for whatever reason).
We must be able to do something smarter on the receiving side to handle this more robustly.
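
For illustration only, here is a minimal sketch of what a more tolerant receive path might look like, assuming we count the bytes that have actually arrived and only treat the reset as fatal when the count falls short. This is not the actual grin_p2p code; save_archive and its signature are hypothetical.

use std::cmp;
use std::io::{self, Read, Write};

// Hypothetical sketch, not the actual grin_p2p implementation: stream `len`
// bytes from `stream` into `writer`, tolerating the sender closing the
// connection as long as every expected byte has already arrived.
fn save_archive<R: Read, W: Write>(stream: &mut R, writer: &mut W, len: u64) -> io::Result<u64> {
    let mut buf = [0u8; 8000];
    let mut received: u64 = 0;
    while received < len {
        let want = cmp::min(buf.len() as u64, len - received) as usize;
        match stream.read(&mut buf[..want]) {
            Ok(0) => break, // sender closed the socket cleanly (EOF)
            Ok(n) => {
                writer.write_all(&buf[..n])?;
                received += n as u64;
            }
            // A reset after the last byte is harmless; whether we really got
            // everything is decided after the loop.
            Err(ref e) if e.kind() == io::ErrorKind::ConnectionReset => break,
            Err(e) => return Err(e),
        }
    }
    if received == len {
        Ok(received)
    } else {
        Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("archive truncated: {}/{} bytes", received, len),
        ))
    }
}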

Related - #2639.

@antiochp antiochp added the bug label Jun 28, 2019
@antiochp antiochp added this to the 2.x.x milestone Jun 28, 2019
@antiochp
Member Author

Here's another instance of it happening -

...
20190628 10:11:09.139 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 261024000/275364910
20190628 10:11:20.201 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 266400000/275364910
20190628 10:11:31.297 DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 271008000/275364910
20190628 10:11:37.064 ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" })

We log the update msg every 10s, so here again we got within 10s of completing the download before the connection mysteriously dropped on the other end.

I would bet "several minutes worth of Grin" that we actually received the full file and then our connection handling caused it to fail to save successfully.

@hashmap
Contributor

hashmap commented Jun 28, 2019

Can confirm, saw it recently. Will also check how we handle it.

hashmap added a commit to cyclefortytwo/grin that referenced this issue Jun 28, 2019
We use a buffered SyncChannel (buffer size is 10) to store messages to
be sent to a remote peer. In the send impl we use try_send, which returns an
error if the channel is closed (peer is disconnected) or the buffer is full.
Such an error leads to dropping the peer.

When we send a txhashset archive, the peer's thread is busy sending it
and can't send other messages, e.g. pings. If the network connection is
slow, this may lead to a channel send error and hence the peer being dropped.

We could increase the buffer, but we already use the SyncChannel like a
regular channel (with an unlimited buffer), so it may make more sense to
switch to one. It's unlikely that a peer can buffer too many messages, so
wasting memory on preallocating a big buffer is not justified.

Addresses mimblewimble#2929
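
For reference, here is a standalone illustration of the failure mode this commit message describes, using std::sync::mpsc directly. This is only a sketch; the real peer code wraps its channel and connection handling differently, and the values below are made up for demonstration.

use std::sync::mpsc::{channel, sync_channel, TrySendError};

fn main() {
    // Bounded channel, as in the current code: try_send fails once the
    // 10-slot buffer is full, and in the peer code that error leads to
    // the peer being dropped.
    let (bounded_tx, _bounded_rx) = sync_channel::<u32>(10);
    for i in 0..12 {
        if let Err(TrySendError::Full(msg)) = bounded_tx.try_send(i) {
            println!("buffer full at msg {}: peer would be dropped here", msg);
        }
    }

    // Unbounded alternative: send only fails if the receiver is gone, so a
    // temporarily slow reader cannot trigger a spurious peer drop.
    let (unbounded_tx, _unbounded_rx) = channel::<u32>();
    for i in 0..12 {
        unbounded_tx.send(i).expect("receiver still alive");
    }
}
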
@antiochp
Member Author

antiochp commented Jun 28, 2019

@hashmap My understanding of our current threading model is as follows (please correct me if this is wrong) -

  • bunch of peer threads receiving msgs from our peers
  • some arbitrary thread sending data out to peers (broadcast of block or tx etc.)
    • we use the send_channel here to get the data to the necessary peer thread

We do not use the individual peer threads to actually send data out to our peers.
But we may receive a tx from one peer and then use the thread corresponding to that peer to broadcast out to all our other peers.

But in the case of msgs that have a response, like the txhashset request, we do use the same thread to send data back to that peer.
So we receive a txhashset request on the thread corresponding to a particular peer and then we use that same thread to "reply" to that peer, effectively locking that thread up for an extended period of time (as you mention in #2930).

I wonder if we should consider spawning a separate thread for long running "replies" like the txhashset response?

Edit: This is not correct. We have the send_channel in there - so we send out via the network using the peer thread.
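
Regardless of where exactly the send happens, the idea floated above (handing long-running replies off to a dedicated thread so the peer's thread stays responsive) might look roughly like the sketch below. This is hypothetical and not the actual grin_p2p design; reply_with_archive is made up for illustration.

use std::io::Write;
use std::net::TcpStream;
use std::thread;

// Hypothetical sketch: stream a large reply (e.g. the txhashset archive)
// from its own thread so the peer's read/write loop can keep handling
// pings and other messages in the meantime.
fn reply_with_archive(mut stream: TcpStream, archive: Vec<u8>) -> thread::JoinHandle<std::io::Result<()>> {
    thread::spawn(move || -> std::io::Result<()> {
        for chunk in archive.chunks(8000) {
            stream.write_all(chunk)?;
        }
        stream.flush()
    })
}

In practice the archive would presumably be streamed from disk rather than held in memory; the point here is only the thread hand-off.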

hashmap added a commit to cyclefortytwo/grin that referenced this issue Jun 28, 2019
When we send a txhashset archive, the peer's thread is busy sending it
and can't send other messages, e.g. pings. If the network connection is
slow, a buffer capacity of 10 may not be enough, hence the peer gets dropped.

Safer attempt to address mimblewimble#2929 in 2.0.0
hashmap added a commit that referenced this issue Jun 30, 2019
When we send a txhashset archive, the peer's thread is busy sending it
and can't send other messages, e.g. pings. If the network connection is
slow, a buffer capacity of 10 may not be enough, hence the peer gets dropped.

Safer attempt to address #2929 in 2.0.0
hashmap added a commit to cyclefortytwo/grin that referenced this issue Jun 30, 2019
When we send a txhashset archive, the peer's thread is busy sending it
and can't send other messages, e.g. pings. If the network connection is
slow, a buffer capacity of 10 may not be enough, hence the peer gets dropped.

Safer attempt to address mimblewimble#2929 in 2.0.0
@mmgen
Contributor

mmgen commented Nov 11, 2019

Related issue: with the introduction of blocking IO in d3dbafa, a failed txhashset archive download can now throw the IBD process into an infinite loop on high-latency connections. It's easily solved by increasing the constant IO_TIMEOUT (in p2p/src/conn.rs) to 10000 milliseconds.

See: #3109
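
For clarity, the suggested fix amounts to bumping a single constant, assuming IO_TIMEOUT is a std::time::Duration as in the 2.x p2p code (the original value and the surrounding code are not reproduced here):

// p2p/src/conn.rs: sketch of the suggested one-line change
use std::time::Duration;

const IO_TIMEOUT: Duration = Duration::from_millis(10_000);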

@mmgen
Contributor

mmgen commented Nov 24, 2019

Still observing this with a freshly compiled client (master@78220feb). When the remote peer is a v1 client, the download gets to 99% and then fails with "connection reset by peer":

    201911DD HH:MM:SS.nnn DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 484608000/489598372
    201911DD HH:MM:SS.nnn ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })

This appears to be a problem with v1 clients, so is obviously unfixable at this point, unless the fix is backported and users agree to upgrade. (Apparently, #2934 didn't help, since these are 2.0.0 clients).

When the remote peer is v2 (v2.1.1), I'm observing even stranger behavior: the download begins normally, but after 30 seconds or so the peer simply stops sending data. The connection remains open and no error is reported; the data flow just stops.

The following debugging code reveals that the blocking is occurring at self.stream.read_exact(), which indicates the problem is with the remote end of the connection:

diff --git a/p2p/src/conn.rs b/p2p/src/conn.rs                                                                                                                            
index 4bfe5bc6..efe683ab 100644
--- a/p2p/src/conn.rs
+++ b/p2p/src/conn.rs
@@ -106,10 +106,12 @@ impl<'a> Message<'a> {
        let mut written = 0;
        while written < len {
            let read_len = cmp::min(8000, len - written);
+           debug!("before read: len: {}, written: {}, read_len: {}", len, written, read_len);
            let mut buf = vec![0u8; read_len];
            self.stream.read_exact(&mut buf[..])?;
            writer.write_all(&mut buf)?;
            written += read_len;
+           debug!("after read:  len: {}, written: {}, read_len: {}", len, written, read_len);
        }
        Ok(written)
    }

Both errors have been observed multiple (10x) times, and the behavior is always the same.

If the latter error has been fixed in v3.0.0, then please ignore. There are very few 3.0.0 nodes active on mainnet, so I haven't been able to test against them.

@quentinlesceller
Member

@antiochp any update on that issue? Is this still relevant?

@mmgen
Contributor

mmgen commented May 11, 2020

@antiochp any update on that issue? Is this still relevant?

Don't know. I haven't fired up a fresh Grin node in a while. Perhaps I'll do so in the coming days.

@antiochp
Member Author

I'm going to close this for now. I believe this is fixed in 3.x.x.
If we see this behavior or something similar we can open a new issue.
