
Increase IO_TIMEOUT to allow nodes on high-latency connections to sync #3109

Merged: 1 commit merged into mimblewimble:master on Nov 13, 2019

Conversation

@mmgen (Contributor) commented Nov 11, 2019

Commit d3dbafa "Use blocking IO in P2P to reduce CPU load" (merged into v2.1.0) introduced the constant IO_TIMEOUT, setting it to 1 second.

On nodes with high-latency connections, this short timeout causes the txhashset archive download during step 2 of the IBD process to invariably fail before it completes. Since there's no mechanism for resuming a failed download, this means the node gets stuck at this stage and never syncs.

Increasing IO_TIMEOUT to 10 seconds solves the issue on my node; others might suggest a more optimal value for the constant.
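For illustration, a minimal sketch of the kind of change involved. The constant name comes from the commit cited above; the surrounding code is assumed for the example and is not copied from grin's p2p crate:

    // Hypothetical sketch -- not a verbatim copy of the patch.
    use std::io;
    use std::net::TcpStream;
    use std::time::Duration;

    // Timeout applied to blocking reads/writes on a peer connection,
    // raised from 1 second so that long transfers such as the txhashset
    // archive survive high-latency links.
    const IO_TIMEOUT: Duration = Duration::from_secs(10);

    // Apply the timeout to a newly established peer stream.
    fn configure_stream(stream: &TcpStream) -> io::Result<()> {
        stream.set_read_timeout(Some(IO_TIMEOUT))?;
        stream.set_write_timeout(Some(IO_TIMEOUT))?;
        Ok(())
    }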

@DavidBurkett (Contributor)

Does this change affect shutdown times?

@mmgen (Contributor, Author) commented Nov 11, 2019

Doesn't seem to. Haven't noticed any difference in my node's behavior.

@hashmap (Contributor) commented Nov 11, 2019

Yes, it would increase shutdown time. We also need to figure out why it fails the download: a timeout is not an exceptional situation and doesn't stop reading from a peer. Perhaps we partially read, get a timeout, and drop the buffer; I'll check it. This would be a short-term fix.
In the long term we need to think about using async IO on top of tokio or async-std. We used tokio for p2p and got rid of it in early 2018 because it was hard to maintain and reason about (we still have tokio in the api layer and it's a pain), but the latest stable Rust brings us async/.await, which makes async IO much more pleasant to work with. However, this change would require some redesign in the message-processing area: we need to separate network IO from header/block processing.
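For illustration, a minimal sketch of the retry behavior described above, i.e. treating a read timeout as non-fatal and keeping the partially filled buffer. The function name and structure are hypothetical, not grin's actual read path:

    // Hypothetical sketch -- not grin's actual read path.
    use std::io::{self, Read};

    // Fill `buf` completely, treating a read timeout (WouldBlock on Unix,
    // TimedOut on Windows) as a retryable condition rather than an error,
    // so the partially read data is kept instead of being dropped.
    fn read_exact_retrying<R: Read>(stream: &mut R, buf: &mut [u8]) -> io::Result<()> {
        let mut filled = 0;
        while filled < buf.len() {
            match stream.read(&mut buf[filled..]) {
                Ok(0) => return Err(io::ErrorKind::UnexpectedEof.into()),
                Ok(n) => filled += n,
                Err(ref e)
                    if e.kind() == io::ErrorKind::WouldBlock
                        || e.kind() == io::ErrorKind::TimedOut =>
                {
                    // Timeout: keep what we have so far and try again.
                    continue;
                }
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }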

@mmgen (Contributor, Author) commented Nov 11, 2019

Agree this is a short-term fix. But it does solve the problem and doesn't seem to cause any serious issues of its own. The downloads were repeatedly failing partway through with EAGAIN. Here's a snippet from the log file:

    DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 2256000/460392662
    DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 9888000/460392662
    ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" })

Suggestion: the fix could be made more fine-grained by using the longer timeout only for the txhashset download. Other connections thus wouldn't be affected.
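For illustration, a hedged sketch of that suggestion: widen the read timeout only around the archive transfer and restore the default afterwards. The constant values and the helper name are hypothetical:

    // Hypothetical sketch; constant values and the helper are illustrative.
    use std::io;
    use std::net::TcpStream;
    use std::time::Duration;

    const IO_TIMEOUT: Duration = Duration::from_secs(1);
    const ARCHIVE_IO_TIMEOUT: Duration = Duration::from_secs(10);

    // Run `f` with a widened read timeout for the txhashset archive
    // transfer, then restore the default used for ordinary messages.
    fn with_archive_timeout<T>(
        stream: &TcpStream,
        f: impl FnOnce(&TcpStream) -> io::Result<T>,
    ) -> io::Result<T> {
        stream.set_read_timeout(Some(ARCHIVE_IO_TIMEOUT))?;
        let result = f(stream);
        stream.set_read_timeout(Some(IO_TIMEOUT))?;
        result
    }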

@hashmap (Contributor) commented Nov 11, 2019

Interesting. And I was wrong about the longer shutdown: it would have been the case with the original PR, but we then changed the behavior, and now we just close the remaining threads after a short period of time.

@mmgen (Contributor, Author) commented Nov 11, 2019

I've been unable to observe any difference in the patched node's behavior, aside from the fact that it now syncs :-)

@hashmap merged commit 928097a into mimblewimble:master on Nov 13, 2019
@mmgen deleted the io-timeout-patch branch on November 13, 2019
@antiochp added this to the 3.0.0 milestone on Dec 11, 2019
@antiochp mentioned this pull request on Dec 11, 2019