
Increase IO_TIMEOUT to allow nodes on high-latency connections to sync #3109

Merged: 1 commit merged into mimblewimble:master on Nov 13, 2019

Conversation

@mmgen (Contributor) commented Nov 11, 2019

Commit d3dbafa "Use blocking IO in P2P to reduce CPU load" (merged into v2.1.0) introduced the constant IO_TIMEOUT, setting it to 1 second.

On nodes with high-latency connections, this short timeout causes the txhashset archive download during step 2 of the IBD process to invariably fail before it completes. Since there's no mechanism for resuming a failed download, this means the node gets stuck at this stage and never syncs.

Increasing IO_TIMEOUT to 10 seconds solves the issue on my node; others might suggest a more optimal value for the constant.
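For illustration, a minimal sketch of the kind of change involved. The constant name comes from the commit cited above; the surrounding code is assumed for the example and is not copied from grin's p2p crate:

    // Hypothetical sketch -- not a verbatim copy of the patch.
    use std::io;
    use std::net::TcpStream;
    use std::time::Duration;

    // Timeout applied to blocking reads/writes on a peer connection,
    // raised from 1 second so that long transfers such as the txhashset
    // archive survive high-latency links.
    const IO_TIMEOUT: Duration = Duration::from_secs(10);

    // Apply the timeout to a newly established peer stream.
    fn configure_stream(stream: &TcpStream) -> io::Result<()> {
        stream.set_read_timeout(Some(IO_TIMEOUT))?;
        stream.set_write_timeout(Some(IO_TIMEOUT))?;
        Ok(())
    }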

@DavidBurkett (Contributor)

Does this change affect shutdown times?

@mmgen (Contributor, Author) commented Nov 11, 2019

Doesn't seem to. Haven't noticed any difference in my node's behavior.

@hashmap (Contributor) commented Nov 11, 2019

Yes, it would increase shutdown time. We also need to figure out why it fails the download: a timeout is not an exceptional situation and doesn't stop reading from a peer. Perhaps we partially read, get a timeout, and drop the buffer; I'll check it. This would be a short-term fix.
In the long term we need to think about using async IO on top of tokio or async-std. We used tokio for p2p and got rid of it in early 2018 because it was hard to maintain and reason about (we still have tokio in the api layer and it's a pain), but the latest stable Rust brings us async/.await, which makes async IO much more pleasant to work with. However, this change would require some redesign in the message-processing area: we need to separate network IO from header/block processing.
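For illustration, a minimal sketch of the retry behavior described above, i.e. treating a read timeout as non-fatal and keeping the partially filled buffer. The function name and structure are hypothetical, not grin's actual read path:

    // Hypothetical sketch -- not grin's actual read path.
    use std::io::{self, Read};

    // Fill `buf` completely, treating a read timeout (WouldBlock on Unix,
    // TimedOut on Windows) as a retryable condition rather than an error,
    // so the partially read data is kept instead of being dropped.
    fn read_exact_retrying<R: Read>(stream: &mut R, buf: &mut [u8]) -> io::Result<()> {
        let mut filled = 0;
        while filled < buf.len() {
            match stream.read(&mut buf[filled..]) {
                Ok(0) => return Err(io::ErrorKind::UnexpectedEof.into()),
                Ok(n) => filled += n,
                Err(ref e)
                    if e.kind() == io::ErrorKind::WouldBlock
                        || e.kind() == io::ErrorKind::TimedOut =>
                {
                    // Timeout: keep what we have so far and try again.
                    continue;
                }
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }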

@mmgen (Contributor, Author) commented Nov 11, 2019

Agree this is a short-term fix. But it does solve the problem and doesn't seem to cause any serious issues of its own. The downloads were repeatedly failing partway through with EAGAIN. Here's a snippet from the log file:

    DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 2256000/460392662
    DEBUG grin_p2p::protocol - handle_payload: txhashset archive: 9888000/460392662
    ERROR grin_p2p::protocol - handle_payload: txhashset archive save to file fail. err=Connection(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" })

Suggestion: the fix could be made more fine-grained by using the longer timeout only for the txhashset download. Other connections thus wouldn't be affected.
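For illustration, a hedged sketch of that suggestion: widen the read timeout only around the archive transfer and restore the default afterwards. The constant values and the helper name are hypothetical:

    // Hypothetical sketch; constant values and the helper are illustrative.
    use std::io;
    use std::net::TcpStream;
    use std::time::Duration;

    const IO_TIMEOUT: Duration = Duration::from_secs(1);
    const ARCHIVE_IO_TIMEOUT: Duration = Duration::from_secs(10);

    // Run `f` with a widened read timeout for the txhashset archive
    // transfer, then restore the default used for ordinary messages.
    fn with_archive_timeout<T>(
        stream: &TcpStream,
        f: impl FnOnce(&TcpStream) -> io::Result<T>,
    ) -> io::Result<T> {
        stream.set_read_timeout(Some(ARCHIVE_IO_TIMEOUT))?;
        let result = f(stream);
        stream.set_read_timeout(Some(IO_TIMEOUT))?;
        result
    }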

@hashmap (Contributor) commented Nov 11, 2019

Interesting. And I was wrong about the longer shutdown: it would have been the case with the original PR, but we then changed the behavior, and now we just close the remaining threads after a short period of time.

@mmgen (Contributor, Author) commented Nov 11, 2019

I've been unable to observe any difference in the patched node's behavior, aside from the fact that it now syncs :-)

@hashmap merged commit 928097a into mimblewimble:master on Nov 13, 2019
@mmgen deleted the io-timeout-patch branch on November 13, 2019
@antiochp added this to the 3.0.0 milestone on Dec 11, 2019
@antiochp mentioned this pull request on Dec 11, 2019