Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to epoll() on armhf #1089

Closed
kinnison opened this issue Sep 17, 2019 · 8 comments
Closed

Failure to epoll() on armhf #1089

kinnison opened this issue Sep 17, 2019 · 8 comments

Comments

@kinnison
Copy link

Hi,

I, and a friend (@cjwatson), have been trying to diagnose a problem whereby rustup on armhf is failing to connect to a web proxy when using reqwest which is built on hyper and thence tokio and thus mio ultimately.

On amd64 (x86_64) it runs perfectly well, but on armhf it sits for 30s and times out the connect.

By stracing the entire build, eventually we were able to find the following syscall sequence on armhf:

[pid  3517] 06:37:57.516581 futex(0xf933b8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=29, tv_nsec=990974355} <unfinished ...>
[pid  3518] 06:37:57.516671 <... fcntl64 resumed> ) = 0x2 (flags O_RDWR)
[pid  3518] 06:37:57.516762 fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  3518] 06:37:57.516894 connect(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  3518] 06:37:57.521838 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0
[pid  3517] 06:38:27.507984 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

Importantly you can see the futex for the 30s timeout, the connect starting, being added to the epoll, but then nothing occurring for 30s and the futex timing out. The thread which called epoll_ctl() never seems to call anything else and eventually exits at the end of the process.

By comparison, running on amd64 (x86_64):

[pid  3516] 06:46:05.403825 connect(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, 16 <unfinished ...>
[pid  3520] 06:46:05.404022 sched_getaffinity(3520, 32,  <unfinished ...>
[pid  3516] 06:46:05.404049 <... connect resumed> ) = -1 EINPROGRESS (Operation now in progress)
[pid  3516] 06:46:05.404073 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}} <unfinished ...>
[pid  3520] 06:46:05.404089 <... sched_getaffinity resumed> [0, 1, 2, 3]) = 32
[pid  3516] 06:46:05.404113 <... epoll_ctl resumed> ) = 0
[pid  3520] 06:46:05.404124 futex(0x7f3d54026ed0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  3516] 06:46:05.404156 epoll_wait(4, [{EPOLLOUT, {u32=0, u64=0}}], 1024, 0) = 1
[pid  3516] 06:46:05.404195 getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid  3516] 06:46:05.404232 setsockopt(7, SOL_TCP, TCP_NODELAY, [0], 4) = 0
[pid  3516] 06:46:05.404263 getpeername(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, [128->16]) = 0
[pid  3516] 06:46:05.404304 writev(7, [{iov_base="CONNECT static.rust-lang.org:443 HTTP/1.1\r\nHost: static.rust-lang.org:443\r\n\r\n", iov_len=77}], 1) = 77

Where the connect is in progress, gets added to the epoll, and then the same thread enters epoll_wait() succeeds, and the connection proceeds as expected.

I'm unsure if this is mio or tokio at fault, but I figured this was the right place to start.

This is the armhf log: buildlog_snap_ubuntu_bionic_armhf_rust-snap-test_BUILDING.txt.gz
The amd64 log I have, but is 50M so I can't paste it here and I'm not sure it'd help that much.

If this is not the right place, please let me know and I'll re-file appropriately.

Thanks,

D.

@Thomasdezeeuw
Copy link
Collaborator

Some preliminary findings:

The problem code seems to be (around) here: https://github.com/seanmonstar/reqwest/blob/45f67c1dccf30ed82d172e31e5dc6ef2b7acf2ea/src/client.rs#L634-L669. The ClientHandle struct has a timeout field which defaults to 30 seconds (which matches the futex wait).

Mio doesn't use any Futexes, so it's likely coming from: https://github.com/seanmonstar/reqwest/blob/45f67c1dccf30ed82d172e31e5dc6ef2b7acf2ea/src/wait.rs#L96-L98?

[pid  3518] 06:37:57.521838 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0
[pid  3517] 06:38:27.507984 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

Is that the entire log? Because that would mean that Mio would be stuck in epoll_ctl for 30 minutes or makes no other system calls. Neither cases make sense to me, I would expect an epoll_wait call after for pid 3518 (just like on amd64). Any chance you can check the duration of the epoll_ctl call? If its not stuck in the epoll_ctl then my current best guess is that there is something wrong with the user code that comes after.

Version info for the future:
rustup uses reqwest 0.9.14, tokio 0.1.7, mio 0.6.14.

As a side note: why is rewqest using async API's to provide a sync API? Why bother with all that complexity if you're not going to take advantage of the benefits? (genuine question)

@kinnison
Copy link
Author

Thank you @Thomasdezeeuw for that analysis. I agree that reqwest probably is what is introducing the 30s timeout -- Given reqwest's implementation of waiting, could it simply be that on armhf there are insufficient threads in the worker pool so none of the futures/connections can make progress? If so, I suppose I ought to reassign this to reqwest.

@Thomasdezeeuw
Copy link
Collaborator

could it simply be that on armhf there are insufficient threads in the worker pool so none of the futures/connections can make progress?

Maybe, but I don't know for sure. It could really help if you can determine how long the call to epoll_ctl takes, because if we're calling that for 30 minutes it might even be a bug in Linux.

@kinnison
Copy link
Author

Maybe, but I don't know for sure. It could really help if you can determine how long the call to epoll_ctl takes, because if we're calling that for 30 minutes it might even be a bug in Linux.

I've asked the person who made the straces if they can do another with extra timing data, but if the epoll_ctl() was stuck, I'd expect to see it ...unfinished when the futex returned. My guess is that it completed fairly quickly and then the epoll was never waited on.

@kinnison
Copy link
Author

@Thomasdezeeuw From another run, that epoll_ctl() takes no time at all:

[pid  3521] 10:20:23.214229 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0 <0.000066>

So it's got to be something other than that getting stuck

@kinnison
Copy link
Author

@Thomasdezeeuw
Copy link
Collaborator

I don't think this a problem with Mio then. It think it somewhere between reqwest and Tokio, maybe how one uses the API of the other (is reqwest driving Tokio's reactor in some way?). I would normally expect an epoll_wait (from Poll::poll) call, but that is nowhere in the trace. I can't tell where this is going wrong nor do I understand why this is only a problem on armhf.

@kinnison
Copy link
Author

Thanks for your time then, I'll try attacking this from the reqwest side of things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants