Failure to epoll() on armhf #1089

kinnison · 2019-09-17T13:29:58Z

Hi,

I, and a friend (@cjwatson), have been trying to diagnose a problem whereby rustup on armhf is failing to connect to a web proxy when using reqwest which is built on hyper and thence tokio and thus mio ultimately.

On amd64 (x86_64) it runs perfectly well, but on armhf it sits for 30s and times out the connect.

By stracing the entire build, eventually we were able to find the following syscall sequence on armhf:

[pid  3517] 06:37:57.516581 futex(0xf933b8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=29, tv_nsec=990974355} <unfinished ...>
[pid  3518] 06:37:57.516671 <... fcntl64 resumed> ) = 0x2 (flags O_RDWR)
[pid  3518] 06:37:57.516762 fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  3518] 06:37:57.516894 connect(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  3518] 06:37:57.521838 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0
[pid  3517] 06:38:27.507984 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

Importantly you can see the futex for the 30s timeout, the connect starting, being added to the epoll, but then nothing occurring for 30s and the futex timing out. The thread which called epoll_ctl() never seems to call anything else and eventually exits at the end of the process.

By comparison, running on amd64 (x86_64):

[pid  3516] 06:46:05.403825 connect(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, 16 <unfinished ...>
[pid  3520] 06:46:05.404022 sched_getaffinity(3520, 32,  <unfinished ...>
[pid  3516] 06:46:05.404049 <... connect resumed> ) = -1 EINPROGRESS (Operation now in progress)
[pid  3516] 06:46:05.404073 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}} <unfinished ...>
[pid  3520] 06:46:05.404089 <... sched_getaffinity resumed> [0, 1, 2, 3]) = 32
[pid  3516] 06:46:05.404113 <... epoll_ctl resumed> ) = 0
[pid  3520] 06:46:05.404124 futex(0x7f3d54026ed0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  3516] 06:46:05.404156 epoll_wait(4, [{EPOLLOUT, {u32=0, u64=0}}], 1024, 0) = 1
[pid  3516] 06:46:05.404195 getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid  3516] 06:46:05.404232 setsockopt(7, SOL_TCP, TCP_NODELAY, [0], 4) = 0
[pid  3516] 06:46:05.404263 getpeername(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, [128->16]) = 0
[pid  3516] 06:46:05.404304 writev(7, [{iov_base="CONNECT static.rust-lang.org:443 HTTP/1.1\r\nHost: static.rust-lang.org:443\r\n\r\n", iov_len=77}], 1) = 77

Where the connect is in progress, gets added to the epoll, and then the same thread enters epoll_wait() succeeds, and the connection proceeds as expected.

I'm unsure if this is mio or tokio at fault, but I figured this was the right place to start.

This is the armhf log: buildlog_snap_ubuntu_bionic_armhf_rust-snap-test_BUILDING.txt.gz
The amd64 log I have, but is 50M so I can't paste it here and I'm not sure it'd help that much.

If this is not the right place, please let me know and I'll re-file appropriately.

Thanks,

D.

The text was updated successfully, but these errors were encountered:

Thomasdezeeuw · 2019-09-17T14:51:01Z

Some preliminary findings:

The problem code seems to be (around) here: https://github.com/seanmonstar/reqwest/blob/45f67c1dccf30ed82d172e31e5dc6ef2b7acf2ea/src/client.rs#L634-L669. The ClientHandle struct has a timeout field which defaults to 30 seconds (which matches the futex wait).

Mio doesn't use any Futexes, so it's likely coming from: https://github.com/seanmonstar/reqwest/blob/45f67c1dccf30ed82d172e31e5dc6ef2b7acf2ea/src/wait.rs#L96-L98?

[pid  3518] 06:37:57.521838 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0
[pid  3517] 06:38:27.507984 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

Is that the entire log? Because that would mean that Mio would be stuck in epoll_ctl for 30 minutes or makes no other system calls. Neither cases make sense to me, I would expect an epoll_wait call after for pid 3518 (just like on amd64). Any chance you can check the duration of the epoll_ctl call? If its not stuck in the epoll_ctl then my current best guess is that there is something wrong with the user code that comes after.

Version info for the future:
rustup uses reqwest 0.9.14, tokio 0.1.7, mio 0.6.14.

As a side note: why is rewqest using async API's to provide a sync API? Why bother with all that complexity if you're not going to take advantage of the benefits? (genuine question)

kinnison · 2019-09-17T19:17:34Z

Thank you @Thomasdezeeuw for that analysis. I agree that reqwest probably is what is introducing the 30s timeout -- Given reqwest's implementation of waiting, could it simply be that on armhf there are insufficient threads in the worker pool so none of the futures/connections can make progress? If so, I suppose I ought to reassign this to reqwest.

Thomasdezeeuw · 2019-09-18T08:55:17Z

could it simply be that on armhf there are insufficient threads in the worker pool so none of the futures/connections can make progress?

Maybe, but I don't know for sure. It could really help if you can determine how long the call to epoll_ctl takes, because if we're calling that for 30 minutes it might even be a bug in Linux.

kinnison · 2019-09-18T08:58:56Z

Maybe, but I don't know for sure. It could really help if you can determine how long the call to epoll_ctl takes, because if we're calling that for 30 minutes it might even be a bug in Linux.

I've asked the person who made the straces if they can do another with extra timing data, but if the epoll_ctl() was stuck, I'd expect to see it ...unfinished when the futex returned. My guess is that it completed fairly quickly and then the epoll was never waited on.

kinnison · 2019-09-18T12:45:14Z

@Thomasdezeeuw From another run, that epoll_ctl() takes no time at all:

[pid  3521] 10:20:23.214229 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0 <0.000066>

So it's got to be something other than that getting stuck

kinnison · 2019-09-18T12:45:28Z

buildlog_snap_ubuntu_bionic_armhf_rust-snap-test_BUILDING-with-timing.txt.gz

Thomasdezeeuw · 2019-09-18T14:05:43Z

I don't think this a problem with Mio then. It think it somewhere between reqwest and Tokio, maybe how one uses the API of the other (is reqwest driving Tokio's reactor in some way?). I would normally expect an epoll_wait (from Poll::poll) call, but that is nowhere in the trace. I can't tell where this is going wrong nor do I understand why this is only a problem on armhf.

kinnison · 2019-09-18T14:09:14Z

Thanks for your time then, I'll try attacking this from the reqwest side of things.

kinnison closed this as completed Sep 18, 2019

kinnison mentioned this issue Sep 18, 2019

Request never connects on armhf seanmonstar/reqwest#642

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure to epoll() on armhf #1089

Failure to epoll() on armhf #1089

kinnison commented Sep 17, 2019

Thomasdezeeuw commented Sep 17, 2019

kinnison commented Sep 17, 2019

Thomasdezeeuw commented Sep 18, 2019

kinnison commented Sep 18, 2019

kinnison commented Sep 18, 2019

kinnison commented Sep 18, 2019

Thomasdezeeuw commented Sep 18, 2019

kinnison commented Sep 18, 2019

Failure to epoll() on armhf #1089

Failure to epoll() on armhf #1089

Comments

kinnison commented Sep 17, 2019

Thomasdezeeuw commented Sep 17, 2019

kinnison commented Sep 17, 2019

Thomasdezeeuw commented Sep 18, 2019

kinnison commented Sep 18, 2019

kinnison commented Sep 18, 2019

kinnison commented Sep 18, 2019

Thomasdezeeuw commented Sep 18, 2019

kinnison commented Sep 18, 2019