Request never connects on armhf #642

Closed
kinnison opened this issue Sep 18, 2019 · 17 comments
Labels
B-upstream Blocked: upstream. Depends on a dependency to make a change first.

Comments

@kinnison

Hi, this was originally discussed in tokio-rs/mio#1089 where we decided that it probably made sense to migrate the discussion to here.

In brief -- A friend (@cjwatson) and I have been diagnosing a fault in rustup on armhf in Snapcraft's build environment. It seems to sit for 30s trying to connect and then fails. This only seems to happen on armhf -- on other platforms it connects just fine.

An strace of the attempt shows:

[pid  3517] 06:37:57.516581 futex(0xf933b8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=29, tv_nsec=990974355} <unfinished ...>
[pid  3518] 06:37:57.516671 <... fcntl64 resumed> ) = 0x2 (flags O_RDWR)
[pid  3518] 06:37:57.516762 fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  3518] 06:37:57.516894 connect(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  3518] 06:37:57.521838 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0
[pid  3517] 06:38:27.507984 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

Further straces show that the epoll_ctl() call takes microseconds, so it's not actually stuck in it for 30s, but the thread which made the epoll_ctl() call subsequently did nothing. (The trace is attached to the mio bug, so I won't reattach it here.)

Interestingly in that strace we never get to epoll_wait() on armhf.

I had previously assumed it was probably mio at fault, but the discussion there suggests it's more likely in the reqwest/tokio interfacing, so I brought the issue to here to discuss further.

@seanmonstar
Owner

Hm, do you have easy access to an armhf machine so we can work through this together?

The futex wait is because the main thread is parking until the async runtime thread makes progress and returns a Response. If we want to eliminate that as a problem, we could try just running the async example. If that doesn't work, then I'd suspect the issue is lower in the stack, either tokio or mio.
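The parking behaviour described above can be pictured with a minimal, std-only sketch. This is not reqwest's actual implementation: `block_on_with_timeout` is a hypothetical helper, and a plain thread plus channel stands in for the async runtime thread. It only shows why the calling thread sits in a timed wait (the futex in the strace) and gives up when the background thread never produces a result.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: run `work` on a background thread and wait up to
/// `timeout` for its result. Returns `None` if no result arrives in time
/// (or if the background thread dies before sending one).
fn block_on_with_timeout<T, F>(work: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The caller may already have given up, so a failed send is ignored.
        let _ = tx.send(work());
    });
    // The calling thread parks here until a result arrives or the timeout
    // expires, much like the main thread's timed futex wait in the trace.
    rx.recv_timeout(timeout).ok()
}

fn main() {
    // Background work finishes promptly, so the caller gets a value.
    let fast = block_on_with_timeout(|| 42, Duration::from_secs(1));
    println!("fast worker: {:?}", fast);

    // Background work makes no progress within the timeout, so the caller
    // gives up, which is what a 30s connect timeout would look like.
    let stalled = block_on_with_timeout(
        || {
            thread::sleep(Duration::from_millis(500));
            42
        },
        Duration::from_millis(50),
    );
    println!("stalled worker: {:?}", stalled);
}
```

If the async example also hangs, the blocking wrapper is not the culprit, which is why running it is a useful way to split the problem.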

@kinnison
Author

I don't have real armhf hardware to hand to try arbitrary stuff on -- those straces came from the snapcraft build infrastructure itself. I will see if I can replicate the issue running in qemu-user-static on my laptop. If I can, I'll see if that example has similar issues.

@cjwatson

If somebody can work out how to wedge the relevant test code into snapcraft then I can also try running it on our infrastructure.

@kinnison
Author

I have failed to replicate the issue on my x86 laptop using qemu-user, so I imagine we will have to try @cjwatson's idea -- the problem is, I don't know how to go about that. I'll also see if I can fake the number of CPUs reported to rustup, in case something is spawning ncpus threads for the worker pool and that's what's going on.

@kinnison
Author

Even isolating to a 1 CPU VM, I couldn't replicate it on x86_64 with qemu-user so we're going to have to try something else. I am firing up an armhf instance in scaleway (or at least trying to) to see if I can replicate on there.

@seanmonstar added the B-upstream label (Blocked: upstream. Depends on a dependency to make a change first.) on Sep 27, 2019

@kinnison
Author

kinnison commented Oct 2, 2019

I failed to replicate it myself. I wonder if it has something to do with the virtualisation that is done for Snapcraft, combined with something else in the stack of reqwest? @seanmonstar is the upstream label suggesting you've filed another bug elsewhere?

@lnicola

lnicola commented Oct 2, 2019

I have a Raspberry Pi I can try to reproduce this on, if you think it would help.

@seanmonstar
Owner

The upstream label is a guess that it's either in mio or tokio. Neither reqwest nor hyper have conditional code per target.

@kinnison
Author

kinnison commented Oct 2, 2019

Aah, as per the original post, I first discussed this with the mio folks in tokio-rs/mio#1089 and they suggested here. I'm now worried that no one knows what's going on. I'm not sure it'll be platform-specific so much as perhaps an interaction between something "interesting" on armhf and the particular size of the system Snapcraft are using. The oddness was that epoll_ctl() was called, but then the epoll was never checked, which points perhaps at an executor with too few threads?
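To make the "too few threads" hypothesis concrete, here is a small std-only sketch. It is purely illustrative and is not taken from tokio, mio, or reqwest: `Pool`, `Pool::new`, and `Pool::run` are hypothetical names. It shows that if a pool ends up with zero workers (for example because it was sized from a bad CPU count), queued work is accepted but never executed, and the caller only sees a timeout.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size worker pool, for illustration only.
struct Pool {
    queue: mpsc::Sender<Job>,
    // Kept alive so jobs can still be queued when there are no workers.
    _receiver: Arc<Mutex<mpsc::Receiver<Job>>>,
}

impl Pool {
    /// Spawn `workers` threads that pull jobs from a shared queue.
    fn new(workers: usize) -> Pool {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..workers {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The lock is released at the end of this statement,
                // before the job runs.
                let next = rx.lock().unwrap().recv();
                match next {
                    Ok(job) => job(),
                    Err(_) => break, // queue closed: the pool was dropped
                }
            });
        }
        Pool { queue: tx, _receiver: rx }
    }

    /// Queue `work` and wait up to `timeout` for its result.
    fn run<T, F>(&self, work: F, timeout: Duration) -> Result<T, RecvTimeoutError>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            let _ = tx.send(work());
        });
        if self.queue.send(job).is_err() {
            return Err(RecvTimeoutError::Disconnected);
        }
        rx.recv_timeout(timeout)
    }
}

fn main() {
    // With workers available, the job is picked up and completes.
    let healthy = Pool::new(2);
    println!("2 workers: {:?}", healthy.run(|| 7, Duration::from_secs(1)));

    // With no workers, the job is registered but nothing ever services it.
    let starved = Pool::new(0);
    println!("0 workers: {:?}", starved.run(|| 7, Duration::from_millis(100)));
}
```

The shape matches the symptom in the trace, where interest is registered and then nothing polls for it, but it remains a guess about the mechanism rather than a diagnosis.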

@popey

popey commented Oct 9, 2019

If you have test cases I can help wrangle them into snapcraft with whatever tracing / debugging is needed so we can run that on the infrastructure exhibiting the issue. (I am affected as one of my snaps fails in this way - @cjwatson sent me this way and I'd like to help where I can).

@seanmonstar
Owner

A good first step would be trying the async example, which would help determine if the issue is about the blocking API not allowing epoll to run.

@tesuji
Contributor

tesuji commented Dec 10, 2019

@popey is there any progress?
Edit: The async example runs well on an aarch64-linux-gnu machine. I don't have an armhf machine to test it on.

@x448
Contributor

x448 commented Apr 27, 2020

EDIT: sorry, I didn't see kinnison's work using QEMU on this. Not sure if others can reproduce the issue by trying different settings like 2+ CPUs, etc.

Running Ubuntu 16.04.1 armhf on Qemu
https://gist.github.com/takeshixx/686a4b5e057deff7892913bf69bcb85a

This is a writeup about how to install Ubuntu 16.04.1 Xenial Xerus for the 32-bit hard-float ARMv7 (armhf) architecture on a Qemu VM via Ubuntu netboot.

The setup will create a Ubuntu VM with LPAE extensions (generic-lpae) enabled. However, this writeup should also work for non-LPAE (generic) kernels.

The performance of the resulting VM is quite good, and it allows VMs with >1G ram ...
...
The netboot files are available on the official Ubuntu mirror.

First comment on this gist is from Nov 2016 but there are comments as recent as April 20, 2020 that solve networking issues some people had.

@kinnison
Author

@x448 Thanks, but the issue is in the Snapcraft builder VMs, so I'd guess Canonical are okay at configuring qemu properly, and since it tends to work for everything else I remain confused as to why reqwest fails.

@cjwatson

We're also using actual hardware, not ARM-on-x86. qemu is still involved, but unlikely to be very much related to that gist.

@cjwatson

We may possibly have got to the bottom of this. See lxc/lxcfs#553.

@kinnison
Author

Looks like we should close this off, thank you @cjwatson and @seanmonstar for your efforts.

7 participants