Emulate the vsyscall page in userspace in the x86_64 Docker image #158

geofft · 2018-02-23T05:06:43Z

Since some recent distros are shipping with vsyscall=none by default,
the manylinux1 Docker image doesn't work. Fortunately, we can emulate
everything in userspace by catching segmentation faults for the vsyscall
addresses and forcing the program to use the vDSO instead.

Add an entrypoint to the x86_64 Docker image to detect whether this
emulation is required, and if so, catch these segfaults via ptrace and
adjust the instruction pointer. Using the ptrace syscall at all in
recent versions of Docker requires

docker run --security-opt=seccomp:unconfined

(which an error message will tell you to do if needed).

There is also a mode for the ptrace helper to trace an existing process
and its children. Because docker build doesn't support the
--security-opt option, this can be useful for building the manylinux1
image, by running this helper on docker-containerd.
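
To make the address handling above concrete, here is a minimal Python sketch of the lookup a tracer needs: given the faulting instruction pointer, decide which vDSO function to redirect to. The page address and the three entry offsets are the fixed x86_64 vsyscall layout. The name `vdso_target` is hypothetical, and the actual helper in this PR is a C program driven by ptrace, not this code.

```python
# Fixed x86_64 legacy vsyscall layout: one page at a constant address,
# with three entry points spaced 0x400 bytes apart.
VSYSCALL_BASE = 0xFFFFFFFFFF600000
PAGE_SIZE = 0x1000

# Entry offset within the vsyscall page -> equivalent vDSO symbol name.
VSYSCALL_TO_VDSO = {
    0x000: "__vdso_gettimeofday",
    0x400: "__vdso_time",
    0x800: "__vdso_getcpu",
}


def vdso_target(fault_ip):
    """Return the vDSO symbol for a faulting vsyscall entry, else None.

    Hypothetical helper for illustration only.
    """
    offset = fault_ip - VSYSCALL_BASE
    if not 0 <= offset < PAGE_SIZE:
        return None  # the fault is unrelated to the vsyscall page
    return VSYSCALL_TO_VDSO.get(offset)  # None for non-entry offsets
```

A segfault anywhere else, or at an offset that is not one of the three entry points, returns `None`, which corresponds to letting the process die as it normally would.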

geofft · 2018-02-23T05:13:49Z

Compared to the approach in #157, this uses ptrace from the outside instead of interposing on signal handlers from the inside, which means no fighting with every possible way to change signal handlers/masks. We only get called once a process has actually segfaulted, so this approach should be a lot more robust.

I've also added a check to make vsyscall_trace get out of the way (just exec instead of forking/tracing a child) if it notices that vsyscalls work fine on the current machine. That means that it only takes effect if your container wasn't going to run anyway on your machine. It also prints some warnings to stderr if it's running (or if it needs to run and ptrace isn't available). So this is significantly more helpful than the status quo: it at least tells you why your container is dying instead of letting it die silently.
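
One way such a "do vsyscalls work here?" check could look is sketched below in Python. It assumes that a kernel booted with vsyscall=none does not list a `[vsyscall]` mapping in `/proc/self/maps`, and it only inspects that listing. The function names are hypothetical, and the check in vsyscall_trace itself may be implemented differently.

```python
def has_vsyscall_mapping(maps_text):
    """Return True if a /proc/<pid>/maps listing contains [vsyscall].

    Assumption: with vsyscall=none the kernel omits this mapping.
    """
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vsyscall]":
            return True
    return False


def read_own_maps(path="/proc/self/maps"):
    """Read the current process's memory map listing (Linux only)."""
    with open(path) as f:
        return f.read()
```

On Linux, `has_vsyscall_mapping(read_own_maps())` returning `True` would correspond to the "just exec" path; `False` would mean emulation is needed.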

I've tested that a Docker container with the entry point here works fine on a machine booted with vsyscall=none (as long as you pass --security-opt=seccomp:unconfined to docker run).

So I'd like to seriously propose this for merge, as long as others feel confident about the logic to bypass vsyscall_trace on a machine where vsyscalls work. I think that safety valve means that the downside of merging this is essentially limited to that logic not working.

For building the manylinux1 container on a machine with vsyscall=none (i.e., if Travis ever switches to that), it turns out that you can't docker build --security-opt, on purpose (moby/moby#21105). So I'm not changing how we build the container yet, but if we need it / someone wants to build the container on their own system, just run something like

sudo docker/vsyscall_emu/vsyscall_trace -p $(ps -T -o tid= "$(sudo cat /var/run/docker/containerd/docker-containerd.pid)") &

(all threads of docker-containerd) before running docker build. I've confirmed that you can build the container if this process is running, and as soon as you kill it, you're no longer able to build the container.
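
For readers unfamiliar with the `ps -T -o tid=` part of that command: it lists the thread IDs of one process. The same information is available as the directory names under `/proc/<pid>/task`, as in this minimal Python sketch. The function name `thread_ids` and the `proc_root` parameter are hypothetical; the parameter exists only so the function can be pointed at a different directory tree.

```python
import os


def thread_ids(pid, proc_root="/proc"):
    """Return the sorted thread IDs of a process from <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())
```

Joining the result with spaces, for example `" ".join(map(str, thread_ids(pid)))`, gives the list of IDs that the command above passes to `-p`.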

(CI is broken because GitHub is now TLS 1.2-only, unrelated to this change.)

njsmith · 2018-02-25T08:10:00Z

I hate it, but it's hard to argue with, and the code looks reasonable, and I agree that it seems unlikely to do any harm compared to the status quo.

Has anyone tried getting in touch with the RHEL 6 glibc maintainers and talked to them about fixing this themselves? I don't want us to forget to do that because we have a tricky hack working...

Also, it seems like maybe this code is complex enough that it should have some tests? We run some basic smoke tests on the docker image as part of the build, which should at least exercise the exec path, but that's not a lot of coverage :-)

geofft · 2018-02-25T21:08:29Z

Agreed re tests. We don't actually get any test coverage I think because a) the image build uses docker build, so this code doesn't get run, and b) Travis's kernel supports vsyscalls anyway so this code no-ops even if it were to run.

I'm thinking of grabbing Linux's vsyscall test program https://github.com/torvalds/linux/blob/v4.15/tools/testing/selftests/x86/test_vsyscall.c , sedding both it and this tool to pretend the vsyscall page is at some other address (so that it would segfault on all machines), and making sure that the modified test_vsyscall runs with the modified vsyscall_trace. Seem reasonable?

Re libc, I'd like to keep the manylinux1 Docker working even as manylinux1.1 continues to happen, and I don't think anyone is maintaining the el5 glibc any more, right?

njsmith · 2018-02-25T23:10:15Z

> the image build uses docker build, so this code doesn't get run

Ah, you're right -- it looks like we do run our smoke tests from inside docker build. Possibly it would be a good idea in any case to move those out to a separate docker run invocation inside our .travis.yml.

geofft · 2018-02-27T02:24:46Z

Added tests as described above. When Travis upgrades its Docker to 1.10+ we'll need to add a --security-opt=seccomp:unconfined to the test script.

Also the tests caught one bug... my VM with Docker had the vDSO fit in one page, but my normal dev machine had it across two pages, and I was assuming that offsets within the vDSO would all be less than a page.

It can be more than one page.
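
The vDSO size issue can be illustrated with a short Python sketch that reads the `[vdso]` range from a `/proc/<pid>/maps` listing and validates an offset against the whole mapping instead of a single page. The names `vdso_extent` and `offset_in_vdso` are hypothetical, and this is not the fix that was committed to the C tool.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 page size


def vdso_extent(maps_text):
    """Return (start, end) of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = (int(part, 16) for part in fields[0].split("-"))
            return start, end
    return None


def offset_in_vdso(extent, offset):
    """Check an offset against the full mapping length, not one page."""
    start, end = extent
    return 0 <= offset < end - start
```

With a two-page vDSO, an offset such as 0x1a40 is valid even though it is larger than `PAGE_SIZE`, which is the case the original assumption missed.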

geofft · 2018-03-04T18:00:50Z

Anything else needed here? @markrwilliams is trying to convince me that this should really be Docker's job to fix (since Docker is providing the promise of ABI compatibility), but even if we get something into Docker it still seems worth merging in the meantime.

njsmith · 2018-03-04T21:26:31Z

I'd like to have some kind of testing that docker run on our new image actually works, even if it's just docker run /bin/true (though the self tests we already have might be a better choice than true).

The choice to build the trace program in the host system surprised me, though I see how there's a bootstrapping problem otherwise. Are we getting lucky and it works because it happens not to use any symbols that have changed since centos5 was released? Do we have a backup plan if that changes? Aren't we already requiring the bootstrap to work without this because this doesn't work at docker build time anyway?

I agree that docker ought to fix this, but I doubt they will. (How would they even do that?) Convincing RH to do something seems more viable. It's true that they won't fix centos5, but if they fix centos6 then that at least gives us a path to get rid of this eventually.

geofft · 2018-03-05T04:13:26Z

> I'd like to have some kind of testing that docker run on our new image actually works, even if it's just docker run /bin/true (though the self tests we already have might be a better choice than true).

Did you see the tests I added in the second commit in the series? It uses docker run on the newly-built image; it just emulates and tests a different vsyscall address so that it actually tests something on current Travis. (It does also have a mode where it doesn't use docker run if you don't specify an image, which I found useful for development, but I have .travis.yml using it against the image it just built.)

Using a normal docker run on Travis isn't going to tell us much because Travis currently supports vsyscalls, and so vsyscall_trace will take the no-op path. So running something like bash or python -c "import time" is only going to be useful if we add a second CI service with vsyscall=none, or run our own qemu VM or something on Travis (which is not actually hard, I've done it, it just seems like overkill).

I can confirm anecdotally that docker run with the vsyscall_trace entrypoint works on my test VM with vsyscall=none.

> The choice to build the trace program in the host system surprised me, though I see how there's a bootstrapping problem otherwise. Are we getting lucky and it works because it happens not to use any symbols that have changed since centos5 was released?

Yes, or put another way, the trace program is itself compliant with the manylinux1 profile. I sort of wanted to actually use auditwheel on it as a test, but I couldn't find any easy way to make auditwheel audit a standalone ELF binary. Maybe that's a useful mode to add to auditwheel in general?

> Do we have a backup plan if that changes? Aren't we already requiring the bootstrap to work without this because this doesn't work at docker build time anyway?

The bootstrap backup plan (i.e., "what happens if Travis switches to vsyscall=none") is to build vsyscall_trace on the host system and then attach it to docker-containerd as described above.

The ABI compatibility backup plan (i.e., "what happens if Travis updates libc incompatibly") is that you do the above if necessary to make docker build work, then build it (again) within docker build and use that binary as the entrypoint in the shipped container.

We could, currently, move the build to inside docker build, since we're not running it on the host presently. I just left it in this order because I'm more worried about the bootstrap backup plan than about the ABI compatibility backup plan. There are current distros with vsyscall=none; there are to my knowledge no current distros that would build a non-manylinux1-compliant vsyscall_trace.

> I agree that docker ought to fix this, but I doubt they will. (How would they even do that?)

Essentially, merge this code into docker-containerd: have it PTRACE_SEIZE any container that needs vsyscall emulation and do the same fixups. I haven't figured out the details, but the plan in my mind is basically to add a flag somewhere (I don't know where this fits in the image format) indicating that the container image needs vsyscall emulation, have Docker test at startup whether the host kernel supports vsyscalls, and have Docker notice if a container's pid 1 segfaults in the vsyscall page and print a useful error message saying the flag needs to be enabled. I agree that I'm not sure if they'll actually want to make these changes.

njsmith · 2018-03-13T02:22:37Z

> Using a normal docker run on Travis isn't going to tell us much because Travis currently supports vsyscalls, and so vsyscall_trace will take the no-op path.

Calling docker run on Travis tells us that we can call docker run on Travis, which is a rather important fact that's not entirely trivial :-). The tests you added look great, but they override the --entrypoint and stuff, which doesn't tell us much about the actual docker images.

> We could, currently, move the build to inside docker build, since we're not running it on the host presently. I just left it in this order because I'm more worried about the bootstrap backup plan than about the ABI compatibility backup plan.

OK, I'm glad I asked this question and I'm satisfied with the answer :-). Can you paste the URL of your comment into .travis.yml, right above the call to make, as an explanation for why we seem to be doing a strange thing there?

ehashman · 2018-07-03T23:40:20Z

Would e9493d5 address the issue, such that we can close this PR?

ehashman · 2018-11-18T22:43:39Z

Does this also affect manylinux2010?

markrwilliams · 2018-11-18T23:07:17Z

@ehashman Yes, manylinux2010's Docker image is still affected because CentOS 6 uses an older version of glibc :(

I included a section about it in the PEP: https://www.python.org/dev/peps/pep-0571/#compatibility-with-kernels-that-lack-vsyscall

EDIT: More relevantly, I included some mumbo jumbo about this in the Manylinux2 PR:

#152 (comment)

I could very well be wrong that this is necessary, though!

ehashman · 2018-11-19T03:43:10Z

Got it, and having caught up with the PEP, I understand how this would also affect manylinux2010.

I'm with @njsmith on hating this :) @geofft: what's your opinion on the glibc patch and rebuild I commented above? I think I prefer that as a slightly less hacky solution, but I'm curious to hear your take.

mayeut · 2019-04-12T03:54:50Z

Superseded by the glibc patch in #279, which has been merged.
If maintaining the glibc patch causes too much trouble, this method should be reconsidered.

geofft mentioned this pull request Feb 23, 2018

WIP: Emulate the vsyscall page in userspace in the x86_64 Docker image #157

Closed

geofft force-pushed the vsyscall-trace branch 2 times, most recently from c1f5440 to 6337fce Compare March 3, 2018 00:44

geofft added 2 commits March 2, 2018 20:33

vsyscall_trace: Add a test mode using a faked vsyscall base address

c2da56a

vsyscall_emu: Handle the vDSO correctly

cfa8674

It can be more than one page.

geofft force-pushed the vsyscall-trace branch from 6337fce to cfa8674 Compare March 3, 2018 01:34

markrwilliams mentioned this pull request Apr 14, 2018

Manylinux2 #152

Closed

njsmith mentioned this pull request May 19, 2018

Add Python 3.7.0b4 #196

Merged

geofft mentioned this pull request Nov 30, 2018

NodeJS native modules made difficult by distribution packages. nodejs/node#21897

Closed

rdb mentioned this pull request Dec 17, 2018

manylinux docker image bash segfault w/ recent kernel #254

Closed

trishankatdatadog mentioned this pull request Apr 10, 2019

Tracking issue for manylinux2010 rollout #179

Closed

14 tasks

mayeut closed this Apr 12, 2019

Mizux mentioned this pull request May 2, 2019

WHEEL file size mismatches for ortools-7.0.6546-cp36-cp36m-manylinux1_x86_64.whl google/or-tools#1218

Closed

squeaky-pl mentioned this pull request Nov 20, 2019

Enable vsyscall=emulate in the kernel config to run older base images such as Centos 6 microsoft/WSL#4694

Closed
