Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emulate the vsyscall page in userspace in the x86_64 Docker image #158

Closed
wants to merge 3 commits into from

Conversation

geofft
Copy link

@geofft geofft commented Feb 23, 2018

Since some recent distros are shipping with vsyscall=none by default,
the manylinux1 Docker image doesn't work. Fortunately, we can emulate
everything in userspace by catching segmentation faults for the vsyscall
addresses and forcing the program to use the vDSO instead.

Add an entrypoint to the x86_64 Docker image to detect whether this
emulation is required, and if so, catch these segfaults via ptrace and
adjust the instruction pointer. Using the ptrace syscall at all in
recent versions of Docker requires

docker run --security-opt=seccomp:unconfined

(which an error message will tell you to do if needed).

There is also a mode for the ptrace helper to trace an existing process
and its children. Because docker build doesn't support the
--security-opt option, this can be useful for building the manylinux1
image, by running this helper on docker-containerd.

@geofft
Copy link
Author

geofft commented Feb 23, 2018

Compared to the approach in #157, this uses ptrace from the outside instead of interposing on signal handlers from the inside, which means no fighting with every possible way to change signal handlers/masks. We only get called once a process has actually segfaulted, so this approach should be a lot more robust.

I've also added a check to make vsyscall_trace get out of the way (just exec instead of forking/tracing a child) if it notices that vsyscalls work fine on the current machine. That means that it only takes effect if your container wasn't going to run anyway on your machine. It also prints some warnings to stderr if it's running (or if it needs to run and ptrace isn't available). So this is significantly more helpful than the status quo: it at least tells you why your container is dying instead of letting it die silently.

I've tested that a Docker container with the entry point here works fine on a machine booted with vsyscall=none (as long as you pass --security-opt=seccomp:unconfined to docker run).

So I'd like to seriously propose this for merge, as long as others feel confident about the logic to bypass vsyscall_trace on a machine where vsyscalls work. I think that safety valve means that the downside of merging this is essentially limited to that logic not working.

For building the manylinux1 container on a machine with vsyscall=none (i.e., if Travis ever switches to that), it turns out that you can't docker build --security-opt, on purpose (moby/moby#21105). So I'm not changing how we build the container yet, but if we need it / someone wants to build the container on their own syste, just run something like

sudo docker/vsyscall_emu/vsyscall_trace -p $(ps -T -o tid= "$(sudo cat /var/run/docker/containerd/docker-containerd.pid)") &

(all threads of docker-containerd) before running docker build. I've confirmed that you can build the container if this process is running, and as soon as you kill it, you're no longer able to build the container.

(CI is broken because GitHub is now TLS 1.2-only, unrelated to this change.)

@njsmith
Copy link
Member

njsmith commented Feb 25, 2018

I hate it, but it's hard to argue with, and the code looks reasonable, and I agree that it seems unlikely to do any harm compared to the status quo.

Has anyone tried getting in touch with the RHEL 6 glibc maintainers and talked to them about fixing this themselves? I don't want us to forget to do that because we have a tricky hack working...

Also, it seems like maybe this code is complex enough that it should have some tests? We run some basic smoke tests on the docker image as part of the build, which should at least exercise the exec path, but that's not a lot of coverage :-)

@geofft
Copy link
Author

geofft commented Feb 25, 2018

Agreed re tests. We don't actually get any test coverage I think because a) the image build uses docker build, so this code doesn't get run, and b) Travis's kernel supports vsyscalls anyway so this code no-ops even if it were to run.

I'm thinking of grabbing Linux's vsyscall test program https://github.com/torvalds/linux/blob/v4.15/tools/testing/selftests/x86/test_vsyscall.c , sedding both it and this tool to pretend the vsyscall page is at some other address (so that it would segfault on all machines), and making sure that the modified test_vsyscall runs with the modified vsyscall_trace. Seem reasonable?

Re libc, I'd like to keep the manylinux1 Docker working even as manylinux1.1 continues to happen, and I don't think anyone is maintaining the el5 glibc any more, right?

@njsmith
Copy link
Member

njsmith commented Feb 25, 2018

the image build uses docker build, so this code doesn't get run

Ah, you're right -- it looks like we do run our smoke tests from inside docker build. Possibly it would be a good idea in any case to move those out to a separate docker run invocation inside our .travis.yml.

@geofft
Copy link
Author

geofft commented Feb 27, 2018

Added tests as described above. When Travis upgrades its Docker to 1.10+ we'll need to add a --security-opt=seccomp:unconfined to the test script.

Also the tests caught one bug... my VM with Docker had the vDSO fit in one page, but my normal dev machine had it across two pages, and I was assuming that offsets within the vDSO would all be less than a page.

Since some recent distros are shipping with vsyscall=none by default,
the manylinux1 Docker image doesn't work. Fortunately, we can emulate
everything in userspace by catching segmentation faults for the vsyscall
addresses and forcing the program to use the vDSO instead.

Add an entrypoint to the x86_64 Docker image to detect whether this
emulation is required, and if so, catch these segfaults via ptrace and
adjust the instruction pointer. Using the ptrace syscall at all in
recent versions of Docker requires

    docker run --security-opt=seccomp:unconfined

(which an error message will tell you to do if needed).

There is also a mode for the ptrace helper to trace an existing process
and its children. Because `docker build` doesn't support the
`--security-opt` option, this can be useful for building the manylinux1
image, by running this helper on docker-containerd.
@geofft geofft force-pushed the vsyscall-trace branch 2 times, most recently from c1f5440 to 6337fce Compare March 3, 2018 00:44
@geofft
Copy link
Author

geofft commented Mar 4, 2018

Anything else needed here? @markrwilliams is trying to convince me that this should really be Docker's job to fix (since Docker is providing the promise of ABI compatibility), but even if we get something into Docker it still seems worth merging in the meantime.

@njsmith
Copy link
Member

njsmith commented Mar 4, 2018

I'd like to have some kind of testing that docker run on our new image actually works, even if it's just docker run /bin/true (though the self tests we already have might be a better choice than true).

The choice to build the trace program in the host system surprised me, though I see how there's a bootstrapping problem otherwise. Are we getting lucky and it works because it happens not to use any symbols that have changed since centos5 was released? Do we have a backup plan if that changes? Aren't we already requiring the bootstrap to work without this because this doesn't work at docker build time anyway?

I agree that docker ought to fix this, but I doubt they will. (How would they even do that?) Convincing RH to do something seems more viable. It's true that they won't fix centos5, but if they fix centos6 then that at least gives us a path to get rid of this eventually.

@geofft
Copy link
Author

geofft commented Mar 5, 2018

I'd like to have some kind of testing that docker run on our new image actually works, even if it's just docker run /bin/true (though the self tests we already have might be a better choice than true).

Did you see the tests I added in the second commit in the series? It uses docker run on the newly-built image; it just emulates and tests a different vsyscall address so that it actually tests something on current Travis. (It does also have a mode where it doesn't use docker run if you don't specify an image, which I found useful for development, but I have .travis.yml using it against the image it just built.)

Using a normal docker run on Travis isn't going to tell us much because Travis currently supports vsyscalls, and so it vsyscall_trace will take the no-op path. So running something like bash or python -c "import time" is only going to be useful if we add a second CI service with vsyscall=none, or run our own qemu VM or something on Travis (which is not actually hard, I've done it, it just seems like overkill).

I can confirm anecdotally that docker run with the vsyscall_trace entrypoint works on my test VM with vsyscall=none.

The choice to build the trace program in the host system surprised me, though I see how there's a bootstrapping problem otherwise. Are we getting lucky and it works because it happens not to use any symbols that have changed since centos5 was released?

Yes, or put another way, the trace program is itself compliant with the manylinux1 profile. I sort of wanted to actually use auditwheel on it as a test, but I couldn't find any easy way to make auditwheel audit a standalone ELF binary. Maybe that's a useful mode to add to auditwheel in general?

Do we have a backup plan if that changes? Aren't we already requiring the bootstrap to work without this because this doesn't work at docker build time anyway?

The bootstrap backup plan (i.e., "what happens if Travis switches to vsyscall=none") is to build vsyscall_trace on the host system and then attach it to docker-containerd as described above.

The ABI compatibility backup plan (i.e., "what happens if Travis updates libc incompatibly") is that you do the above if necessary to make docker build work, then build it (again) within docker build and use that binary as the entrypoint in the shipped container.

We could, currently, move the build to inside docker build, since we're not running it on the host presently. I just left it in this order because I'm more worried about the bootstrap backup plan than about the ABI compatibility backup plan. There are current distros with vsyscall=none; there are to my knowledge no current distros that would build a non-manylinux1-compliant vsyscall_trace.

I agree that docker ought to fix this, but I doubt they will. (How would they even do that?)

Essentially, merge this code into docker-containerd: have it PTRACE_SEIZE any container that needs vsyscall emulation and do the same fixups. I haven't figured out the details, but the plan in my mind is basically to add a flag somewhere (I don't know where this fits in the image format) indicating that the container image needs vsyscall emulation, have Docker test at startup whether the host kernel supports vsyscalls, and have Docker notice if a container's pid 1 segfaults in the vsyscall page and print a useful error message saying the flag needs to be enabled. I agree that I'm not sure if they'll actually want to make these changes.

@njsmith
Copy link
Member

njsmith commented Mar 13, 2018

Using a normal docker run on Travis isn't going to tell us much because Travis currently supports vsyscalls, and so it vsyscall_trace will take the no-op path.

Calling docker run on Travis tells us that we can call docker run on Travis, which is a rather important fact that's not entirely trivial :-). The tests you added look great, but they override the --entrypoint and stuff, which doesn't tell us much about the actual docker images.

We could, currently, move the build to inside docker build, since we're not running it on the host presently. I just left it in this order because I'm more worried about the bootstrap backup plan than about the ABI compatibility backup plan.

OK, I'm glad I asked this question and I'm satisfied with the answer :-). Can you paste the URL of your comment into .travis.yml, right above the call to make, as an explanation for why we seem to be doing a strange thing there?

@markrwilliams markrwilliams mentioned this pull request Apr 14, 2018
@njsmith njsmith mentioned this pull request May 19, 2018
@ehashman
Copy link
Member

ehashman commented Jul 3, 2018

Would e9493d5 address the issue, such that we can close this PR?

@ehashman
Copy link
Member

Does this also affect manylinux2010?

@markrwilliams
Copy link

markrwilliams commented Nov 18, 2018

@ehashman Yes, manylinux2010's Docker image is still affected because CentOS 6 uses an older version of glibc :(

I included a section about it in the PEP: https://www.python.org/dev/peps/pep-0571/#compatibility-with-kernels-that-lack-vsyscall

EDIT: More relevantly, I included some mumbo jumbo about this in the Manylinux2 PR:

#152 (comment)

I could very well be wrong that this is necessary, though!

@ehashman
Copy link
Member

Got it, and having caught up with the PEP, I understand how this would also affect manylinux2010.

I'm with @njsmith on hating this :) @geofft: what's your opinion on the glibc patch and rebuild I commented above? I think I prefer that as a slightly less hacky solution, but I'm curious to hear your take.

@mayeut
Copy link
Member

mayeut commented Apr 12, 2019

Superseded by glibc patch in #279 that's been merged-in.
If maintaining the glibc patch causes too much troubles then this method shall be reconsidered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants