-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emulate the vsyscall page in userspace in the x86_64 Docker image #158
Conversation
Compared to the approach in #157, this uses ptrace from the outside instead of interposing on signal handlers from the inside, which means no fighting with every possible way to change signal handlers/masks. We only get called once a process has actually segfaulted, so this approach should be a lot more robust. I've also added a check to make vsyscall_trace get out of the way (just exec instead of forking/tracing a child) if it notices that vsyscalls work fine on the current machine. That means that it only takes effect if your container wasn't going to run anyway on your machine. It also prints some warnings to stderr if it's running (or if it needs to run and ptrace isn't available). So this is significantly more helpful than the status quo: it at least tells you why your container is dying instead of letting it die silently. I've tested that a Docker container with the entry point here works fine on a machine booted with So I'd like to seriously propose this for merge, as long as others feel confident about the logic to bypass vsyscall_trace on a machine where vsyscalls work. I think that safety valve means that the downside of merging this is essentially limited to that logic not working. For building the manylinux1 container on a machine with
(all threads of docker-containerd) before running (CI is broken because GitHub is now TLS 1.2-only, unrelated to this change.) |
I hate it, but it's hard to argue with, and the code looks reasonable, and I agree that it seems unlikely to do any harm compared to the status quo. Has anyone tried getting in touch with the RHEL 6 glibc maintainers and talked to them about fixing this themselves? I don't want us to forget to do that because we have a tricky hack working... Also, it seems like maybe this code is complex enough that it should have some tests? We run some basic smoke tests on the docker image as part of the build, which should at least exercise the |
Agreed re tests. We don't actually get any test coverage I think because a) the image build uses I'm thinking of grabbing Linux's vsyscall test program https://github.com/torvalds/linux/blob/v4.15/tools/testing/selftests/x86/test_vsyscall.c , sedding both it and this tool to pretend the vsyscall page is at some other address (so that it would segfault on all machines), and making sure that the modified test_vsyscall runs with the modified vsyscall_trace. Seem reasonable? Re libc, I'd like to keep the manylinux1 Docker working even as manylinux1.1 continues to happen, and I don't think anyone is maintaining the el5 glibc any more, right? |
Ah, you're right -- it looks like we do run our smoke tests from inside docker build. Possibly it would be a good idea in any case to move those out to a separate |
Added tests as described above. When Travis upgrades its Docker to 1.10+ we'll need to add a Also the tests caught one bug... my VM with Docker had the vDSO fit in one page, but my normal dev machine had it across two pages, and I was assuming that offsets within the vDSO would all be less than a page. |
Since some recent distros are shipping with vsyscall=none by default, the manylinux1 Docker image doesn't work. Fortunately, we can emulate everything in userspace by catching segmentation faults for the vsyscall addresses and forcing the program to use the vDSO instead. Add an entrypoint to the x86_64 Docker image to detect whether this emulation is required, and if so, catch these segfaults via ptrace and adjust the instruction pointer. Using the ptrace syscall at all in recent versions of Docker requires docker run --security-opt=seccomp:unconfined (which an error message will tell you to do if needed). There is also a mode for the ptrace helper to trace an existing process and its children. Because `docker build` doesn't support the `--security-opt` option, this can be useful for building the manylinux1 image, by running this helper on docker-containerd.
c1f5440
to
6337fce
Compare
It can be more than one page.
Anything else needed here? @markrwilliams is trying to convince me that this should really be Docker's job to fix (since Docker is providing the promise of ABI compatibility), but even if we get something into Docker it still seems worth merging in the meantime. |
I'd like to have some kind of testing that The choice to build the trace program in the host system surprised me, though I see how there's a bootstrapping problem otherwise. Are we getting lucky and it works because it happens not to use any symbols that have changed since centos5 was released? Do we have a backup plan if that changes? Aren't we already requiring the bootstrap to work without this because this doesn't work at I agree that docker ought to fix this, but I doubt they will. (How would they even do that?) Convincing RH to do something seems more viable. It's true that they won't fix centos5, but if they fix centos6 then that at least gives us a path to get rid of this eventually. |
Did you see the tests I added in the second commit in the series? It uses Using a normal I can confirm anecdotally that
Yes, or put another way, the trace program is itself compliant with the manylinux1 profile. I sort of wanted to actually use auditwheel on it as a test, but I couldn't find any easy way to make auditwheel audit a standalone ELF binary. Maybe that's a useful mode to add to auditwheel in general?
The bootstrap backup plan (i.e., "what happens if Travis switches to The ABI compatibility backup plan (i.e., "what happens if Travis updates libc incompatibly") is that you do the above if necessary to make We could, currently, move the build to inside
Essentially, merge this code into docker-containerd: have it |
Calling
OK, I'm glad I asked this question and I'm satisfied with the answer :-). Can you paste the URL of your comment into |
Would e9493d5 address the issue, such that we can close this PR? |
Does this also affect manylinux2010? |
@ehashman Yes, manylinux2010's Docker image is still affected because CentOS 6 uses an older version of glibc :( I included a section about it in the PEP: https://www.python.org/dev/peps/pep-0571/#compatibility-with-kernels-that-lack-vsyscall EDIT: More relevantly, I included some mumbo jumbo about this in the Manylinux2 PR: I could very well be wrong that this is necessary, though! |
Got it, and having caught up with the PEP, I understand how this would also affect manylinux2010. I'm with @njsmith on hating this :) @geofft: what's your opinion on the glibc patch and rebuild I commented above? I think I prefer that as a slightly less hacky solution, but I'm curious to hear your take. |
Superseded by glibc patch in #279 that's been merged-in. |
Since some recent distros are shipping with vsyscall=none by default,
the manylinux1 Docker image doesn't work. Fortunately, we can emulate
everything in userspace by catching segmentation faults for the vsyscall
addresses and forcing the program to use the vDSO instead.
Add an entrypoint to the x86_64 Docker image to detect whether this
emulation is required, and if so, catch these segfaults via ptrace and
adjust the instruction pointer. Using the ptrace syscall at all in
recent versions of Docker requires
(which an error message will tell you to do if needed).
There is also a mode for the ptrace helper to trace an existing process
and its children. Because
docker build
doesn't support the--security-opt
option, this can be useful for building the manylinux1image, by running this helper on docker-containerd.