Rust's main-stack thread guard is implemented incorrectly; breaks nearly all Rust programs on 64KB-pagesize post-stack-clash Linux #43052
Some test code in case someone wants to play with it themselves on a 64KB-pagesize system:

```rust
const SZ: usize = 24576;

fn main() {
    let ys: [u64; SZ] = [0; SZ]; // 192KB of space
    println!("Hello world! {} {}", ys[0], ys[SZ - 1]);
    println!("Address: {:p} {:p}", &ys[0], &ys[SZ - 1]);
}
```

```sh
#!/bin/sh
exec gdb -d ~/glibc-2.19/debian/ -d ~/glibc-2.19/malloc -d ~/glibc-2.19/nptl -d ~/glibc-2.19/sysdeps -q "$@" -ex 'run' ./test
```
|
Just to clarify: Does this mean that this mechanism has always caused a segfault when the stack grew to the page just before (or after, since it grows backwards) the guard page? This would mean that Rust's guard page was never hit on Linux. |
It seems so - you can set up a similar test even on amd64 (4KB pagesize) pre-stack-clash with |
That may just be that you're setting that amd64 stack too small for even the init calls to work. I can get the message from the main thread trivially with recursion, on an older kernel:
But on an x86_64 stack-clash-patched kernel, that just seg-faults. So I guess we can say that stack-clash kernels make the guard page useless with 4KB pages, and actively harmful with 64KB kernels. I think it's reasonable to blame the kernel's behavior though. |
cc @alexcrichton and #42816. While stack probing is currently x86-only, I think it will also have the effect of hitting the kernel's safety margin before anything will reach the custom guard page. |
I can confirm that your test program prints the stack overflow message, but it seems that the rust guard page is still not getting hit, the segfault gets raised and caught by rust without hitting rust's guard page:
I wonder why this doesn't happen on the post-stack-clash kernels though.. |
Ah, I think it's because the init ends up using an offset two pages above its base, and then the signal handler looks for faults only one page below that. So in fact it's never looking for the real guard page, but the one above it that the kernel triggers itself. And in post-stack-clash kernels, the kernel now triggers SIGSEGV many pages above, so it doesn't even match that offset=2 any more. |
Ahhh, OK. So it actually sounds like whoever originally wrote this code knew about the actual Linux stack guard behaviour as described above, and hardcoded these offsets appropriately. They just didn't expect it to change from kernel to kernel. If there were a way to get the kernel stack guard size at runtime, Rust's stack guard could actually be made to work without additional kernel patches; we would just need to adjust the offsets in the two pieces of code that you linked. (edit: Though the behaviour in the case of an unlimited stack would not be very reliable.) |
I think at least on Linux, we might be better not to add our own guard page to the main thread at all. We only need a way to determine if a particular fault was stack-related for reporting purposes. |
Hm, I'm a little confused and don't quite understand the OP. It makes sense to me that the main thread has "special stack growth functionality" which manually unmapping some page in between messes with. What I don't understand is how this relates to page size, or what the meaning of a "post-stack-clash kernel" is. What was changed in the kernel? This is just a problem of error reporting, right? @cuviper note that #42816 can't land right now due to a test failing on Travis that passes locally for me (specifically a test that the main thread stack overflowing prints a message). I wonder if that's related to this? Is there a recourse for us to deterministically catch stack overflow on the main thread? Or can we only do that for child threads? |
@alexcrichton "Stack Clash" is a series of vulnerabilities reported recently in a bunch of different projects. Ultimately they came down to triggering some stack overflow vulnerabilities to jump over the stack guard page directly into the heap, resulting in exploits rather than just crashes. A mitigation was added to recent Linux kernels that significantly widens the size of the stack guard, making this exploitation technique much more difficult. The new size of the guard is 256 pages instead of 1 page, hence the relation to the page size (though Linus recently suggested a hard-coded 1MiB gap instead of a page-size-dependent gap). I don't think this is just a problem of error reporting. The issue is that if the main thread stack ever needs to grow, even if there is room above Rust's "manual" guard page, there is no longer room for the stack plus the kernel's new gigantic guard area. This results in a stack overflow segfault for any stack usage above the initially mapped area, well below the expected limit of 8MiB. |
I'll try to restate it. When a program starts, only a relatively small stack is actually mapped to begin with. If you page-fault past the end, the kernel will dynamically add more pages up to the rlimit. But if a new stack page would end up right on top of some existing mapping, the kernel will refuse and you'll get a SIGSEGV. Nobody wants the stack to seamlessly flow into an unrelated mapping, after all. Rust is adding a guard page of its own at the bottom of that maximum stack range.

The new kernel behavior is that instead of enforcing just a one-page gap between stack growth and other mappings, the kernel now enforces 256 pages. So now our guard page causes the kernel to fault any access in any of the 256 pages above that. We do still get a SIGSEGV, just far sooner than the expected limit. On 4KB systems, 256 pages is 1MB, so the placement of our guard page makes the last 1MB of stack unusable. On 64KB systems, 256 pages makes 16MB unusable. With rlimit typically only 8MB, our guard page makes it so the stack effectively can't grow at all.

And yes, it does seem likely that this is why #42816 is getting raw SIGSEGVs instead of reporting the stack overflow nicely. I'd guess your own kernel does not have these updates yet, and CI does. |
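To make the arithmetic concrete, here is a small illustrative sketch (not libstd code; the 256-page gap and 8 MiB rlimit are the values discussed above):

```rust
// Minimal model of why 64 KiB pages are fatal here: the kernel's enforced
// gap (256 pages above our guard mapping) eats into the usable stack.
// Illustrative arithmetic only.
fn unusable_stack_bytes(page_size: usize) -> usize {
    const GAP_PAGES: usize = 256; // post-stack-clash kernel gap
    GAP_PAGES * page_size
}

fn main() {
    let rlimit: usize = 8 * 1024 * 1024; // typical RLIMIT_STACK: 8 MiB
    for &page in &[4096usize, 65536] {
        let gap = unusable_stack_bytes(page);
        println!(
            "{} B pages: gap = {} MiB, usable stack = {} MiB",
            page,
            gap >> 20,
            rlimit.saturating_sub(gap) >> 20
        );
    }
}
```

On 4 KiB pages this reports a 1 MiB gap with 7 MiB still usable; on 64 KiB pages the 16 MiB gap exceeds the whole 8 MiB limit, leaving nothing.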
Thanks for the info @cuviper and @tavianator! That all makes sense to me. Sounds like detection of a stack overflow on the main thread right now basically isn't possible? It seems to me that we should remove the main thread guard page on Linux (as the kernel already has it anyway) and rely on #42816 to guarantee for sure we hit it. Does that sound right? |
@alexcrichton Your conclusion seems correct, but I don't think Linux adds a guard page automatically (at least I couldn't see one with a quick test and didn't find documentation either). It's just pthreads that can add guard pages to threads. However, the mechanism outlined by @infinity0 should result in the same outcome after stack probes land: Linux will simply refuse to grow the stack if it comes too close to another region in memory or if the stack limit is hit. These 2 conditions ensure that the stack stays in bounds and that there will always be an unmapped page below the stack, which will cause a segfault when hit, acting like the guard page (but actually getting hit, unlike the current guard page implementation). |
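A rough model of that kernel-side refusal, with made-up names and addresses purely for illustration (the real check lives in the kernel's memory-management code):

```rust
/// Rough model of the check described above: growing the stack down to
/// `new_low` is refused if that would come within `gap` bytes of the end
/// of the next-lower mapping. A sketch only, not actual kernel code.
fn stack_may_grow(new_low: usize, below_mapping_end: usize, gap: usize) -> bool {
    new_low >= below_mapping_end.saturating_add(gap)
}

fn main() {
    let page: usize = 65536;
    let gap = 256 * page; // 16 MiB on 64 KiB pages
    // Suppose Rust's guard page ends here (arbitrary illustrative address),
    // sitting one full rlimit below the stack top:
    let guard_end: usize = 0x7000_0000;
    // Even growth that stays within the 8 MiB rlimit is refused, because
    // it lands inside the kernel's 16 MiB gap above the guard mapping:
    assert!(!stack_may_grow(guard_end + 8 * 1024 * 1024, guard_end, gap));
    // Without any mapping in the way, the same growth would be allowed:
    assert!(stack_may_grow(guard_end + 8 * 1024 * 1024, 0, gap));
    println!("stack growth refused exactly as the thread describes");
}
```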
Right, Linux doesn't have a literal guard page mapping, but it enforces guard pages in effect. I do think we need to stop adding our own guard page to the main thread. Even apart from the current issue, it worries me to have a blind `MAP_FIXED` mapping dropped into the stack's address range. That just leaves us with the error reporting. #42816 will make sure we get a properly signaled fault, but we won't know how to report it as a stack overflow. I'm not sure how to distinguish this now... |
Presumably you can check if the faulting address lies just beyond the stack mapping? |
Ok, but is there a better way to find that stack mapping than parsing `/proc/self/maps`? I think it's OK if this only detects properly-strided stack probes. We're aborting in any case, so the stack overflow message is just a courtesy to help the user figure out why. |
Parsing `/proc/self/maps` is one option. The other way would be if you knew the address of every stack probe instruction. Just check if the faulting RIP is one of those. |
Well, AIUI only growth greater than a page will invoke stack probing at all, so RIP could be anything -- that main-recursion I posted will just overflow on its own. I think such parsing is OK, but it needs to be async-signal-safe (e.g. no `malloc`). |
While parsing something in |
Hm, well, another option is that we could whitelist architectures with 4KB page sizes. That way on x86/x86_64 and maybe ARM you'd at least get good error messages, and programs would at least be functional on powerpc64. Just to make sure I understand, let's say we did this. We'd have to detect this by "any faulting address at most 1MB above the guard page" indicating a stack overflow, and we'd basically be subtracting 1MB from the default stack size on the system, right? Are there downsides to doing so? |
Maybe we can just note the address that we expect rlimit to start faulting, without literally mapping our own page there? Then we can still match something in the handler, but we won't trigger the extra padding heuristics. I think the "offset" will then be -1 as it will be able to use the full 8MB/whatever and only fault after that. |
Rather, offset=0 since we match the page below that saved address.
|
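A hypothetical sketch of that handler-side match; the helper name and layout are made up, but it follows the offset-0 scheme described above:

```rust
/// Hypothetical sketch: `stack_bottom` is the address where we expect
/// RLIMIT_STACK to start faulting, recorded at startup instead of mapping
/// a guard page. With offset = 0 we treat a fault in the page just below
/// that bound as a stack overflow. Not libstd's exact code.
fn is_stack_overflow(fault_addr: usize, stack_bottom: usize, page_size: usize) -> bool {
    fault_addr >= stack_bottom.saturating_sub(page_size) && fault_addr < stack_bottom
}

fn main() {
    let page: usize = 4096;
    let bottom: usize = 0x7f00_0000; // illustrative address only
    assert!(is_stack_overflow(bottom - 1, bottom, page)); // just past the limit
    assert!(!is_stack_overflow(bottom, bottom, page)); // still inside the stack
    assert!(!is_stack_overflow(bottom - page - 1, bottom, page)); // unrelated fault
    println!("fault classification works");
}
```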
Makes sense to me! |
I have it working without a guard page on x86_64; testing on ppc64 now. |
It works on ppc64 too! My test program now uses the full stack and then aborts as desired with Rust's "stack overflow" message. I'll open a PR soon... |
Awesome thanks @cuviper! |
@cuviper What about the `ulimit -s unlimited` case? |
@tavianator I think the unlimited case is already handled poorly, and the update will just mean we stop stuffing a guard page at a very distant address. Right now we start with something like this:
With the update, we'll stop mapping that guard page. If something else got mapped anywhere in that space, then you'll overflow sooner, but we wouldn't have identified that with or without our guard page. I don't think you can without looking at the current mappings. |
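One way to picture the unlimited case, as a hedged sketch with a hypothetical helper (`None` standing in for `RLIM_INFINITY`):

```rust
/// Sketch of recording the expected fault address at startup. With
/// `ulimit -s unlimited` there is no finite rlimit, so there is no
/// meaningful bound to record and overflow can't be identified this way.
/// Hypothetical helper, not libstd's exact code.
fn expected_fault_bound(stack_top: usize, rlimit: Option<usize>) -> Option<usize> {
    // None models RLIM_INFINITY; Some(bytes) a finite RLIMIT_STACK.
    rlimit.map(|limit| stack_top.saturating_sub(limit))
}

fn main() {
    let top: usize = 0x7fff_f000; // illustrative address only
    assert_eq!(expected_fault_bound(top, Some(8 << 20)), Some(top - (8 << 20)));
    assert_eq!(expected_fault_bound(top, None), None); // unlimited: no bound
    println!("unlimited stacks record no overflow bound");
}
```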
@cuviper Makes sense, thanks! |
This regression is likely to be fixed on the kernel side soon with some heuristic to allow for guard page mappings installed by user-space. However, I think any attempt by Rust or other language run-time to implement its own guard pages on the main thread stack is unlikely to do much good. Also, the size of the stack guard gap is currently (and will probably continue to be) configurable, so please don't make any assumptions about this. |
Could #43102 be caused by this bug? |
The kernel is definitely doing something undesirable/stupid here. Before removing our guard page on Linux, we also need to ensure that every kernel version on which Rust programs will not immediately crash actually has an effective guard page. |
Skip the main thread's manual stack guard on Linux

Linux doesn't allocate the whole stack right away, and the kernel has its own stack-guard mechanism to fault when growing too close to an existing mapping. If we map our own guard, then the kernel starts enforcing a rather large gap above that, rendering much of the possible stack space useless. Instead, we'll just note where we expect rlimit to start faulting, so our handler can report "stack overflow", and trust that the kernel's own stack guard will work. Fixes #43052. r? @alexcrichton

### Kernel compatibility

Strictly speaking, Rust claims support for Linux kernels >= 2.6.18, and stack guards were only added to mainline in 2.6.36 for [CVE-2010-2240]. But since that vulnerability was so severe, the guards were backported to many stable branches, and Red Hat patched this all the way back to RHEL3's 2.4.21! I think it's reasonable for us to assume that any *supportable* kernel should have these stack guards.

At that time, the kernel only enforced one page of padding between the stack and other mappings, but thanks to [Stack Clash] that padding is now much larger, causing #43052. The kernel side of those fixes are in [CVE-2017-1000364], which Red Hat has backported to at least RHEL5's 2.6.18 so far.

[CVE-2010-2240]: https://access.redhat.com/security/cve/CVE-2010-2240
[CVE-2017-1000364]: https://access.redhat.com/security/cve/CVE-2017-1000364
[Stack Clash]: https://access.redhat.com/security/vulnerabilities/stackguard
This is the 4.9.34 stable release -- which tends to be rather unstable, according to the various regression reports due to upstream's Stack Clash "fix" ([1], [2], [3]). In return I've essentially reverted commit 27f9070614aa ("mm: larger stack guard gap, between vmas") and its follow-up commits. PaX (and grsec) did handle this just fine since 2010 (see [4]) and even have a sysctl to allow tweaking the gap size at runtime. But, at least, I took the commit as a hint to extend PaX 'n grsec's coverage somewhat to other arches like ARC, S390, ... -- untested, though.

The newly introduced stack_guard_gap kernel command line parameter was recoined to apply to the existing PaX mechanism and takes a byte value instead of a number of pages, e.g. use 'stack_guard_gap=1M' to mirror upstream's 1MiB gap. It's now still the 64KiB gap which is sufficient, as the problem can't be solved in the kernel anyways. You still need to compile your userland apps with -fstack-check to make unbounded / large / attacker-controlled alloca()s trap on the guard area instead of jumping over it.

Signed-off-by: Mathias Krause <[email protected]>

[1] rust-lang/rust#43052
[2] https://lwn.net/Articles/727206/
[3] https://lwn.net/Articles/727703/
[4] https://grsecurity.net/~spender/stack_gap_fix.txt

Conflicts: arch/arm/mm/mmap.c, arch/frv/mm/elf-fdpic.c, arch/mips/mm/mmap.c, arch/powerpc/mm/slice.c, arch/sh/mm/mmap.c, arch/sparc/kernel/sys_sparc_64.c, arch/sparc/mm/hugetlbpage.c, arch/x86/kernel/sys_x86_64.c, arch/x86/mm/hugetlbpage.c, fs/hugetlbfs/inode.c, fs/proc/task_mmu.c, mm/mmap.c
FWIW, we should probably skip adding our own guard on OS X too. Our current guard page is ending up in the middle of the main thread stack. See #43347. |
Rust implements its own userspace "stack guard"; its purpose is to print a nice error message about "stack overflow" rather than segfaulting. This is applied to each new thread, as well as the main thread:
https://github.com/rust-lang/rust/blob/master/src/libstd/rt.rs#L44
https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/thread.rs#L248
https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/thread.rs#L229
The start address of the stack is calculated effectively via

`pthread_attr_getstack(pthread_getattr_np(pthread_self))`

This is where the problems occur. `pthread_getattr_np` is not defined by POSIX, and the manual page does not specifically define what the exact behaviour is when getting the attributes of the main thread. It works fine for non-main threads because they are usually created with a fixed-sized stack which does not automatically expand, so the start address is pre-defined on those threads.

However, on Linux (and other systems) the real size of the main stack is not determined in advance; it starts off small and then gets automatically expanded via the process described in great detail in recent articles on the stack-clash bug. In practice, with glibc the above series of function calls returns `top-of-stack - stack-rlimit` when called on the main thread. This is 8MB by default on most machines I've seen.

However, most of the space in between is not allocated at the start of the program. For example, a test "Hello World" Rust program has this (after init):
With `ulimit -s unlimited` it looks like this:

OTOH, the Linux stack guard is not a physical guard page but just extra logic that prevents the stack from growing too close to another mmap allocation. If I understand correctly: contrary to the `get_stack_start` function, it does not work based on the stack rlimit, because this could be unlimited. Instead, it works based on the real size of the existing allocated stack. The guard then ensures that the next-highest mapped page remains more than `stack_guard_gap` below the lowest stack address, and if not then it will trigger a segfault.

On ppc64el Debian and other systems (Fedora aarch64, Fedora ppc64be, etc.) the page size is 64KB. Previously, stack_guard_gap was equal to PAGESIZE. Now, it is 256 * PAGESIZE = 16MB, compared to the default stack size limit of 8MB. So now when Linux tries to expand the stack, it sees that the stack is only (8MB - $existing-size) away from the next-highest mmapped page (Rust's own stack guard), which is smaller than stack_guard_gap (16MB), and so it segfaults.

The logic only "didn't fail" before because the stack_guard_gap was much lower than the default stack rlimit. But even then, it would not have been able to perform its intended purpose of detecting a stack overflow, since the kernel's stack-guard logic would have caused a segfault before the real stack ever expanded into Rust's own stack guard.
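As a quick sanity check of those numbers (illustrative arithmetic only):

```rust
// Concrete check of the numbers above: on 64 KiB pages the new
// stack_guard_gap alone exceeds the default 8 MiB stack rlimit, so the
// kernel can never grow the stack toward Rust's guard page at all.
fn main() {
    let page_size: usize = 64 * 1024;
    let stack_guard_gap = 256 * page_size; // 16 MiB
    let stack_rlimit: usize = 8 * 1024 * 1024;
    assert!(stack_guard_gap > stack_rlimit);
    println!(
        "stack_guard_gap = {} MiB > rlimit = {} MiB",
        stack_guard_gap >> 20,
        stack_rlimit >> 20
    );
}
```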
In case my words aren't the best, here is a nice diagram instead, where `[..]` are mapped pages. Now, Linux's stack guard will segfault if there is anything between G and S. For Rust, its own stack guard page at A causes this. Previously, G-S was much smaller and A was lower than G.

AIUI, Linux developers are talking at the moment about the best way to "unbreak" programs that do this - they try not to break userspace. But it is nevertheless incorrect behaviour by Rust to do this anyway, and a better way of doing it should be found. Unfortunately I don't have any better ideas at the moment, since the very notion of "stack start address" for main threads is apparently not set in stone by POSIX or other standards, leading to this sort of misunderstanding between kernel and userspace on where the "start" really is.
I'm not sure about the details of other systems, but if they implement stack guards like Linux does (i.e. against the real allocated stack rather than against a stack rlimit), and the relevant numbers match up like they do above, then Rust's main stack guard would also cause problems there.
CC @arielb1 @bwhacks @cuviper
This causes rust-lang/cargo#4197