-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support alloca #618
Comments
While we don’t have segmented stack anymore, I don’t know how we could make it work with stack probes. Maybe do another probe every time alloca gets called? |
Is anyone drafting an rfc for this? |
I don't think so. I believe there's a pretty large contingent that doesn't want to see alloca at all. On Feb 29, 2016, 07:19 -0500, [email protected], wrote:
|
But why? |
TL;DR for the link: it would be nice to be able to dynamically allocate an array on the stack, for passing a reference down. This is to avoid more costly heap allocation. @tomaka has a real-world use case with the Vulkano wrappers. |
That particular use case would be better covered by |
I thought so too. But I think there is many other use cases too, especially in embedded and kernel development. |
Ha! I didn't even advertise that file. I think I linked it once on IRC. I could use smallvec for most situations. In fact I already use it in glium. It's more of an ideological complaint as using smallvec would waste a lot of space, and not a pragmatic problem. |
Type-level integers will certainly solve some parts of the problem. But alloca would certainly be ideal. My usecases largely stems from kernel code (Redox), etc. Heap allocation is never ideal in these environments. |
I find myself wishing there were stack-allocated arrays quite often. One use-case that comes up: allocate a temporary array, do some processing, tear it down (in the same function). The alternatives to doing it on stack are either temporary heap allocation (heavy price) or allocate and cache (inconvenience, sometimes not even feasible, e.g. when the array is allocated by Iterator::collect(); with stack-allocated arrays it would be possible to extend the collections API to allow collecting into stack-based arrays). |
Another problem of SmallVec is that the generated assembly is really more verbose and sub-optimal. The following code, that uses fixed-size arrays: fn test_array(val1: u32, val2: u32) {
let array = [val1, val2];
ptr(array.as_ptr());
} ...Generates the following assembly: subq $40, %rsp
movabsq $42949672965, %rax
movq %rax, 32(%rsp)
leaq 32(%rsp), %rcx
callq ptr2
addq $40, %rsp
retq Very nice and clean. Using The following code, which uses a Vec: fn test_vec(val1: u32, val2: u32) {
let array = vec![val1, val2];
ptr(array.as_ptr());
} ...Generates this: pushq %rsi
subq $32, %rsp
movl $8, %ecx
movl $4, %edx
callq __rust_allocate
movq %rax, %rsi
testq %rsi, %rsi
je .LBB2_1
movabsq $42949672965, %rax
movq %rax, (%rsi)
movq %rsi, %rcx
callq ptr2
movl $8, %edx
movl $4, %r8d
movq %rsi, %rcx
addq $32, %rsp
popq %rsi
rex64 jmp __rust_deallocate
.LBB2_1:
callq _ZN5alloc3oom3oom17h72360ae2d301e7a7E
ud2 Again you can't do much better. Using a Vec obviously has the drawback that it allocates memory. Finally, this code: fn test_smallvec(val1: u32, val2: u32) {
let mut array: SmallVec<[_; 2]> = SmallVec::new();
array.push(val1);
array.push(val2);
ptr(array.as_ptr());
} ...Generates this: https://gist.github.com/tomaka/dca0e20760486b7933b0409add4434e6 I'm leaving in 10 minutes so I didn't take time to benchmark, but seeing the assembly I already know that the smallvec version is far worse than the fixed-sized-arrays version. Now I'm not even sure it's better than the |
Alternatively, this code: fn test_smallvec(val1: u32, val2: u32) {
let array: SmallVec<[_; 2]> = [val1, val2].iter().cloned().collect();
ptr(array.as_ptr());
} Generates this: https://gist.github.com/tomaka/838e28e480a198d7c4d4bd448d0bf6be |
Ew... It should be possible to significantly improve this by tweaking the implementation of smallvec. Right now, To avoid excessive code in non-constant cases, it would probably help to switch from the current high-level enum representation, which requires switching on the variant each time the capacity or base pointer is loaded, to having dedicated fields for capacity and base pointer which would initially be set as appropriate for the stack buffer. (though that would break moves, sigh.) Either that or split |
Would it be possible to support alloca through a closure approach, like:
Alloca could potentially be implemented through (in order of preference) some LLVM intrinsic, built-in assembly, or even FFI. |
My thoughts are:
|
We can probably "just" make |
@jmesmon Sure - it would be nicer with something that takes As for the unsized type instead of arrays - I haven't seen an example of that up the thread, maybe you could add one so we don't forget this use case? @eddyb, oh, interesting. As for not allowing it by default, maybe a |
It's pretty unclear to me why people want alloca over reserving a worst-case uninitialized buffer on the stack, in the form of a fixed-size array, and using however much of it is required. Kernel code is mentioned; but if you're in the kernel, your stacks are probably not pageable, and taking up RAM whether you use them or not. And if you don't account for worst case stack usage in general, you're liable to crash. So the only way alloca can help is if certain possible call stacks will fit in the available stack space with the actual array sizes required, but not if each function on the stack uses its individual worst case. And either you've carefully measured this and you're sure there's no way for unexpected input to overflow the stack, or else you use some complicated scheme to check at runtime, keeping in mind that after calling alloca, you need enough space left not just for the current function but for anything it might call (including recursively). Oh, and if you mess this up it may be exploitable (depending on whether stack probes are enabled and you have guard pages around your stacks). Of course, allocating large fixed-size buffers can overflow the stack too, but it's easier to catch, because it'll crash whenever the call stack in question is reached rather than depending on the size of the arrays. User code has a bit more of a case. Concerns about unpredictable stack overflow still apply, which is one reason usage of alloca and VLAs in C tends to be heavily discouraged. But it's not a completely binary choice between "fits" and "doesn't fit", since previously unused stack pages may not be taking up real RAM. If you have a big stack buffer, as long as you keep it uninitialized, it won't fault any fully unused 4kb-or-so pages contained in it (even with stack probes, if properly implemented), but whatever goes on the stack /after/ it might be touching pages past the normal high water mark, wasting memory. But I'm a bit skeptical that that matters much in most cases. Especially if your worst case buffer isn't all that large - if it doesn't hit the previous high water mark, it doesn't matter. And if you don't have a lot of threads, the absolute worst possible outcome is "only" faulting the entirety of each thread's stack at a few MB each. |
@comex, thing is, there is many cases where there simply is no worst-case size.
You can already exploit the stack when guard pages are disabled, so alloca isn't particularly special in that sense. |
@ticki Can you elaborate? You say there's no worst-case size, but there's certainly a size past which you'll overflow the stack. So are you dealing with purely trusted input, or do you intend to set a limit based on dynamically calculating the amount of stack space remaining?
(It's special in that if you have guard pages but not stack probes, as tends to be the case for C/C++ programs in userland and some kernels, unchecked allocas let you easily jump past the guard page. Very large fixed-size arrays (4kb is the theoretical minimum but larger tends to be required in practice) can also do that, but they're not that common, and trickier because: whether or not there is a guard page, unchecked allocas allow relatively precise control over where the stack ends up, making exploitation considerably easier. If there isn't a guard page, the biggest threat is probably recursion, just because it's more common than either large arrays or alloca. Of course, as long as you have both guard pages and stack probes, none of this matters since the worst case is "just" a crash.) |
Which I'd expect to abort the same way other allocation failures abort on memory exhaustion. |
@comex, thanks for your well written thoughts. Makes me wonder why I wanted alloca in the first place. But if you're making a library which is supposed to work on both, say, a small device with very little RAM, and a big laptop, and so the small device might have up to 10 items in the array and the laptop might have up to 1000, you can't just hard-code "1000" for your fixed size array, because that'll overflow the small device. |
Obviously there is an upperbound, but what I mean is that it isn't reasonable to, say, make an array of that size, since it is in a reasonable size say 99% of the time. Yet, a special input could yield the upperbound, which would result in stack overflow. Reserving this much space for that worst-case is simply not plausible. |
I'm a bit confused about the meaning of your comment - by "is in" and "yet" do you mean "isn't" and "yes"? I guess you're saying your input is indeed trusted? |
Looks like the RFC PR was never posted here, so: #1808 |
I've recently ran into an issue similar to the ones mentioned earlier. I have a very common pattern in my library where I take a slice of my own types as a function argument, map and collect that slice to a vec of raw pointers, and pass that vec's raw pointer to a FFI call (along with the size), So, it seemed to me that the obvious optimization would be to allocate that contiguous block of memory on the stack instead of the heap, since it is only needed for the scope of the function call and isn't ever returned. Because this pattern is so common, it means many heap allocations are performed on the regular, even though they don't need to be. I get that overflowing the stack is a legitimate concern and I don't know the answer to that. But I do know that for my library, the average input slice is very likely to be small enough not to overflow the stack, and so not being able to allocate a runtime sized array on the stack is a big hindrance to performance. I also feel strongly that setting some arbitrarily chosen compile-time upper bound to the input slice size just to be able to allocate a fixed size stack array (ie via the use of arrayvec, smallvec, or other similar crates) seems like an artificial limitation upon my crate's users and so is a lackluster workaround. Just throwing out an idea that I don't think was mentioned: Would it at all be possible for rust to see if there is enough space on the stack to allocate such an array, and if not, panic? Or some sort of allocation function/macro that returns |
You can definitely add such a function on every Tier 1 platform Rust currently supports. It'd just have to read its own SP and compare it with the OS-defined stack limit. This however raises questions for no_std code, because stack probes are transparent--they check for whether it's possible to access the page by accessing the page. However, such a function would have to interface with the OS explicitly. This is probably worth exploring as a separate crate. |
Well, an alloca only lives for the scope of the function it's created in, right? So I don't think a separate crate could return a stack allocated |
Since llvm has support for eliding allocations now I think it would be much better to ensure the Vec can receive this optimization (in at least some cases) and recommend that. This leads to natural code and simple APIs. For example I think the following should be optimized to a stack allocation.
We could make a couple of functions like this and add a test that there are no heap allocations to prevent regressions. Of course this wouldn't be a guaranteed optimization but it can be made reliable enough for almost all use cases. |
This doesn't help no_std code at all since there's no Vec. |
This is a good point. While I think my solution should still be the recommended approach for std code I see that we would need some sort of no-allocation guaranteed solution. |
@kevincox Your idea about automatically eliding allocations in Vec is good. It doesn't require an RFC (AFAICT) and it is complementary to this feature request, so I recommend filing a new issue in the rust-lang/rust issue tracker requesting that feature. |
I've started to try out some of the proposed ideas in a crate here: https://github.com/TheDan64/scoped_alloca/ It works by creating a wrapper function to clang alloca; and hoping rust's LTO optimizations will inline the wrapper. Seems to work in release(at least in 1.26), but not in debug. (I wonder if we can force the required optimizations for debug?) In particular I've started with a version of @diwic's idea which collects an let v = vec![1, 2, 3, 4];
let iter = v.iter().map(|v| v + 4);
let res: Result<i32, ()> = alloca_collect(iter, |alloca_slice| alloca_slice.iter().sum::<i32>()); Next, I'm planning on looking at the more unsafe/uninitialized variants. I would be happy to get feedback/contributions from anyone who is interested in this topic |
This is really not the right way to go (that would be to implement #1909). Your solution is inherently fragile. |
Sure; it's more exploratory as to how it could look if something similar was integrated into the language rather than being for actual use right now. |
You could implement a Rust alloca intrinsic very easily, but that would (I think) need to go through an RFC first to be integrated. |
We need this feature! Is there any RFC around it? If not, I might write it! 😍 |
Yeah I just read it! Amazing work peeps! <3 |
Closing in favor of rust-lang/rust#48055. |
Am I correct that unsized rvalues have not progressed in a couple years, in large part due to LLVM not handling alignment correctly for alloca? https://github.com/playXE/alloca-rs seemingly solves this problem. |
Issue by thestinger
Friday May 03, 2013 at 01:35 GMT
For earlier discussion, see rust-lang/rust#6203
This issue was labelled with: A-llvm, I-wishlist in the Rust repository
It's probably going to be tricky to get this working with the segmented stack implementation.
The text was updated successfully, but these errors were encountered: