This repository has been archived by the owner on Oct 4, 2022. It is now read-only.

Arena allocation using bumpalo (and hashbrown with custom alloc). #115

Closed
wants to merge 5 commits

Conversation

cfallin
Member

@cfallin cfallin commented Jan 16, 2021

This PR modifies both linear scan and BT allocators (and their shared
analyses) to pervasively use `bumpalo` for memory allocation from an
arena that lives for the duration of a function's register allocation.
This should result in significant performance improvements (I have not
yet measured, though), as we have seen malloc/free to be significant
parts of CPU-time profiles in the past.

Care must be taken to *not* use normal heap-allocated types inside of a
`BumpVec`, `BumpMap` or `BumpSet`, because the arena does not invoke
`Drop` impls when it is finally discarded (this is what allows freeing
the memory to be so fast; we just discard the entire block at once). As
such, many uses of `SmallVec` and some other types had to change. Some
cases that I had missed were picked up by LeakSanitizer in fuzzing runs,
and these errors no longer occur, so I'm relatively satisfied that we've
covered them all.

This changes the public interface -- the caller must provide the arena,
and some of the returned metadata is allocated within it -- so it will
require corresponding updates in e.g. Cranelift. That will come in
subsequent PRs.

Thanks to @bnjbvr for noting in our discussion this morning that memory
allocation is still significant in profiles and thus inspiring me to
dive into this :-)
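
The `Drop` caveat above can be seen concretely with a std-only toy arena (a hypothetical stand-in for `bumpalo`, not its actual implementation): a value moved into the arena never has its destructor scheduled, so any heap-owning type placed inside it would leak.

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many times Noisy::drop actually runs.
    static DROPS: Cell<u32> = Cell::new(0);
}

// A type whose Drop impl we can observe.
struct Noisy(#[allow(dead_code)] u8);
impl Drop for Noisy {
    fn drop(&mut self) {
        DROPS.with(|d| d.set(d.get() + 1));
    }
}

// Toy arena: one buffer, a bump offset, and no record of what lives
// inside -- so there is nothing to run Drop impls on when it goes away.
struct ToyArena {
    buf: Vec<u8>,
    used: usize,
}

impl ToyArena {
    fn new(cap: usize) -> Self {
        ToyArena { buf: vec![0; cap], used: 0 }
    }

    // Move `val` into the arena; its destructor is never scheduled.
    fn alloc<T>(&mut self, val: T) -> *mut T {
        let align = std::mem::align_of::<T>();
        let start = (self.used + align - 1) & !(align - 1);
        let end = start + std::mem::size_of::<T>();
        assert!(end <= self.buf.len(), "arena exhausted");
        self.used = end;
        let ptr = unsafe { self.buf.as_mut_ptr().add(start) } as *mut T;
        unsafe { ptr.write(val) }; // ownership moves in; no drop glue
        ptr
    }
}

fn main() {
    {
        let mut arena = ToyArena::new(256);
        let _p = arena.alloc(Noisy(0));
        // The whole buffer is discarded here in one cheap free...
    }
    // ...but Noisy::drop never ran; anything with a real destructor
    // (a heap-spilled SmallVec, a String, etc.) would have leaked.
    assert_eq!(DROPS.with(|d| d.get()), 0);
    println!("drops observed: {}", DROPS.with(|d| d.get()));
}
```

`bumpalo` behaves the same way by design: dropping the `Bump` frees its chunks wholesale without walking their contents, which is exactly why heap-owning element types had to be removed.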

@cfallin cfallin requested a review from bnjbvr January 16, 2021 04:17
@cfallin
Member Author

cfallin commented Jan 16, 2021

Also, note that I had to make a slight tweak to rust-lang/hashbrown to expose one type; I'll create a separate PR over there. We'll need that crate to do a point release that we can refer to before this PR merges.

bors added a commit to rust-lang/hashbrown that referenced this pull request Jan 16, 2021
Export AllocError as well as Allocator.

This PR re-exports the `AllocError` type alongside `Allocator` and others from the `raw` submodule.

This seems to be necessary to implement the `Allocator` trait from outside the crate (and I want to use exported types from this crate so that I don't have to depend on the unstable allocator API directly). It allowed me to instantiate hashbrown hashmaps using `bumpalo` over in bytecodealliance/regalloc.rs#115; I couldn't work out another way to do this. If there is a more direct way, please do let me know! (Thanks for this excellent crate and the configurable allocator interface, in any case!)
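
To make concrete why both names must be visible, here is a std-only sketch using local stand-ins for `Allocator` and `AllocError` (hypothetical copies mirroring the *shape* of hashbrown's `raw` items, not the real trait): implementing the trait outside the crate requires naming the trait and its error type.

```rust
use std::alloc::Layout;
use std::cell::Cell;
use std::ptr::NonNull;

// Local stand-ins for hashbrown's `raw::AllocError` / `raw::Allocator`
// (illustration only; the real items live in hashbrown).
#[derive(Debug)]
pub struct AllocError;

pub unsafe trait Allocator {
    fn allocate(&self, layout: Layout) -> Result<NonNull<u8>, AllocError>;
    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout);
}

// A toy bump arena. Note that writing this impl requires naming *both*
// `Allocator` and `AllocError` -- which is why the re-export matters.
pub struct ToyBump {
    buf: Vec<u8>,
    used: Cell<usize>,
}

impl ToyBump {
    pub fn with_capacity(cap: usize) -> Self {
        ToyBump { buf: vec![0; cap], used: Cell::new(0) }
    }
}

unsafe impl Allocator for ToyBump {
    fn allocate(&self, layout: Layout) -> Result<NonNull<u8>, AllocError> {
        // Align the bump offset (simplified: offsets, not absolute
        // addresses, are aligned -- fine for a sketch).
        let start = (self.used.get() + layout.align() - 1) & !(layout.align() - 1);
        let end = start.checked_add(layout.size()).ok_or(AllocError)?;
        if end > self.buf.len() {
            return Err(AllocError); // arena exhausted
        }
        self.used.set(end);
        let base = self.buf.as_ptr() as *mut u8;
        Ok(unsafe { NonNull::new_unchecked(base.add(start)) })
    }

    unsafe fn deallocate(&self, _ptr: NonNull<u8>, _layout: Layout) {
        // Bump arenas free nothing individually; memory is reclaimed
        // when the whole arena is dropped.
    }
}

fn main() {
    let bump = ToyBump::with_capacity(64);
    let l = Layout::from_size_align(8, 8).unwrap();
    let a = bump.allocate(l).unwrap();
    let b = bump.allocate(l).unwrap();
    assert_ne!(a, b);                // distinct slots handed out
    assert_eq!(bump.used.get(), 16); // two 8-byte bumps
    println!("ok");
}
```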
@cfallin
Member Author

cfallin commented Jan 16, 2021

I just updated to use `FxHashMap` again, after some discussion with Julian and Ben; Ben's initial benchmark showed this to be slower than expected, but we think the earlier revert to the default hasher could explain it. Hopefully this one's faster!

cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 16, 2021
@cfallin
Member Author

cfallin commented Jan 17, 2021

Added a `BumpSmallVec` that behaves like a `SmallVec` except that it uses arena storage when it outgrows its inline space. I think I changed every place that originally used a `SmallVec` to use this.

@bnjbvr, cargo-deny is throwing a CI error because we have `cfg-if` 1.1.0 deep in one dependency tree, and `cfg-if` 1.0.0 deep in another. This is new since adding hashbrown in this PR. Should we perhaps be more permissive about such duplication?

cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 17, 2021
@Amanieu

Amanieu commented Jan 17, 2021

> Added a `BumpSmallVec` that behaves like a `SmallVec` except uses arena storage when it outgrows its inline space. I think I changed every place a `SmallVec` was originally used to use this.

It might be worth simply using `BumpVec`s initialized with `with_capacity` here instead. This avoids the branch on every access to the `SmallVec`, and the initial allocation is very quick anyway.
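
The trade-off can be sketched with std types alone (`Vec::with_capacity` standing in for a `BumpVec` created with an initial capacity; the bound `n` is an assumption of this illustration): eager reservation means pushes never reallocate, and every element access is a plain offset with no inline-vs-spilled branch.

```rust
fn main() {
    let n = 64; // assumed upper bound known when the vector is created
    let mut v: Vec<u32> = Vec::with_capacity(n);
    let base = v.as_ptr();
    for i in 0..n as u32 {
        v.push(i); // capacity was reserved up front: no growth reallocation
    }
    // Storage never moved, so each push (and later access) is a plain
    // offset from `base` -- no "inline or spilled?" check as in SmallVec.
    assert_eq!(v.as_ptr(), base);
    assert!(v.capacity() >= n);
    println!("len = {}, cap = {}", v.len(), v.capacity());
}
```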

@cfallin
Member Author

cfallin commented Jan 17, 2021

> Added a `BumpSmallVec` that behaves like a `SmallVec` except uses arena storage when it outgrows its inline space. I think I changed every place a `SmallVec` was originally used to use this.
>
> It might be worth simply using `BumpVec`s initialized with `with_capacity` here instead. This avoids the branch on every access to the `SmallVec`, and the initial allocation is very quick anyway.

Indeed, that's what we had before that change; I was curious whether the pointer indirection / reduced locality of always placing data on the heap (vs. keeping it on the stack when small) might be causing a performance loss. I just ran some tests (after pushing the update) and it seems it doesn't make much of a difference either way.

@bnjbvr, a `perf report` view on a run with LSRA shows ~0.6% of time spent in `clear_page_rep` in the kernel without the arena-alloc change, vs. 1.7% with it; since that's a reasonable proxy for "demand-allocated fresh memory" (anonymous mappings are zeroed on first touch), I think we're touching more memory and losing locality as a consequence of not freeing and reusing data at a finer granularity.

I suspect that if we put some thought into a hierarchical scheme of some sort -- subarenas within the arena -- maybe we could get some of that back. There's no reason we shouldn't be able to get a lot of benefit from bump-allocating e.g. all of the register mention lists and range/fragment lists; maybe there are smaller temporary objects that we should be freeing earlier?

@cfallin
Member Author

cfallin commented Feb 9, 2022

Cleaning up old PRs and closing this now.

@cfallin cfallin closed this Feb 9, 2022