This repository has been archived by the owner on Oct 4, 2022. It is now read-only.

Arena allocation using bumpalo (and hashbrown with custom alloc). #115

Closed
wants to merge 5 commits

Conversation

cfallin
Member

@cfallin cfallin commented Jan 16, 2021

This PR modifies both linear scan and BT allocators (and their shared
analyses) to pervasively use `bumpalo` for memory allocation from an
arena that lives for the duration of a function's register allocation.
This should result in significant performance improvements (I have not
yet measured, though), as we have seen malloc/free to be significant
parts of CPU-time profiles in the past.

Care must be taken to *not* use normal heap-allocated types inside of a
`BumpVec`, `BumpMap` or `BumpSet`, because the arena does not invoke
`Drop` impls when it is finally discarded (this is what allows freeing
the memory to be so fast; we just discard the entire block at once). As
such, many uses of `SmallVec` and some other types had to change. Some
cases that I had missed were picked up by LeakSanitizer in fuzzing runs,
and these errors no longer occur, so I'm relatively satisfied that we've
covered them all.

This changes the public interface -- the caller must provide the arena,
and some of the returned metadata is allocated within it -- so it will
require corresponding updates in e.g. Cranelift. That will come in
subsequent PRs.

Thanks to @bnjbvr for noting in our discussion this morning that memory
allocation is still significant in profiles and thus inspiring me to
dive into this :-)
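
The `Drop` caveat above can be seen concretely with a std-only toy arena (a hypothetical stand-in for `bumpalo`, not its actual implementation): a value moved into the arena never has its destructor scheduled, so any heap-owning type placed inside it would leak.

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many times Noisy::drop actually runs.
    static DROPS: Cell<u32> = Cell::new(0);
}

// A type whose Drop impl we can observe.
struct Noisy(#[allow(dead_code)] u8);
impl Drop for Noisy {
    fn drop(&mut self) {
        DROPS.with(|d| d.set(d.get() + 1));
    }
}

// Toy arena: one buffer, a bump offset, and no record of what lives
// inside -- so there is nothing to run Drop impls on when it goes away.
struct ToyArena {
    buf: Vec<u8>,
    used: usize,
}

impl ToyArena {
    fn new(cap: usize) -> Self {
        ToyArena { buf: vec![0; cap], used: 0 }
    }

    // Move `val` into the arena; its destructor is never scheduled.
    fn alloc<T>(&mut self, val: T) -> *mut T {
        let align = std::mem::align_of::<T>();
        let start = (self.used + align - 1) & !(align - 1);
        let end = start + std::mem::size_of::<T>();
        assert!(end <= self.buf.len(), "arena exhausted");
        self.used = end;
        let ptr = unsafe { self.buf.as_mut_ptr().add(start) } as *mut T;
        unsafe { ptr.write(val) }; // ownership moves in; no drop glue
        ptr
    }
}

fn main() {
    {
        let mut arena = ToyArena::new(256);
        let _p = arena.alloc(Noisy(0));
        // The whole buffer is discarded here in one cheap free...
    }
    // ...but Noisy::drop never ran; anything with a real destructor
    // (a heap-spilled SmallVec, a String, etc.) would have leaked.
    assert_eq!(DROPS.with(|d| d.get()), 0);
    println!("drops observed: {}", DROPS.with(|d| d.get()));
}
```

`bumpalo` behaves the same way by design: dropping the `Bump` frees its chunks wholesale without walking their contents, which is exactly why heap-owning element types had to be removed.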

@cfallin cfallin requested a review from bnjbvr January 16, 2021 04:17
@cfallin
Member Author

cfallin commented Jan 16, 2021

Also, note that I had to make a slight tweak to rust-lang/hashbrown to expose one type; I'll create a separate PR over there. We'll need that crate to do a point release that we can refer to before this PR merges.

bors added a commit to rust-lang/hashbrown that referenced this pull request Jan 16, 2021
Export AllocError as well as Allocator.

This PR re-exports the `AllocError` type alongside `Allocator` and others from the `raw` submodule.

This seems to be necessary to implement the `Allocator` trait from outside the crate (and I want to use exported types from this crate so that I don't have to depend on the unstable allocator API directly). It allowed me to instantiate hashbrown hashmaps using `bumpalo` over in bytecodealliance/regalloc.rs#115; I couldn't work out another way to do this. If there is a more direct way, please do let me know! (Thanks for this excellent crate and the configurable allocator interface, in any case!)
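
To make concrete why both names must be visible, here is a std-only sketch using local stand-ins for `Allocator` and `AllocError` (hypothetical copies mirroring the *shape* of hashbrown's `raw` items, not the real trait): implementing the trait outside the crate requires naming the trait and its error type.

```rust
use std::alloc::Layout;
use std::cell::Cell;
use std::ptr::NonNull;

// Local stand-ins for hashbrown's `raw::AllocError` / `raw::Allocator`
// (illustration only; the real items live in hashbrown).
#[derive(Debug)]
pub struct AllocError;

pub unsafe trait Allocator {
    fn allocate(&self, layout: Layout) -> Result<NonNull<u8>, AllocError>;
    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout);
}

// A toy bump arena. Note that writing this impl requires naming *both*
// `Allocator` and `AllocError` -- which is why the re-export matters.
pub struct ToyBump {
    buf: Vec<u8>,
    used: Cell<usize>,
}

impl ToyBump {
    pub fn with_capacity(cap: usize) -> Self {
        ToyBump { buf: vec![0; cap], used: Cell::new(0) }
    }
}

unsafe impl Allocator for ToyBump {
    fn allocate(&self, layout: Layout) -> Result<NonNull<u8>, AllocError> {
        // Align the bump offset (simplified: offsets, not absolute
        // addresses, are aligned -- fine for a sketch).
        let start = (self.used.get() + layout.align() - 1) & !(layout.align() - 1);
        let end = start.checked_add(layout.size()).ok_or(AllocError)?;
        if end > self.buf.len() {
            return Err(AllocError); // arena exhausted
        }
        self.used.set(end);
        let base = self.buf.as_ptr() as *mut u8;
        Ok(unsafe { NonNull::new_unchecked(base.add(start)) })
    }

    unsafe fn deallocate(&self, _ptr: NonNull<u8>, _layout: Layout) {
        // Bump arenas free nothing individually; memory is reclaimed
        // when the whole arena is dropped.
    }
}

fn main() {
    let bump = ToyBump::with_capacity(64);
    let l = Layout::from_size_align(8, 8).unwrap();
    let a = bump.allocate(l).unwrap();
    let b = bump.allocate(l).unwrap();
    assert_ne!(a, b);                // distinct slots handed out
    assert_eq!(bump.used.get(), 16); // two 8-byte bumps
    println!("ok");
}
```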
@cfallin
Member Author

cfallin commented Jan 16, 2021

I just updated to use `FxHashMap` again, after some discussion with Julian and Ben; Ben's initial benchmark showed this to be slower than expected, but we think the earlier revert to the default hasher could explain it. Hopefully this one's faster!

cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 16, 2021
@cfallin
Member Author

cfallin commented Jan 17, 2021

Added a `BumpSmallVec` that behaves like a `SmallVec` except that it uses arena storage when it outgrows its inline space. I think I changed every place that originally used a `SmallVec` to use this.

@bnjbvr, cargo-deny is throwing a CI error because we have `cfg-if` 1.1.0 deep in one dependency tree, and `cfg-if` 1.0.0 deep in another. This is new since adding hashbrown in this PR. Should we perhaps be more permissive about such duplication?

cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 17, 2021
@Amanieu

Amanieu commented Jan 17, 2021

> Added a `BumpSmallVec` that behaves like a `SmallVec` except uses arena storage when it outgrows its inline space. I think I changed every place a `SmallVec` was originally used to use this.

It might be worth simply using `BumpVec`s initialized with `with_capacity` here instead. This avoids the branch on every access to the `SmallVec`, and the initial allocation is very quick anyway.
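
The trade-off can be sketched with std types alone (`Vec::with_capacity` standing in for a `BumpVec` created with an initial capacity; the bound `n` is an assumption of this illustration): eager reservation means pushes never reallocate, and every element access is a plain offset with no inline-vs-spilled branch.

```rust
fn main() {
    let n = 64; // assumed upper bound known when the vector is created
    let mut v: Vec<u32> = Vec::with_capacity(n);
    let base = v.as_ptr();
    for i in 0..n as u32 {
        v.push(i); // capacity was reserved up front: no growth reallocation
    }
    // Storage never moved, so each push (and later access) is a plain
    // offset from `base` -- no "inline or spilled?" check as in SmallVec.
    assert_eq!(v.as_ptr(), base);
    assert!(v.capacity() >= n);
    println!("len = {}, cap = {}", v.len(), v.capacity());
}
```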

@cfallin
Member Author

cfallin commented Jan 17, 2021

> Added a `BumpSmallVec` that behaves like a `SmallVec` except uses arena storage when it outgrows its inline space. I think I changed every place a `SmallVec` was originally used to use this.
>
> It might be worth simply using `BumpVec`s initialized with `with_capacity` here instead. This avoids the branch on every access to the `SmallVec`, and the initial allocation is very quick anyway.

Indeed, that's what we had before that change; I was curious whether the pointer indirection / reduced locality of always placing data on the heap (vs. keeping it on the stack when small) might be causing a performance loss. I just ran some tests (after pushing the update) and it seems it doesn't make much of a difference either way.

@bnjbvr, a `perf report` view on a run with LSRA shows ~0.6% of time spent in `clear_page_rep` in the kernel without the arena-alloc change, vs. 1.7% with it; since that's a reasonable proxy for "demand-allocated fresh memory" (anonymous mappings are zeroed on first touch), I think we're touching more memory and losing locality as a consequence of not freeing and reusing data at a finer granularity.

I suspect that if we put some thought into a hierarchical scheme of some sort -- subarenas within the arena -- maybe we could get some of that back. There's no reason we shouldn't be able to get a lot of benefit from bump-allocating e.g. all of the register mention lists and range/fragment lists; maybe there are smaller temporary objects that we should be freeing earlier?

@cfallin
Member Author

cfallin commented Feb 9, 2022

Cleaning up old PRs and closing this now.

@cfallin cfallin closed this Feb 9, 2022