Implement the memory64 proposal in Wasmtime #3153

Merged · 10 commits merged into bytecodealliance:main from alexcrichton:memory64 on Aug 12, 2021

Conversation

alexcrichton (Member)

This commit implements the WebAssembly [memory64 proposal][proposal] in
both Wasmtime and Cranelift. Cranelift ended up needing very little work
since most of it had already been prepared for 64-bit memories at one
point or another. Most of the work in Wasmtime is largely refactoring,
changing a bunch of `u32` values to something else.

A number of internal and public interfaces are changing as a result of
this commit, for example:

  • Accessors on `wasmtime::Memory` that work with pages now all return
    `u64` unconditionally rather than `u32` (see the sketch after this
    list). This makes it possible to accommodate 64-bit memories with
    this API, but we may also want to consider `usize` here at some point
    since the host can't grow past `usize`-limited pages anyway.

  • The `wasmtime::Limits` structure is removed in favor of
    minimum/maximum methods on table/memory types.

  • Many libcall intrinsics called by JIT code now unconditionally take
    `u64` arguments instead of `u32`. Return values are `usize`, however,
    since the return value, if successful, is always bounded by host
    memory while arguments can come from any guest.

  • The `heap_addr` clif instruction now takes a 64-bit offset argument
    instead of a 32-bit one. It turns out that the legalization of
    `heap_addr` already worked with 64-bit offsets, so this change was
    fairly trivial to make.

  • The runtime implementation of mmap-based linear memories has changed
    to largely work in `usize` quantities in its API, and in bytes
    instead of pages. This simplifies various aspects and reflects that
    mmap-memories are always bounded by `usize`, since that's what the
    host uses to address things; additionally, most calculations care
    about bytes rather than pages except at the very edge where we're
    going to/from wasm.
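
To make the first point concrete, here's a hedged sketch of the changed
`wasmtime::Memory` surface (method names per the `wasmtime` crate around
this change; treat the exact signatures as illustrative):

```rust
use wasmtime::{Memory, Store};

// Page-based accessors now traffic in u64 rather than u32.
fn report(mem: &Memory, store: &mut Store<()>) -> anyhow::Result<()> {
    let pages: u64 = mem.size(&mut *store); // was u32 before this change
    // `grow` takes a u64 page delta and returns the previous size in pages.
    let prev: u64 = mem.grow(&mut *store, 1)?;
    println!("memory was {prev} pages, now {} pages", pages + 1);
    Ok(())
}
```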

Overall I've tried to minimize the number of `as` casts, using checked
`try_from` and checked arithmetic with either error handling or explicit
`unwrap()` calls to tell us about bugs in the future. Most locations have
relatively obvious things to do, with various implications on various
hosts, and I think they should all be roughly the right shape, but time
will tell. I mostly relied on the compiler complaining that various types
didn't line up to figure out the type-casting, and I manually audited
some of the more obvious locations. I suspect we have a number of hidden
locations that will panic on 32-bit hosts if 64-bit modules try to run
there, but otherwise I think we should be generally ok (famous last
words). In any case I naturally wouldn't want to enable this by default
until we've fuzzed it for some time.
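
As a rough illustration of that checked style (a hypothetical helper, not
the exact Wasmtime code), the point is to let conversions fail loudly
rather than silently truncate:

```rust
use std::convert::TryFrom;

/// Convert a 64-bit wasm page count into a host byte size without `as` casts.
fn pages_to_bytes(pages: u64, page_size: usize) -> Option<usize> {
    // `try_from` fails (rather than truncating) on 32-bit hosts when the
    // guest asks for more pages than the host can address...
    let pages = usize::try_from(pages).ok()?;
    // ...and checked multiplication fails (rather than wrapping) on overflow.
    pages.checked_mul(page_size)
}
```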

In terms of the actual underlying implementation, no one should expect
memory64 to be all that fast. Right now it's implemented with
"dynamic" heaps which have a few consequences:

  • All memory accesses are bounds-checked (see the sketch after this
    list). I'm not sure how aggressively Cranelift tries to optimize out
    bounds checks, but I suspect not a ton, since we haven't stressed
    this much historically.

  • Heaps are always precisely sized. This means that every call to
    `memory.grow` will incur a `memcpy` of memory from the old heap to
    the new. We probably want to at least look into `mremap` on Linux and
    otherwise try to implement schemes where dynamic heaps have some
    reserved pages to grow into to help amortize the cost of
    `memory.grow`.
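
For a sense of what the bounds-checking point means in practice, here's a
hedged sketch in Rust of the check every access conceptually incurs with
dynamic heaps (the real check is emitted as CLIF by the code translator;
all names here are illustrative):

```rust
/// Compute the address for a wasm load/store against a dynamic heap,
/// failing (here: `Err`) if any part of the access falls out of bounds.
fn checked_access(heap_len: u64, addr: u64, offset: u64, access_size: u64) -> Result<u64, ()> {
    // The effective end is addr + offset + access_size; each addition can
    // overflow a u64, so use checked arithmetic throughout.
    let end = addr
        .checked_add(offset)
        .and_then(|end| end.checked_add(access_size))
        .ok_or(())?;
    if end <= heap_len {
        Ok(addr + offset) // in bounds: the address to actually access
    } else {
        Err(()) // out of bounds: becomes a wasm trap
    }
}
```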

The memory64 spec test suite is now scheduled to run on CI, but as with
all the other spec test suites it's really not all that comprehensive.
I've tried adding more tests for basic things as I've had to implement
guards for them, but I wouldn't consider the testing adequate from just
this PR itself. I did take care in one test to actually allocate a 4GiB+
heap and then avoid running that in the pooling allocator or in
emulation, because otherwise it may fail or take excessively long.

[proposal]: https://github.com/WebAssembly/memory64/blob/master/proposals/memory64/Overview.md

@bjorn3 (Contributor) commented Aug 5, 2021

> Heaps are always precisely sized. This means that every call to
> `memory.grow` will incur a `memcpy` of memory from the old heap to the
> new. We probably want to at least look into `mremap` on Linux and
> otherwise try to implement schemes where dynamic heaps have some
> reserved pages to grow into to help amortize the cost of `memory.grow`.

I believe mmap has a flag that prevents overwriting already-mapped memory. Combining this with MAP_FIXED-style placement right after the existing heap would allow growing it without moving, at least part of the time.
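
Presumably this refers to Linux's MAP_FIXED_NOREPLACE, which fails instead
of clobbering an existing mapping. A rough sketch of the idea, assuming
the libc crate on a Linux host (illustrative only, not code from this PR):

```rust
/// Try to extend a linear memory in place by mapping fresh pages directly
/// after its current end; returns false if that address range is taken.
unsafe fn try_grow_in_place(heap_end: *mut u8, extra_bytes: usize) -> bool {
    let p = libc::mmap(
        heap_end.cast(),
        extra_bytes,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_FIXED_NOREPLACE,
        -1,
        0,
    );
    if p == libc::MAP_FAILED {
        return false; // range occupied (or other failure): fall back to copy
    }
    if p != heap_end.cast() {
        // Old kernels ignore MAP_FIXED_NOREPLACE and treat the address as a
        // hint; undo the stray mapping and fall back to move-and-copy.
        libc::munmap(p, extra_bytes);
        return false;
    }
    true // heap extended contiguously; no memcpy needed
}
```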


@alexcrichton (Member, Author)

Hm ok so I may need some assistance with that error. Currently we have:

```rust
enum InstructionData {
    HeapData {
        opcode: Opcode,              // u16
        arg: Value,                  // u32
        heap: ir::Heap,              // u32
        imm: ir::immediates::Uimm32, // u32
    },
    // ...
}
```

This results in a 16-byte enum in total because the discriminant can be packed into the space preceding `Opcode`. This PR changes the `Uimm32` there to `Uimm64`, which inflates the size of the enum to 24 bytes:

```rust
struct Before(u8, u16, u32, u32, u32);
// .. vs ..
struct After(u8, u16, u32, u32, u64);
```

and there's a test asserting that this enum is 16 bytes in size.
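
The layout arithmetic is easy to check directly (a standalone snippet,
not Cranelift code):

```rust
use std::mem::size_of;

// Mirrors the field mix described above: discriminant/u8, u16 opcode,
// two u32 fields, then the immediate.
struct Before(u8, u16, u32, u32, u32); // 15 bytes of fields, 4-byte align
struct After(u8, u16, u32, u32, u64);  // the u64 forces 8-byte alignment

fn main() {
    assert_eq!(size_of::<Before>(), 16);
    assert_eq!(size_of::<After>(), 24);
}
```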

I'm having trouble figuring out how best to shrink this back to 16 bytes. Some things I've tried:

  • Technically `Opcode` can be removed here since only one opcode is
    ever used with `HeapAddr`. That doesn't free up enough space alone.
  • I'm not sure there's much of a use case for having lots of heap
    indices, so the `struct Heap(u32)` may be overkill. Changing that to
    `u8` would, I think, reduce the size of this enum back to 16 bytes.
    Changing it to `u16` and removing `Opcode` would also reduce the enum
    back to 16 bytes.
  • I was also looking at using `ir::Constant` instead of `Uimm64`, but
    that felt somewhat heavyweight for this use case because it seems
    like it would frequently allocate in the constant pool. There also
    didn't appear to be many helper functions going from a `Constant` to
    a `u64` (or some other integral type), so this felt like I was too
    far off the beaten path.

Do others have alternative ideas that could be implemented, or guidance on which of the above paths is the way to go here?

@bjorn3 (Contributor) commented Aug 6, 2021

`load`, `store`, etc. also use 32-bit offsets even though they can already be used with 64-bit pointers. In cg_clif, if the offset overflows the `Offset32`, I emit an `iadd_imm ptr, offset` before loading or storing. The same could be done for `heap_addr` to keep a 32-bit offset.

@cfallin (Member) commented Aug 6, 2021

I may be misunderstanding the use of that immediate, but it looks to me like the field is actually used to denote the size of the access, not of the underlying heap (link), on the `heap_addr` instruction (which appears to be the only instruction that uses that format). Given that individual memory accesses definitely don't access more than 4GiB-sized values, I think 32 bits is still OK here?

@alexcrichton (Member, Author)

Heh, unfortunately I think you've fallen into the same trap I've fallen into a number of times. The `heap_addr` instruction does indeed document its immediate as an access size, but that's not technically how the wasm backend uses it. When translating, assuming there's a guard page, we feign the access size as the fixed offset in the memory instruction. We don't actually access all those bytes, but for the purposes of validation it's sufficient.

Put another way: because wasm memory offsets can be 64 bits (with memory64), and because we use this offset as the "access size", that's the motivation for increasing the bit-width of the access size.

@cfallin (Member) commented Aug 6, 2021

Hmm, I see; unexpected but I guess that usage makes sense.

IMHO the best option here is probably a hybrid approach during translation: if the offset fits in 32 bits (which will hopefully be the case the overwhelming majority of the time), do as we presently do; if larger, then emit an explicit add instruction to compute addr+offset, as @bjorn3 suggests above. The reason I'm hesitant to just accept the 64-bit immediate and the corresponding `InstructionData` growth is that I think the latter will have a surprisingly large impact on compile-time performance; allocating the space only when needed (via another instruction) should be more efficient, even though it's less theoretically clean.

@alexcrichton (Member, Author)

I considered doing the `iadd_imm` trick, but the issue is that overflow needs to be detected (otherwise the 65-bit effective offset isn't correctly calculated). That would involve some sort of trap-if-overflow check, which means every memory access would have two branches instead of the one in this PR. I was hoping to avoid that because I'm assuming the perf hit is already quite large, and making it even larger could be worse.

I think it would work for the time being, but I wanted to explore possibilities for updating the Cranelift instruction. I think it makes sense not to inflate 16 bytes to 24 bytes, but I wasn't having much luck brainstorming ways to keep it under 16 bytes (unless we want to change heap indexes to `u8` and disallow 256+ heaps, which I'm not sure why we'd want anyway).

@cfallin (Member) commented Aug 6, 2021

It would only be two checks for offsets >4GiB, no? I'd expect that most Wasm producers use offsets for things like structs and it would be very strange to have an offset that large, so in my thinking at least, it makes sense to treat this as a "needed for correctness but can be slow" corner. I may be misunderstanding the issue though?

Re: heap count, I'd be hesitant about limiting to 256 heaps; we might bump our heads on that in the future multi-module / component world (e.g. heap per Wasm component, with 300 components in a very large dependency graph -- at least, that's not totally implausible).

@alexcrichton (Member, Author)

Hm, no, that's a good point. We already incur a real bounds check on wasm32 for 2GiB+ offsets, so having an extra bounds check on wasm64 for 4GiB+ offsets makes sense. I'll go ahead and implement that, doing an addition with an overflow check if the offset doesn't fit in a `u32`.
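
For the record, a hedged sketch of what that could look like when building
CLIF (using cranelift-frontend names; illustrative, not the exact code
that landed in code_translator.rs):

```rust
use cranelift_codegen::ir::{condcodes::IntCC, types, TrapCode, Value};
use cranelift_frontend::FunctionBuilder;

/// Fold an offset that doesn't fit in `heap_addr`'s 32-bit immediate into
/// the address itself, trapping if the 64-bit addition overflows.
fn add_large_offset(builder: &mut FunctionBuilder, addr: Value, offset: u64) -> Value {
    let off = builder.ins().iconst(types::I64, offset as i64);
    let sum = builder.ins().iadd(addr, off);
    // Unsigned overflow happened iff the sum wrapped below the base address.
    let overflowed = builder.ins().icmp(IntCC::UnsignedLessThan, sum, addr);
    builder.ins().trapnz(overflowed, TrapCode::HeapOutOfBounds);
    sum
}
```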

This commit updates the generation of addresses in wasm code to always
use 32-bit offsets for `heap_addr`; if the calculated offset is bigger
than 32 bits, we emit a manual add with an overflow check.
@alexcrichton (Member, Author)

For reviewers: extra care when reviewing cranelift/wasm/src/code_translator.rs is likely in order. That's some tricky code that's pretty meticulously set up. It has large effects on performance and is the linchpin of wasm heap correctness, so I just want to triple-check I'm right there.

@cfallin (Member) left a comment

👍 on the codegen side; thanks! Just two notes below where I think extra comments might help future explorers of these dark, twisty caves.

```diff
@@ -2222,25 +2218,65 @@ fn prepare_addr<FE: FuncEnvironment + ?Sized>(
     // offsets we're checking here are zero. This means that we'll hit the fast
     // path and emit zero conditional traps for bounds checks
     let adjusted_offset = if offset_guard_size == 0 {
-        u64::from(offset) + u64::from(access_size)
+        memarg.offset.saturating_add(u64::from(access_size))
```
@cfallin (Member):
Good catch (avoiding wraparound). Could you add a comment re: the saturation here to note this is protecting against offset + access-size overflowing?

// pessimized a fair amount. We can't pass the fixed-sized offset to
// the `heap_addr` instruction, so we need to fold the offset into the
// address itself. In doing so though we need to check for overflow
// because that would mean the address is out-of-bounds.
@cfallin (Member):
A comment here about why we decided not to expand `heap_addr` to take the `u64` would help document our discussion :-)

@alexcrichton (Member, Author) commented Aug 11, 2021

@peterhuene oh I realize now I tagged you here without much context. Would you be ok reviewing the Wasmtime-specific pieces of this? If not, no worries, I'm sure I can find another ~~unwilling victim~~ candidate.

@peterhuene (Member)

👍 I'll review it now.

@peterhuene (Member) left a comment

Looked over the Wasmtime and test changes (skipped the wast tests sourced from the test suite).

Looks great 👍 . Just some nit comment updates.


```diff
-/// Grow memory by the specified amount of wasm pages.
+/// Grow memory to the specified amount of bytes.
 ///
 /// Returns `None` if memory can't be grown by the specified amount
 /// of wasm pages.
```
@peterhuene (Member):
I think this comment is now out of date, as it's now taking a size rather than a page delta?

```diff
 ///
 /// Returns `None` if memory can't be grown by the specified amount
-/// of wasm pages.
-fn grow(&mut self, delta: u32) -> Option<u32>;
+/// of wasm pages. Returns `Some` if memory was grown successfully.
```
@peterhuene (Member):
Same comment as above re: mention of wasm pages.

@alexcrichton (Member, Author)

Thanks @peterhuene!

@alexcrichton alexcrichton merged commit e68aa99 into bytecodealliance:main Aug 12, 2021
@alexcrichton alexcrichton deleted the memory64 branch August 12, 2021 14:40
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Aug 12, 2021
alexcrichton added a commit that referenced this pull request Aug 12, 2021
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Aug 12, 2021
* Implement the memory64 proposal in Wasmtime
* Fix some tests
* More test fixes
* Fix wasmtime tests
* Fix doctests
* Revert to 32-bit immediate offsets in `heap_addr`
* Disable memory64 for spectest fuzzing
* Fix wrong offset being added to heap addr
* More comments!
* Clarify bytes/pages
slumber added a commit to paritytech/substrate that referenced this pull request Oct 11, 2021
Changes related to memory64 proposal implementation,
for additional details see bytecodealliance/wasmtime#3153
ghost pushed a commit to paritytech/substrate that referenced this pull request Oct 12, 2021
* sc-executor-wasmtime: upgrade wasmtime to 0.30.0

Changes related to memory64 proposal implementation,
for additional details see bytecodealliance/wasmtime#3153

* sc-executor-wasmtime: introduce parallel_compilation flag

* typos
kpreisser added a commit to kpreisser/wasmtime-dotnet that referenced this pull request Oct 23, 2022
…codealliance/wasmtime#3153, in order to support 64-bit memory.

This also fixes the definition of native functions wasmtime_memory_size and wasmtime_memory_grow, which previously incorrectly used a UInt32 instead of UInt64, and config functions taking a bool parameter, which were previously missing the MarshalAsAttribute (to marshal a Boolean as 1 byte instead of 4 bytes).
peterhuene pushed a commit to bytecodealliance/wasmtime-dotnet that referenced this pull request Oct 31, 2022
* Add support for enabling the Memory64 proposal, which was added to the Wasmtime C API with bytecodealliance/wasmtime#3182.

Additionally, add tests to verify the new APIs added with #166 are able to access a memory beyond the 4 GiB area.

* Follow-Up: Adjust the API to match the new C API introduced with bytecodealliance/wasmtime#3153, in order to support 64-bit memory.

This also fixes the definition of native functions wasmtime_memory_size and wasmtime_memory_grow, which previously incorrectly used a UInt32 instead of UInt64, and config functions taking a bool parameter, which were previously missing the MarshalAsAttribute (to marshal a Boolean as 1 byte instead of 4 bytes).

* Follow-Up: Validate the arguments.

* Simplify.

* Simplify.
lwshang added a commit to lwshang/wasmtime that referenced this pull request Sep 11, 2024
This commit implements the table64 extension in both Wasmtime and
Cranelift.

Most of the work was changing a bunch of u32 values to u64/usize.
The decisions were made in line with PR bytecodealliance#3153, which
implemented the memory64 proposal itself.

One significant change was the introduction of `IndexType` and
`Limits`, which streamline and unify the handling of limits for both
memories and tables.

The spec and fuzzing tests related to table64 are re-enabled, which
provides good coverage of the feature.
github-merge-queue bot pushed a commit that referenced this pull request Sep 11, 2024