Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partially out-of-bound writes #1490

Open
sparker-arm opened this issue Oct 13, 2023 · 62 comments
Open

Partially out-of-bound writes #1490

sparker-arm opened this issue Oct 13, 2023 · 62 comments

Comments

@sparker-arm
Copy link

The semantics of partially OOB writes was clarified in #1025. This has proved problematic for runtimes running on Arm, where there is no such guarantee in the architecture. I think all now browsers implement the spec properly, but I'm not sure if any of the wasm-specific non-browser runtimes do.

I could see that having deterministic memory contents is a nice property, but I'm not sure of what the real advantages are, could someone enlighten me?

And do you think there's any room to relax the spec in this regard, possibly even through an extension like relaxed-simd?

@conrad-watt
Copy link
Contributor

conrad-watt commented Oct 13, 2023

As one point, deterministic behaviour helps to generally ensure that different implementation decisions in different engines don't lead to observably divergent behaviours (even on a single underlying system). Presumably this would be relevant here when comparing engines implementing load/store with explicit bounds check vs guard pages.

How do the browsers using guard pages currently achieve conformant behaviour on Arm?

@sparker-arm
Copy link
Author

I'm not sure for Firefox: here's a lengthy thread about the issues they've found and it's not just Arm specific.

For V8, though I don't think I've actually enabled it yet for Chromium, we can rely on trap handlers for loads but have to perform manual bounds checks for stores.

deterministic behaviour helps to generally ensure that different implementation decisions in different engines don't lead to observably divergent behaviours

Sure, but I'm still struggling to understand in what situations it is useful, I don't understand how it helps once we trap. Can it help in some debug situations?

@conrad-watt
Copy link
Contributor

conrad-watt commented Oct 13, 2023

Sure, but I'm still struggling to understand in what situations it is useful, I don't understand how it helps once we trap. Can it help in some debug situations?

Sorry, I rushed to write the previous comment and didn't work through the reasoning properly. On the Web, it's possible to catch Wasm traps in JS and continue execution (and observe the state of the memory). We have a lot of previous experience with tiny edge-cases of the semantics being relied on by some Web program so there would be some inertia against introducing further non-determinism here. That being said, if browsers currently aren't (or even can't) implementing this conformantly (as the Firefox issue above suggests), that would be an argument that we could relax things, since existing Web programs couldn't be relying on conformant behaviour.

@sparker-arm
Copy link
Author

Okay, thanks.

From bug trackers though, it seems it only became 'a problem' once the necessary tests were added to the wasm spec test-suite. And I'd just like re-iterate that most (all) wasm runtimes only pass this test, on Arm hardware, because of microarchitecture details (luck!) so it's not just browsers who don't rely on conformant programs... it's basically everything. It seems it is both difficult to implement two policies and it also mean giving up ~10-20% execution performance when using explicit checks.

@conrad-watt
Copy link
Contributor

It seems it is both difficult to implement two policies and it also mean giving up ~10-20% execution performance when using explicit checks.

Yeah, if being truly conformant requires engines to totally give up on the guard page strategy, I would (personally) count that as "can't implement conformantly" and (personally) argue that the semantics should be relaxed.

@eqrion am I understanding correctly that today in "release" Firefox, it's possible to craft a program that (in some platform) can observably write bytes to memory in the case of a partially OOB write?

@rossberg
Copy link
Member

Even if we were to relax this, we would need to fix a behaviour in deterministic mode, which some platforms have a hard reliance on. Is the only solution for them to not use guard pages, or not use Arm?

@sparker-arm
Copy link
Author

sparker-arm commented Oct 13, 2023

I don't have an answer of how else to get around it... Maybe the penalty of explicit checks is worth the absolute determinism? Having to exclude an underlying hardware implementation doesn't sound very wasmy. Would you mind sending me some link(s) to the aforementioned platforms? I would really like to know about some concrete use-cases.

EDIT: On a single aarch64 Linux system, I found the execution time overhead of explicit checks, for stores only, to be ~5% across an arbitrary set of benchmarks.

@rossberg
Copy link
Member

One particular sensitive class of examples is Blockchain platforms running Wasm, because they rely on consensus between replicated distributed computations, possibly across heterogeneous hardware. That is only meaningful if all executions are guaranteed to be fully deterministic, including in cases where traps occur. There is bunch of these platforms, such as Dfinity, Parity, etc.

@sparker-arm
Copy link
Author

Thank you! I feared blockchain maybe the answer :)

So, sorry to get a little side-tracked, but I think I'm getting lost among the jargon... I see you are/were affiliated with Dfinity, so maybe you know for them, does their platform run on aarch64? And do you know whether most of these platforms are using wasmtime/cranelift? (IIRC, cranelift does not produce conforming code.)

I can see that Parity have their own interpreter, do you know if rolling-your-own interpreters/compilers is a popular choice?

@tschneidereit
Copy link
Member

Does the blockchain use case require consensus on failed executions? I'd have expected that a result is thrown out if it ends in a trap, and that that is effectively the way to get to deterministic outcomes for that case.

@titzer
Copy link

titzer commented Oct 13, 2023

@sparker-arm Can you clarify which arm architectures might be affected? Does this apply to modern processors and arm64, or is this limited to certain classes of arm processors?

@grubbymits
Copy link

Can you clarify which arm architectures might be affected

To be pedantic, no arm architecture provides the guaranteed semantics. It is implementation defined. The problems, generally, do not arise during testing because it's either running in an emulator or on a microarchitecture that does provide the semantics. For arm designed cores, it can be loosely described as the big cores match wasm but the small cores do not. I don't have an exhaustive list. Apple silicon doesn't suffer from this inconsistency.

@rossberg
Copy link
Member

@sparker-arm, afaik, Dfinity currently only runs on x64 nodes, but it is not intended to stay like that. For their use case at least, an interpreter would be too slow, because it's a 3rd gen blockchain that is meant to support full-featured on-chain applications.

@tschneidereit, consensus also applies to trapping executions, since they have to materialise as a rejected message at the call site (for distributed message calls), so are an integral and well-defined part of the execution model.

That said, I'm sure there are other use cases for deterministic execution that do not care about the failure case. Reproducible execution/builds could fall into that category (not sure).

@fitzgen
Copy link
Contributor

fitzgen commented Oct 13, 2023

it also mean giving up ~10-20% execution performance when using explicit checks.

FWIW, the overheads can be much greater than that, depending on the benchmark. I've measured ~1.55x slow downs in both Wasmtime and SpiderMonkey on real world programs (spidermonkey.wasm running a JS markdown renderer).

@cfallin
Copy link

cfallin commented Oct 13, 2023

@sparker-arm Is there any feature flag or other way to detect whether the underlying hardware does a partial write or not (other than, say, the runtime performing one and catching the signal at startup)? Is it at least guaranteed that the hardware behaves the same way consistently? (Would big.LITTLE throw a wrench into that perhaps?)

I'm also curious about other architectures: does anyone know offhand if e.g. RISC-V or other ISAs guarantee anything about partially-trapping writes?

@cfallin
Copy link

cfallin commented Oct 13, 2023

@sparker-arm one other thought: is it sufficient, architecturally, to do a load (and throw away the result) to a stored-to address before every store? We would then take the fault on the load instead, without side-effects. Or does one need a fence to ensure the store doesn't reorder in that case?

@sparker-arm
Copy link
Author

Would big.LITTLE throw a wrench into that perhaps?)

Yup! As my alter ego 'grubbymits' accidentally mentioned, I believe almost all configurations of big and small would be incompatible. I have a feeling there maybe a single big armv8 that behaves differently to the rest.

@sparker-arm
Copy link
Author

@sparker-arm one other thought: is it sufficient, architecturally, to do a load (and throw away the result) to a stored-to address before every store? We would then take the fault on the load instead, without side-effects. Or does one need a fence to ensure the store doesn't reorder in that case?

I don't know, I need to speak to one of our memory model experts. I suspect that it may not work without atomics.

@cfallin
Copy link

cfallin commented Oct 13, 2023

Thinking about the load-before-store idea a bit more: a page fault is a precise exception, in the sense that all side-effects of earlier instructions should occur, and none of later instructions should, correct? (The concern here is just about precision within the sub-operations of a single instruction's write.) In other words, if I have a load that faults on page A, and then later in the program, a store to address B, the architecture should unambiguously state that the store to B does not occur (or may occur speculatively, but we're only concerned with final architectural state here)? Otherwise, you couldn't restart after a page fault, because A and B might alias.

@sparker-arm
Copy link
Author

Even if this does work, I don't really think we should be coming up with hacks to get WebAssembly to run well on a very prominent architecture. It seems clear that the design decision was made without having enough information, and it appears that most runtimes continue to ignore the spec, so maybe the spec should change...

From the crypto use-case, it seems they all provide an SDK so they can fully control the target wasm feature set, and so they can disable anything that could introduce in-determinism. If this is the main case for fully deterministic memory, I wonder if we could have this as an opt-in feature?

@rossberg
Copy link
Member

@sparker-arm, deterministic mode (introduced with the Relaxed SIMD proposal, which is similarly messy) is what I was getting at. But we'll still need to specify some concrete behaviour for that. I am trying to understand if this mode would even be implementable on Arm without severe performance hits, and what the cheapest semantics would be for that.

@sparker-arm
Copy link
Author

Well, is there any way that we can implement alignment guarantees in WebAssembly? (I'm not sure I understand the utility of alignment hints.)

@akirilov-arm
Copy link

@cfallin

Is there any feature flag or other way to detect whether the underlying hardware does a partial write or not (other than, say, the runtime performing one and catching the signal at startup)? Is it at least guaranteed that the hardware behaves the same way consistently?

Sadly, the answer to all your questions is no. And yes, architecturally speaking there is no requirement that a particular implementation/microarchitecture/CPU behaves consistently, and that is even if we ignore the existence of technologies such as big.LITTLE (e.g. a single-core system).

@cfallin
Copy link

cfallin commented Oct 16, 2023

Even if this does work, I don't really think we should be coming up with hacks to get WebAssembly to run well on a very prominent architecture. It seems clear that the design decision was made without having enough information, and it appears that most runtimes continue to ignore the spec, so maybe the spec should change...

It'd still be useful from my perspective at least to have a definitive answer on the load-before-store idea: if the spec doesn't change, then this is the only reasonable option I see to keep competitive Wasm performance on aarch64, unless I'm missing some other option!

@akirilov-arm
Copy link

Also, to be precise, on AArch64 the behaviour is not implementation-defined in the sense that an implementation is supposed to choose a specific behaviour and to stick to it - architecturally the contents of the in-bound portion of the memory affected by a store become unknown (so in theory they could become zero, for instance, even if the value to be stored is non-zero).

@sparker-arm
Copy link
Author

consensus also applies to trapping executions, since they have to materialise as a rejected message at the call site

@rossberg I'm still trying to understand the mechanics for consensus, do the linear memory of the instances have to be compared?

@rossberg
Copy link
Member

rossberg commented Oct 17, 2023

@sparker-arm, sort of. In the abstract, each node computes some form of cryptographic hash of the state changes it has performed, including the memory contents. Virtual memory techniques to identify dirtied pages at the end of execution make that practical and fairly efficient even for large memories, because the number of touched pages is bounded by the gas limit.

That said, we shouldn't narrow the consideration to the obscure needs of blockchains. Portable, deterministic execution and the absence of undefined behaviour (even where non-determinism is allowed) has been one of the original design goals and a big selling point of Wasm. We somehow need to maintain it. Demoting determinism to an opt-in mode already is a compromise that not everybody was happy with.

@sparker-arm
Copy link
Author

Okay, so I'm now trying to reason what is different about this case of non-determinism compared to the existing standard, and incoming extensions. Given that wasm can produce non-deterministic NaN values, this case doesn't feel wildly different to me.

I see that NaN values are normalised with instrumentation in the Dfinity SDK/runtime? Could partial writes be handled in the same way? Maybe even by breaking up stores into byte stores? It just seems there's existing precedence for working around the limitations for runtimes that need it.

@rossberg
Copy link
Member

rossberg commented Oct 17, 2023

It's in fact Wasmtime that does the NaN normalisation, it has a configuration option for that. That roughly corresponds to the idea of having a deterministic mode, which this engine (and possibly others) supports.

It would be okay-ish to only have deterministic behaviour of OOB writes in deterministic mode. But even then we need to resolve two questions:

  1. In "performance" mode, the behaviour must still be well-defined. In the worst case, this definition could amount to complete non-determinism (could write any value whatsoever), but preferably, we should narrow it down more than that.

  2. For deterministic mode we must specify a fixed behaviour. This behaviour should be implementable without excessive overhead and without burdening engines significantly. Breaking up all stores into individual bytes does not qualify, AFAIAC.

@sparker-arm
Copy link
Author

I guess it's mostly a theoretical ISA spec issue, unless there are small cores that do actually do this?

All little cores, AFAIK. The big cores from our Sophia design team would also break wasm, which I believe is the Cortex-A73 and A75.

disallow any access to the state of memory after a trap; but that's far too useful to give up IMHO

Just wondering whether you have use for it outside of debug? I would definitely like clarification of what the state of the store should be when a program is reduced to a single trap.

@rossberg
Copy link
Member

rossberg commented Oct 18, 2023

@sparker-arm, not sure. It would be a value that was neither in the memory before the write nor was written to it. Although thin-air semantics typically means arbitrary value, while this would be a fixed one.

@cfallin, (i) does not answer the question what to do in deterministic mode, (ii) would be a severe breaking change -- it is completely legal today to reenter Wasm and access memory after traps, and it appears highly unlikely that no existing web page or application does that in the case of OOB. It would also have quite non-trivial consequences for the relaxed memory model, similar to "shrinking" a live memory.

@sparker-arm
Copy link
Author

it is completely legal today to reenter Wasm and access memory after traps, and it appears highly unlikely that no existing web page or application does that in the case of OOB.

But is the state of memory, after a trap, actually well-defined by the spec? This is what I've found:

A trap results if any of the accessed memory bytes lies outside the address range implied by the memory’s current size.

The execution of an instruction may trap, in which case the entire computation is aborted and no further modifications to the store are performed by it. (Other computations can still be initiated afterwards.)

The trap instruction represents the occurrence of a trap. Traps are bubbled up through nested instruction sequences, ultimately reducing the entire program to a single trap instruction, signalling abrupt termination.

This doesn't sound like the memory can be inspected to figure out what the program did before the trap executed.

@rossberg
Copy link
Member

But is the state of memory, after a trap, actually well-defined by the spec?

Yes, absolutely. The (non-normative) bits you quote just give a general intuition that a trap aborts the current computations, including any effect the current instruction might otherwise have.

The normative specification for executing stores is here. In particular, the bounds check is performed and results in a trap (step 12a) before any modification of the state is performed (step 15).

(This spec becomes much more complicated with the relaxed memory model in Wasm 3+, but the result is the same as fas as single-threaded memory is concerned. For shared memory, it also specifies that no mutation can be observed by concurrent reads, and that OOB checks are implicitly atomic.)

@sparker-arm
Copy link
Author

Thanks, but it's the notion of 'ultimately reducing the entire program to a single trap instruction', which I'm struggling with. Surely a program which is reduced to a single trap shouldn't have any affect on memory at all?

@rossberg
Copy link
Member

Ah, yes, I can see how that is confusing. "Program" is meant here in a more formalistic sense, being the current sequence of instructions executed (since entering Wasm), which is "reduced" to a result per a small-step reduction semantics. It doesn't mean that any actual code disappears, that still exists in the function store of the abstract machine and can be reinvoked.

@tlively
Copy link
Member

tlively commented Oct 18, 2023

Allowing partially OOB writes to nondeterministically write any value to the in-bounds region, then tightening that up to say they always write zeroes to the in-bounds region under the deterministic profile sounds ok to me.

As an alternative for the deterministic profile, would it be feasible to say that the in-bounds portion of the write will succeed? On the trapping path, the write would essentially have to be repeated byte-by-byte for the in-bounds region, but at least there's no need to know what the previous value in memory was under that scheme.

@rossberg
Copy link
Member

@tlively, I think the problem is that the trap handler has no access to the value that was attempted to be written.

@tlively
Copy link
Member

tlively commented Oct 18, 2023

I was assuming that the written value would be easier to provide to the trap handler than the previous in-memory value, but maybe it's still too complicated / slow to arrange for that.

@cfallin
Copy link

cfallin commented Oct 18, 2023

At least in Wasmtime, it's conceivable that we could include enough information in the trap metadata to replay the store byte-by-byte without having to include a full instruction disassembler in the segfault handler, starting from the register state. That would allow a "zero-cost" approach (in the same sense as for exceptions) where there's no action to save the store data on the common path. I think we'd still prefer the fixed-value (zeroes) approach to avoid that complexity if it's palatable, though I do understand that "in-bound part of the store succeeds" is much cleaner semantics.

@rossberg
Copy link
Member

@cfallin, yeah, I can believe that there is a plausible implementation strategy for some engines on some hardware platforms, but as long as it can't easily be done on all relevant hardware platforms, it would seem unwise to prescribe this semantics.

@titzer
Copy link

titzer commented Oct 18, 2023

@cfallin AFAICT It's possible to do this without metadata by decoding the faulting machine store instruction and getting the supplied value from either an immediate or a register (maybe memory, if it's a RMW, but that probably won't work if its a RMW of the same memory location, since per discussion thread the contents of the memory location are now unspecified on arm). Also, given that many different kind of store instructions could trap, the instruction decoder for that could be quite unwieldy.

Another option is to use metadata to encode the original (bytecode) PC and then replay the bytecode in an interpreter or JIT a stub. Those are probably out of wasmtime's design criteria, though. Although, given that there is probably a finite number of combinations of stores + regs, precompiling all such stubs might be possible.

@sparker-arm
Copy link
Author

given that many different kind of store instructions could trap, the instruction decoder for that could be quite unwieldy.

Indeed.

Another option is to use metadata to encode the original (bytecode) PC and then replay the bytecode in an interpreter or JIT a stub.

I'm not sure how this would work, without still having to decode the instruction to figure out what data needs to be stored where?

Storing zeros removes the need to decide out what to store, but we'd still need to decode to find out the start address of the unaligned access. (Though this is trivial for half-word stores).

To avoid any decoding, would it be feasible to generate a side-table to describe each machine-level store?

@sparker-arm
Copy link
Author

It also looks like partial writes are a potential problem for POWER too.

From Power ISA Version 3.1B, Book II, Chapter 2

Any other Load or Store instruction may be partially
executed and then aborted after having accessed a
portion of the storage operand..

When an instruction is aborted after being partially
executed, the contents of the instruction pointer
indicate that the instruction has not been executed,
however, the contents of some registers may have
been altered and some bytes within the storage
operand may have been accessed.

@titzer
Copy link

titzer commented Oct 19, 2023

@sparker-arm

I'm not sure how this would work, without still having to decode the instruction to figure out what data needs to be stored where?

Right, I meant that there'd be enough metadata to go back to executing the instruction in the interpreter, which means both the bytecode PC and the bytecode's inputs (value stack operands), which constitute the address and value stored. If that were compressed, it basically amounts to a sidetable as you describe.

Thanks for the additional context for POWER. The upshot of that seems to be basically the same as ARM.

cfallin added a commit to cfallin/wasmtime that referenced this issue Mar 22, 2024
…ng hardware.

As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some
instruction-set architectures do not guarantee that a store that
"partially traps" (overlaps multiple pages, only one of which disallows
the store) does not also have some side-effects. In particular, the part
of the store that *is* legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly
heap implementations: when a store is partially out-of-bounds, it should
trap (there is no "partially" here: if the last byte is out of bounds,
the store is out of bounds). A trapping store should not alter the final
memory state, which is observable by the outside world after the trap.
Yet, the simple lowering of a Wasm store to a machine store instruction
could violate this expectation, in the presence of "store tearing" as
described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and
we have observed it in tests on RISC-V.

This PR implements the idea first proposed [here], namely to prepend a
load of the same size to every store. The idea is that if the store will
trap, the load will as well; and precise exception handling, a
well-established principle in all modern ISAs, guarantees that
instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the
impact of this idea and to allow for precise trap execution semantics on
affected machines unless/until the spec is clarified.

On an Apple M2 Max machine (aarch64), this was measured to have a 2%
performance impact when running `spidermonkey.wasm` with a simple
recursive Fibonacci program. It can be used via the
`-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.

[here]:
WebAssembly/design#1490 (comment)
cfallin added a commit to cfallin/wasmtime that referenced this issue Mar 22, 2024
…ng hardware.

As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some
instruction-set architectures do not guarantee that a store that
"partially traps" (overlaps multiple pages, only one of which disallows
the store) does not also have some side-effects. In particular, the part
of the store that *is* legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly
heap implementations: when a store is partially out-of-bounds, it should
trap (there is no "partially" here: if the last byte is out of bounds,
the store is out of bounds). A trapping store should not alter the final
memory state, which is observable by the outside world after the trap.
Yet, the simple lowering of a Wasm store to a machine store instruction
could violate this expectation, in the presence of "store tearing" as
described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and
we have observed it in tests on RISC-V.

This PR implements the idea first proposed [here], namely to prepend a
load of the same size to every store. The idea is that if the store will
trap, the load will as well; and precise exception handling, a
well-established principle in all modern ISAs, guarantees that
instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the
impact of this idea and to allow for precise trap execution semantics on
affected machines unless/until the spec is clarified.

On an Apple M2 Max machine (aarch64), this was measured to have a 2%
performance impact when running `spidermonkey.wasm` with a simple
recursive Fibonacci program. It can be used via the
`-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.

[here]:
WebAssembly/design#1490 (comment)
@cfallin
Copy link

cfallin commented Mar 22, 2024

I've gone ahead and implemented the load-before-store idea in Wasmtime and measured a 2% perf impact on Apple aarch64 hardware (which doesn't see torn stores, but is what I have at hand). It would still be useful to continue discussing the spec change options above, but if it remains as-is, it seems we could implement "precise post-trap memory state" for those who need it with reasonable-ish overhead.

cfallin added a commit to cfallin/wasmtime that referenced this issue Mar 22, 2024
…ng hardware.

As discussed at in bytecodealliance#7237 and WebAssembly/design#1490, some
instruction-set architectures do not guarantee that a store that
"partially traps" (overlaps multiple pages, only one of which disallows
the store) does not also have some side-effects. In particular, the part
of the store that *is* legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly
heap implementations: when a store is partially out-of-bounds, it should
trap (there is no "partially" here: if the last byte is out of bounds,
the store is out of bounds). A trapping store should not alter the final
memory state, which is observable by the outside world after the trap.
Yet, the simple lowering of a Wasm store to a machine store instruction
could violate this expectation, in the presence of "store tearing" as
described above.

Neither ARMv8 (aarch64) nor RISC-V guarantee lack of store-tearing, and
we have observed it in tests on RISC-V.

This PR implements the idea first proposed [here], namely to prepend a
load of the same size to every store. The idea is that if the store will
trap, the load will as well; and precise exception handling, a
well-established principle in all modern ISAs, guarantees that
instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the
impact of this idea and to allow for precise trap execution semantics on
affected machines unless/until the spec is clarified.

On an Apple M2 Max machine (aarch64), this was measured to have a 2%
performance impact when running `spidermonkey.wasm` with a simple
recursive Fibonacci program. It can be used via the
`-Ccranelift-ensure_precise_store_traps=true` flag to Wasmtime.

[here]:
WebAssembly/design#1490 (comment)
@ebeasant-arm
Copy link

Hi all: I've been following this one for a while now, and it seems to me to be a real challenge!

I'm trying to understand the rationale for mandating the behaviour in the WebAssembly specification came about - regardless on whether there are implementation that depend on the behaviour.
What follows is some observations, questions and some reasoning - seeking understanding, clarity and little more depth, if possible!

Partially OOB store behaviour - introduction of the required behaviour to the spec:

Following up how the mandated behaviour was introduced to the specification, I'm not sure of the history, but it feels like there is a case of 'Inception' going on:

WebAssembly/spec#438 , WebAssembly/spec#439 , #1025

These seem to form a sort of loop:
The bug in the tests raised the change which then caused the 'uncontroversial clarification', which then ended up in the specification.

However, I would hazard that that the uncontroversial clarification is anything but uncontroversial, given the amount of debate this has generated, and that at least three target architectures cannot efficiently implement it?

Architecture vs. implementation & probabilistic behaviours

In terms of the (specifically) architectural specs in question, the behaviour in point is architecturally permitted - Whether the behaviour occurs or not isn't just about the particular implementation of the Architecture: The behaviour that is problematic is also likely to be probabilistic dependent on current system state.
I'm not sure it's reliable to assert that anything (at least Arm based) either does or does not behave in a given way - more that it just hasn't been observed to behave in that way, yet.

I also believe that Architecturally, x86 has some complicated cases in which it may be permitted to have similar behaviour (unaligned writes are not guaranteed atomicity, for example) - this is something that we probably need chapter and verse on from Book 1 or 3 of the x86 spec...

Architectural litmus test tools like Hurd7 help to check for permitted behaviours across all the mentioned architectures - Notably, however, this particular case is also a challenge to represent in Hurd7 syntax at this moment in time, though as I understand it, it is being worked on. This could be potentially used to check the permitted an actual (it also generates sample code for execution on targets) behaviours across the architectures - but note the probabilistic effect is still also in play for on target tests.

Ways forward?

Given the apparent history of the introduction of this behaviour to the specification, and the impact on performance that other schemes for approaching it have been demonstrated to have:

Is it not reasonable on the balance of the design principals of WebAssembly to revert this tightening of requirements in the face of mounting evidence contrary to the assertion that introduced it? (convoluted sentence there, but the problem is convoluted!)

If the answer is that 'it can't be reverted because all use cases rely on the behaviour', then perhaps It does belong.
If it is 'is can't be reverted because some use cases rely on it', maybe profiles are the answer?
If it is 'it can't be reverted because its part of the published specification' then that's probably a question for the governance process?

Challenges, and some questions:

There are evidently issues when dealing with uses of WebAssembly (Some Cryptocurrency/Smart Contracts) that require deterministic behaviours to remain externally visible after a trap has occurred.

As I understand it The WebAssembly core specification defines the behaviour from the design point of view of internal consistency and determinism. At the point of trap, this viewpoint is no longer present, and the rules that apply to it cease to apply - hence it is valid to suggest a 'fix up' of the state for a failed trap on partially out of bounds, but this solution raises the question for me:

The requirement for deterministic state in the case of partially OOB writes is from that same external frame of reference, not internal.
So does it belong in the core specification? Is it in fact a form of interface contract instead?

Hoping that some of this leads to clarity and a good way forwards, many thanks to all on here for the discussion and hard work!

@rossberg
Copy link
Member

rossberg commented Jun 7, 2024

@ebeasant-arm:

As I understand it The WebAssembly core specification defines the behaviour from the design point of view of internal consistency and determinism. At the point of trap, this viewpoint is no longer present, and the rules that apply to it cease to apply - hence it is valid to suggest a 'fix up' of the state for a failed trap on partially out of bounds

I may be misreading what you mean here, but to clarify: the core spec specifies the observable behaviour of external invocations of Wasm functions in a given "store". And it specifies and guarantees precisely the state that a Wasm store is in after a trap — a trap only aborts the current computation, it does not invalidate the store it ran in.

That means that future external invocations of Wasm functions in the same context that previously trapped still have well-defined behaviour. This is absolutely crucial. Currently, this also is deterministic wrt to partially OOB traps. That could be relaxed, but at least in the deterministic profile it definitely needs to remain so.

@sparker-arm
Copy link
Author

That could be relaxed, but at least in the deterministic profile it definitely needs to remain so.

Do you have a view on how profiles would be implemented? Could this be a runtime parameter that is passed during module instantiation?

@rossberg
Copy link
Member

@sparker-arm, profiles are not chosen or configured dynamically or on a per-application basis. Profiles are a means to support the diversity of (disjoint) ecosystems that want to include Wasm but cannot support certain features. As such, a profile is selected globally by an ecosystem that supports Wasm. For example, a Wasm infrastructure for embedded devices might not be able to support GC or SIMD. Ecosystems like blockchains running Wasm code need determinism. All applications build for such an ecosystem are then bound by this system-wide choice of profile.

@sparker-arm
Copy link
Author

Right, sorry, I only now realise that you have a proposal for them! I will take a look.

@sparker-arm
Copy link
Author

@rossberg So, what's the status of the proposal? I see you have a PR open to add the deterministic profile.

Is the suggestion here to add a deterministic marker to the currently defined semantics of OOB stores?

@rossberg
Copy link
Member

The deterministic profile already was integrated into the relaxed SIMD proposal.

As for the OOB issue at hand, at least in the deterministic profile, the semantics has to remain as strict as it is right now. There has not been a resolution whether to relax it in general.

@sparker-arm
Copy link
Author

Okay, thanks for the clarification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests