Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zero length bulk memory behavior incompatible with LLVM intrinsics #1482

Closed
tlively opened this issue Jul 10, 2023 · 32 comments
Closed

zero length bulk memory behavior incompatible with LLVM intrinsics #1482

tlively opened this issue Jul 10, 2023 · 32 comments

Comments

@tlively
Copy link
Member

tlively commented Jul 10, 2023

It was raised in llvm/llvm-project#63755 that LLVM's memcpy and memset intrinsics are specified such that a zero-length access is a no-op, whereas memory.copy and memory.fill are currently specified such that they will trap if offset + length > memory size, i.e. they can trap for zero-length accesses beyond the end of the memory. This means that to correctly lower LLVM's intrinsics under the current semantics, we would in general have to emit a length check and only execute memory.copy or memory.set if the length is not zero.

It would be nice to be able to emit memory.copy and memory.fill without the surrounding length check, as we (incorrectly) do today. One way to achieve that would be to relax the WebAssembly semantics so that these instructions never trap when the length is zero, just like the LLVM intrinsics. Would there be any appetite for making that change?

@andrewrk
Copy link

andrewrk commented Jul 10, 2023

Speaking for the Zig project, which by the way uses webassembly to build from source, the specification status quo is problematic for our code generation backend as well for the exact same reasons.

@dschuff
Copy link
Member

dschuff commented Jul 10, 2023

Aside from just the LLVM langref, this would also seem to require an implementation of memcpy/memmove to have similar checks. The C standard doesn't directly specify the case where the length is 0, but the obvious interpretation of "copy 0 characters" would be a no-op. I think this would be worthwhile, as it would just require engines/the spec to reverse steps 12 and 13 of the exec semantics for memory.copy and memory.fill, and would be backward compatible (i.e. allowing a program that we previously trapped/rejected).

@eqrion
Copy link

eqrion commented Jul 10, 2023

Do we happen to know the rationale for why this route was chosen? We had a discussion about this internally recently for unrelated reasons, and I couldn't remember why we did it this way. The usual suspect here is threading, but I don't think that was the case for this one.

@andrewrk
Copy link

@conrad-watt
Copy link
Contributor

Previous discussion here WebAssembly/bulk-memory-operations#124 (nothing to do with threads, we just thought at the time that this would be a minor simplification for codegen).

@tlively
Copy link
Member Author

tlively commented Jul 11, 2023

Looking back at those conversations, I believe that if we had realized at the time that the simplifications we were considering would have required us to emit a user space length check, we would not have made those simplifications. @rossberg, you were heavily involved in those discussions. Do you agree? What do you think of relaxing the semantics now?

@rossberg
Copy link
Member

rossberg commented Jul 11, 2023

My recollection is that the final discussion and decision happened at a face-to-face meeting.

If I recall correctly, I initially was in favour of always allowing the zero case. But then we decided to require upfront OOB checks (IIRC, to avoid partial writes, and for well-behaved interaction with shared memory), and under that semantics, making a special case for zero actually requires more instructions to codegen and to execute in most cases, not fewer.

I believe we discussed LLVM at the meeting, but I don't recall what the arguments were. Can the LLVM intrinsic guarantee a behaviour that is somehow equivalent to OOB being checked upfront? The linked LLVM specification says nothing, so I doubt that's the case with the usual virtual memory techniques for trapping on OOB. Which means you need to generate an explicit OOB check anyway, and skipping it for length zero is not a simplification. Or am I missing something?

PS: I would also argue that an impedance mismatch with LLVM alone is not sufficient reason to make a change. Since the current semantics ultimately requires less code, there would have to be more evidence of consistent problems across various consumers to go with something that's worse in principle.

@tlively
Copy link
Member Author

tlively commented Jul 11, 2023

Yes, I guess having the semantics do a bounds check except in the case of zero length would mean the engine has to emit the same length check we are trying to avoid having to emit in user space, so the only benefit of making this change to the semantics would be tiny code size savings, assuming the engine can't make the length check any faster than it can be in user space.

@Luukdegram
Copy link

Which means you need to generate an explicit OOB check anyway, and skipping it for length zero is not a simplification. Or am I missing something?

I believe this is a good point. This would mean that regardless of a change in the WebAssembly specification, LLVM would still (although currently missing) emit a check for zero length to comply with its specified semantics.

PS: I would also argue that an impedance mismatch with LLVM alone is not sufficient reason to make a change. Since the current semantics ultimately requires less code, there would have to be more evidence of consistent problems across various consumers to go with something that's worse in principle.

While I agree LLVM alone is not a sufficient reason to make a change, this affects any compiler/code-generation tool wanting to generate for the WebAssembly target (For languages where it's non-illegal to perform a memcpy OOB with a zero-length). For example, Zig also has its own WebAssembly backend besides its LLVM backend. It currently requires to implement this same check there, as LLVM has to. I believe Odin has the same semantics. So while it's true that changing the semantics wouldn't be a simplification with regards to LLVM, it would be for any other toolchain wishing to target WebAssembly that holds true to the non-trapping semantics for zero-length OOB operations.

@rossberg
Copy link
Member

My understanding is that it would only be a simplification for backends that happen to exactly match the Wasm semantics, both in terms of the zero case and the upfront OOB check. For all others, making a special case for zero length would in fact not only not be a simplification, it would be an extra complication.

@nikic
Copy link

nikic commented Jul 11, 2023

My understanding is that it would only be a simplification for backends that happen to exactly match the Wasm semantics, both in terms of the zero case and the upfront OOB check. For all others, making a special case for zero length would in fact not only not be a simplification, it would be an extra complication.

I don't really understand why there is a need to match semantics to the upfront OOB check. To clarify, we're talking about lowering llvm.memcpy to memory.copy, not about lowering memory.copy to llvm.memcpy. The latter obviously can't work without additional checks. The former works apart from the zero case -- the non-zero OOB case would be UB, so lowering that to a trap is fine.

(llvm.memcpy being the LLVM-specific intrinsic here, but as noted above, other compilers also specify this operation as accepting invalid pointers for zero-length copies.)

@andrewrk
Copy link

assuming the engine can't make the length check any faster than it can be in user space.

This is a load-bearing assumption. There could be many reasons that a runtime does not actually need to do a length zero check, or even a trap check on the pointer. Here is one, for example:
https://github.com/ziglang/zig/blob/6bc9c4f716afbb15465f79f84f61221f394fa5b6/stage1/wasm2c.c#L2080-L2106

Having the check in userspace prevents the possibility of such optimizations by runtimes, because even if they were able to omit it, the userspace code would pay the cost.

Semantically, a trap happens to prevent a read or write from an invalid memory address. A zero-length operation does not actually read or write from an invalid memory address. We all understand this intuitively. The only reason to not codify it into the spec would be for some indisputable, pragmatic reason, in which we are willing to compromise the null hypothesis in order to satisfy some real world use case. However, in this thread you have 3 examples, including LLVM, where pragmatism actually points the other direction.

@conrad-watt
Copy link
Contributor

conrad-watt commented Jul 11, 2023

The only reason to not codify it into the spec would be for some indisputable, pragmatic reason

I think the original reasoning for the current semantics is clear from our previous discussions - the Wasm-level check requires fewer machine instructions. So if most toolchains were happily able to target memory.copy without additional guards, the current semantics would seem preferable.

It's worth noting that, as discussed in WebAssembly/bulk-memory-operations#145, the C standard says that zero-length memcpy is still undefined behaviour if a provided pointer is invalid, so trapping in this case (i.e. direct use of Wasm memory.copy) is permitted. One problem that's become apparent in this thread is that LLVM has a more defined semantics for zero-length memcpy (via LLVM's intrinsic) and so there ends up being a mismatch going from C(-like) -> LLVM -> Wasm. FWIW I'd be fine with changing Wasm's memory.copy, but there were pragmatic reasons for the current semantics.

@tlively
Copy link
Member Author

tlively commented Jul 11, 2023

Having the check in userspace prevents the possibility of such optimizations by runtimes, because even if they were able to omit it, the userspace code would pay the cost.

Conversely, having the check in userspace allows it to be optimized out by frontends or Binaryen without engines having to do anything.

@andrewrk
Copy link

Can you point to any real world runtimes where the check would be needed? I've already pointed to one that does not.

It's also easy to imagine the opposite; the requirement to trap requires an extra check on the pointer validity where an early return on len==0 would be preferred by the runtime.

the Wasm-level check requires fewer machine instructions.

huh? no, it requires more. example: ziglang/zig#16345

@conrad-watt
Copy link
Contributor

conrad-watt commented Jul 12, 2023

the Wasm-level check requires fewer machine instructions.

huh? no, it requires more. example: ziglang/zig#16345

When I say "fewer machine instructions", I'm talking about compiling Wasm memory.copy to machine instructions. The PR you've linked appears to be about compiling Zig to Wasm via LLVM?

Can you point to any real world runtimes where the check would be needed?

It's also easy to imagine the opposite; the requirement to trap requires an extra check on the pointer validity where an early return on len==0 would be preferred by the runtime.

The point is that the current semantics falls out naturally from doing the single check if offset+len > segment.len, then trap, without an additional initial check for if len = 0, then return. I think this issue makes a reasonable argument for adding this extra check (which may well be optimised away by the Wasm engine anyway), I'm just trying to make sure that the status quo is clearly understood.

To be clear, changing the current semantics would involve adding a check in the machine code generated from Wasm memory.copy. I understand that the argument in this thread is that most Wasm producers end up inserting this extra check programmatically anyway.

@andrewrk
Copy link

andrewrk commented Jul 12, 2023

Here is a real world example:

https://github.com/bytecodealliance/wasmtime/blob/42919fa15ac5dbb0b42d1979e8d44e6b3bd066d4/crates/runtime/src/instance.rs#L822-L860

I agree, this specification change would require adding a check to this particular runtime.

Let's tally up the real world examples provided so far:

Producers that would save an extra check: 3 (LLVM, Zig, Odin)
Producers that would not save an extra check: 0

Runtimes that would need the extra check: 1 (wasmtime)
Runtimes that would not need the extra check: 1 (zig's wasm2c)

My other point is that this optimization...

--- a/crates/runtime/src/instance.rs
+++ b/crates/runtime/src/instance.rs
@@ -844,6 +844,7 @@ impl Instance {
         val: u8,
         len: u64,
     ) -> Result<(), Trap> {
+        if len == 0 return;
         let memory = self.get_memory(memory_index);
         let dst = self.validate_inbounds(memory.current_length(), dst, len)?;

...is not allowed by the spec since it is required to trap.

@tlively
Copy link
Member Author

tlively commented Jul 13, 2023

@andrewrk, IIUC, zig's wasm2c implementation does not match the current spec because it can do a partial write before going OOB, and that partial write is observable after the trap. Spec compliance would require it to emit an OOB check before the memmove/memset/etc, and if we made the change under discussion here, that OOB check would also have to add a special case for length 0, just like wasmtime's implementation.

@conrad-watt
Copy link
Contributor

conrad-watt commented Jul 13, 2023

Let's tally up the real world examples provided so far:

Producers that would save an extra check: 3 (LLVM, Zig, Odin)
...
Runtimes that would need the extra check: 1 (wasmtime)

I'm not really happy with the above argumentation - every reasonable machine-code-producing runtime would need an extra check with the proposed semantics (e.g. V8, Spidermonkey, Webkit, etc), and every producer using LLVM needs an extra check with the current semantics (e.g. Wasm backends for C/C++, Zig, Odin, etc).

To reiterate, I think we can reasonably consider changing the semantics, if the trade-offs make sense. I don't think a raw tally of "machine-code-producing runtimes" vs "LLVM-based Wasm producers" is the right signal, however.

EDIT: a much stronger signal for me would be "do programmatic zero-length checks (inserted by Wasm producers instead of being part of the runtime) give rise to meaningfully different Wasm binary sizes, runtime performance, or optimisation opportunities?"

My other point is that this optimization...

This is potentially a pessimisation for small non-zero-length copies, although I'd expect that any difference between the two semantics discussed here would be "in the noise" for most end-to-end benchmarks.

@andrewrk
Copy link

andrewrk commented Jul 13, 2023

EDIT: a much stronger signal for me would be "do programmatic zero-length checks (inserted by Wasm producers instead of being part of the runtime) give rise to meaningfully different Wasm binary sizes, runtime performance, or optimisation opportunities?"

I can provide one such data point:

Our 2.4 MiB wasm executable for bootstrapping the zig compiler gained 3.6 KB of file size from adding this check. I would consider this negligible.

As for performance, here is a comparison of how long it takes to compile the output of wasm2c with gcc for the two versions of this wasm file:

Benchmark 1 (3 runs): gcc ... -c zig1.no-memcpy-check.wasm.c
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           156s  ± 8.10s      151s  …  166s           0 ( 0%)        0%
  peak_rss           2.97GB ±  331KB    2.97GB … 2.97GB          0 ( 0%)        0%
  cpu_cycles          634G  ± 2.69G      631G  …  636G           0 ( 0%)        0%
  instructions       1.08T  ±  220M     1.08T  … 1.08T           0 ( 0%)        0%
  cache_references   26.2G  ±  235M     26.0G  … 26.4G           0 ( 0%)        0%
  cache_misses       3.97G  ± 57.3M     3.91G  … 4.01G           0 ( 0%)        0%
  branch_misses      3.31G  ± 2.18M     3.31G  … 3.31G           0 ( 0%)        0%
Benchmark 2 (3 runs): gcc ... -c zig1.with-len0-check.wasm.c
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           158s  ± 9.50s      152s  …  169s           0 ( 0%)          +  1.3% ± 12.8%
  peak_rss           2.97GB ±  414KB    2.97GB … 2.97GB          0 ( 0%)          -  0.1% ±  0.0%
  cpu_cycles          632G  ±  822M      631G  …  633G           0 ( 0%)          -  0.3% ±  0.7%
  instructions       1.08T  ±  120M     1.08T  … 1.08T           0 ( 0%)          -  0.2% ±  0.0%
  cache_references   26.2G  ±  105M     26.1G  … 26.4G           0 ( 0%)          -  0.0% ±  1.6%
  cache_misses       3.94G  ± 68.9M     3.87G  … 4.00G           0 ( 0%)          -  0.9% ±  3.6%
  branch_misses      3.33G  ± 6.07M     3.32G  … 3.34G           0 ( 0%)          +  0.6% ±  0.3%

The wasm file is constructed via LLVM with -Os optimizations and then further optimized by wasm-opt -Oz --enable-bulk-memory --enable-sign-ext. zig1.no-memcpy-check.wasm was produced by master branch with ziglang/zig#16345 reverted; zig1.with-len0-check.wasm produced by master branch (ziglang/zig@e05c242).

Edit: I still think this check belongs in runtimes, not producers.

@titzer
Copy link

titzer commented Jul 13, 2023

Given a choice between two semantics that are both safe but one implies an additional check for a Wasm engine to perform (on essentially all hardware), we've generally opted to save the check in the Wasm level and push it up to producers. The reasoning is that a check at the Wasm level costs all producers and a check at the producer level only affects some--in particular, a producer may have a more powerful static analysis to remove said check (in the case, 0-byte memcopies).

I am against making a change to the semantics here because it's a breaking change (something no longer traps[1]) that would only benefit Wasm binary size (for non-optimizing producers). If we want a zero-length checking memcpy variant, we should introduce a new opcode.

[1] Applications can and do use traps as a safety check mechanism--by changing behavior, this breaking change could cause applications to execute in unanticipated ways.

@penzn
Copy link

penzn commented Jul 13, 2023

I think current semantics are somewhat unfortunate, because we are turning something which could be a complete nop into a address bounds check without a corresponding memory access. Even if it can be eliminated it still carries some (small?) cost, without an obvious benefit. I guess for me the issue is not when/where to emit the check, but why the check is necessary at all.

On the other hand, I think LLVM mismatch isn't that much of a big deal - LLVM would remove the intrinsic call if it can prove length parameter is zero bytes (and in some other cases possibly), which is valid in C, the invocations that can't be removed are just passed along to the runtime library as is. While those might trap in some cases in Wasm it doesn't make LLVM transformations wrong, at least in a sense of not turning an invalid program into a valid one.

@rossberg
Copy link
Member

@andrewrk, you seem to be ignoring the main observation made several times in this thread. Let me repeat @tlively's question from above: does the zig implementation perform the upfront OOB check that is also required by the Wasm spec?

@penzn, AFAICS, the assumption that the current semantics leads to more checks is just false. When actually taking the OOB check into consideration, it is in fact the opposite, which is why we originally went with the current semantics. The only case that could benefit from the change would be a mem.copy whose length is statically known to be zero, because that could be turned into a nop. All other cases (i.e., the useful ones) would pay with an extra conditional.

@conrad-watt
Copy link
Contributor

conrad-watt commented Jul 13, 2023

does the zig implementation perform the upfront OOB check that is also required by the Wasm spec?

IIUC, Zig -> LLVM -> Wasm should be performing this check already, and @tlively's observation only applies to the stripped down wasm2c that's (only?) used in Zig's bootstrapping process.

EDIT:

The only case that could benefit from the change would be a mem.copy whose length is statically known to be zero, because that could be turned into a nop. All other cases (i.e., the useful ones) would pay with an extra conditional.

In the case that the length is really statically known to be zero, in principle analogous optimisations could be carried out whether the zero-length check is producer-level or engine-level. I think the two ways of distinguishing the two approaches are:

  • does any Wasm backend actually get to naively emit memory.copy (etc) without a zero-length check given the current semantics? In principle this is possible given C/C++'s source semantics, but it seems like anything going through LLVM requires that a zero-length check be present

  • in the cases that a zero-length check is inevitably required, is there any difference between the two approaches in ability to optimise such checks for size/speed?

@andrewrk
Copy link

andrewrk commented Jul 13, 2023

@andrewrk, you seem to be ignoring the main observation made several times in this thread. Let me repeat @tlively's question from above: does the zig implementation perform the upfront OOB check that is also required by the Wasm spec?

I'm not sure which comment you're referring to, since I don't see a question directed at me, but I already answered your question here: #1482 (comment)

It's important to remember that producers and consumers have different relationships with the specification. Zig's wasm2c is not intended to be a specification compliant, general-purpose tool. Its only purpose is to input a specification-compliant webassembly executable and convert it into C code that can then be used to bootstrap our compiler.

You might be tempted to dismiss this use case because it's not a general purpose wasm runtime, but it's important to remember that one of the main benefits of a specification is that it allows for tools exactly like this - ones that rely on the other party adherring to specification and then do whatever happens to be useful.

This is one reason I think if there needs to be an extra branch on one side of the fence, it should be on the runtime side.

@rossberg
Copy link
Member

Now I'm really confused. Are you asking that the Wasm standard should be changed in order to enable an intentionally non-compliant implementation to do a certain optimisation, while imposing extra costs on all compliant implementations? Moreover, if an implementation is non-compliant anyway, why can't it relax this rule as it sees fit?

@tlively
Copy link
Member Author

tlively commented Jul 13, 2023

@andrewrk, he was referring to my comment #1482 (comment), pointing out that Zig's wasm2c implementation, which you provided as an example runtime that "does not actually need to do a length zero check, or even a trap check on the pointer" does not appear to be spec-compliant, and making it spec compliant would require adding a trap check but not a zero check under the current semantics, or both a trap check and a zero check (to make the trap check not apply in that special case) under the semantics you are arguing for.

Edit: But I guess you already figured out what he was referring to, so now I'm confused as well for the same reason as @rossberg

@andrewrk
Copy link

andrewrk commented Jul 13, 2023

I think y'all are confusing yourselves.

Semantically, a trap happens to prevent a read or write from an invalid memory address. A zero-length operation does not actually read or write from an invalid memory address. We all understand this intuitively. The only reason to not codify it into the spec would be for some indisputable, pragmatic reason, in which we are willing to compromise the null hypothesis in order to satisfy some real world use case.

This framing of the problem is undisputed. The only thing being discussed in this thread is what pragmatism looks like. So far I'm the only one who has provided any real world examples and any real world performance measurements. I don't work on a web browser or a general-purpose wasm runtime, so that's not really what I bring to the table. I invite those representatives to do their part.

@conrad-watt
Copy link
Contributor

conrad-watt commented Jul 13, 2023

This framing of the problem is undisputed.

One dynamic that isn't surfaced in the above framing - there's a bias towards the incumbent semantics because it's relatively expensive to coordinate a semantic change across Wasm runtimes (consider - users can choose the version of toolchain they use to produce Wasm, but by and large not the version of browser that the Wasm runs in). So it's not totally accurate to frame maintaining our current semantics as "compromis[ing] the null hypothesis" - if anything the burden of proof is the other way round (edit: hence my comments on "signals" above).

@tlively
Copy link
Member Author

tlively commented Jul 13, 2023

I think we've made it very clear that spec-compliant runtimes would have to emit an extra zero check if we exempted zero-length copies from the normal up-front bounds checks, so this change is performance-neutral at best where toolchains would previously have had to emit the zero check themselves and a pessimization where toolchains already would have been able to avoid a zero check. @andrewrk, do you agree with this so far? You have not acknowledged these points yet.

The only way to make exempting zero-length copies from bounds checks an optimization would be if we stopped requiring up-front bounds checks and allowed the bounds checking to be done via trap handlers, possibly after a partial write. But that's a much bigger change and one we explicitly decided against when bulk memory operations were standardized, so I would not want to open that back up for discussion without a great deal of new information.

@andrewrk
Copy link

andrewrk commented Jul 13, 2023

I think I've contributed as much as I can to this issue. I'll leave you all with a summary of my position:

  • Zig's LLVM backend currently requires an additional len==0 check to work around [WebAssembly] zero-length memcpy does not result in no-op when bulk-memory is enabled llvm/llvm-project#63755 which is an LLVM bug and needs to be fixed in LLVM.
  • Zig's WebAssembly backend currently requires an additional len==0 check to lower memset and memcpy, according to the status quo WebAssembly specification.
  • Perf data point: zero length bulk memory behavior incompatible with LLVM intrinsics #1482 (comment)
  • In general, it is better for wasm binaries to be thinner and runtimes to be fatter. I don't acknowledge the claim that this would be a pessimization in runtimes. I suspect it may even be an optimization. I think the claims made in this thread lack the support of real world performance measurements. It's also not taking into account the cost of extra bytes over the Internet from fatter wasm binaries. In general, the less logic that is inside wasm code, the fewer unknown-unknown problems occur. That's the whole point of bulk-memory operations after all - offloading some branching logic to the runtime.
  • Additional opcodes for bulk memory operations that allow len==0 with invalid pointer address would be considered an improvement from status quo. If both variants were available, it would be very interesting to see which ones end up preferred by producers.

@tlively
Copy link
Member Author

tlively commented Jul 13, 2023

Thanks, @andrewrk. FWIW, I think your position that runtimes should do more so toolchains can do less is entirely reasonable, but not the position WebAssembly has historically taken when presented with similar trade offs.

I'll go ahead and close this issue, but if anyone does have new information to contribute, feel free to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants