
Replacing memset and memcpy calls with memory.fill and memory.copy? #4403

Open · torokati44 opened this issue Dec 20, 2021 · 12 comments

@torokati44 commented Dec 20, 2021

If I were to try working on a wasm-opt pass doing what's written in the title, would it be accepted? Gated behind the bulk-memory extension/feature of course.

The rationale I have for this is that when compiling Rust, even with the bulk-memory target feature (codegen option) enabled, a lot of naive (slow) memcpy and memset calls are left in the result, because they are called from std/core library functions, such as __rust_alloc_zeroed and __rust_realloc. These can be avoided by passing the build-std option to Cargo, but that is still an unstable, nightly-only feature.
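For concreteness, the invocation meant here is something like the following (an untested sketch; the exact flags may vary by toolchain version):

```sh
# Rebuild std/core with bulk-memory enabled (requires the
# nightly-only -Z build-std Cargo feature).
RUSTFLAGS="-C target-feature=+bulk-memory" \
  cargo +nightly build --release \
    -Z build-std=std,panic_abort \
    --target wasm32-unknown-unknown
```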

Of course, this makes some assumptions about what a function that happens to be named "memset" or "memcpy" is supposed to do, but in a lot of compiler toolchains, these are pretty much treated as intrinsics already anyway.

One alternative would be to detect some forms of memory copying or filling loops in the control flow graph instead, and only replace those with intrinsics.
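Roughly, the shape of loop to look for would be something like this, sketched as Rust for readability (the matching itself would of course happen on Binaryen IR, and naive_memset is just an illustrative name):

```rust
// The naive byte-at-a-time fill pattern a pass would try to recognize:
// a pointer walked one byte per iteration, storing a constant value.
// In Wasm this shows up as a small loop around an i32.store8 plus
// pointer/counter updates, replaceable wholesale by one memory.fill.
unsafe fn naive_memset(mut dst: *mut u8, val: u8, mut n: usize) {
    while n > 0 {
        *dst = val;
        dst = dst.add(1);
        n -= 1;
    }
}
```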

What do you think?

@tlively (Member) commented Dec 20, 2021

This is an interesting idea!

> The rationale I have for this is that when compiling Rust, even with the bulk-memory target feature (codegen option) enabled, a lot of naive (slow) memcpy and memset calls are left in the result...

I would definitely want to see some measurements showing that using memory.copy and memory.fill would be faster than these Wasm implementations. I'm not sure that the difference would be that large, but it would be interesting to find out.

> Of course, this makes some assumptions about what a function that happens to be named "memset" or "memcpy" is supposed to do, but in a lot of compiler toolchains, these are pretty much treated as intrinsics already anyway.

Since Binaryen generally avoids making assumptions about function naming conventions like these and in general might not get to see the function names, I don't think we would want to turn on a pass like this by default, e.g. I wouldn't want to include it in the passes run by -O2. It still might be interesting to add to Binaryen as a pass that users (or toolchains) can explicitly opt into, though.

> One alternative would be to detect some forms of memory copying or filling loops in the control flow graph instead, and only replace those with intrinsics.

This would be really interesting as a code size optimization even if it didn't turn out to have much performance benefit.

@torokati44 (Author) commented Dec 20, 2021

> I would definitely want to see some measurements showing that using memory.copy and memory.fill would be faster than these Wasm implementations. I'm not sure that the difference would be that large, but it would be interesting to find out.

Oh, they can barely even be compared ... given large enough chunks of memory to work with, perhaps.
Here is a few-seconds-long profile of a locally deployed https://ruffle.rs/demo running https://z0r.de/L/z0r-de_7617.swf, without bulk-memory:
https://share.firefox.dev/3qbU7Ss

And here is the same thing with bulk-memory:
https://share.firefox.dev/3EfzycL

The removed memset calls alone accounted for almost 10% of the original total runtime, with another 3.3% remaining.

You can see how much faster vp_idct is, for example; the memset all but disappeared from it. And it's not just inlined or something, it is actually much quicker. Meanwhile, __rust_alloc_zeroed unfortunately stays the same, as mentioned in the original description.

> Binaryen generally avoids making assumptions about function naming conventions like these and in general might not get to see the function names

Alright then, CFG matching and surgery it should be.
Finding the sweet spot, in terms of how large the copied/set chunk of memory has to be for the intrinsic to be faster, will be an interesting experiment (my prediction is that the intrinsic will start to win really quickly)...

@nickbabcock commented:
I'd also like to echo this: I'm observing memset and memcpy show up as performance culprits in my Rust Wasm applications as well (though to a less extreme degree).

[profiler screenshot]

So it would seem that if wasm-opt gained the proposed functionality, it could have a wide benefit.

@torokati44 (Author) commented Dec 24, 2021

I have started working on something...
https://github.com/torokati44/binaryen/commits/bulkmemory-intrinsics
Please excuse the quality; it's just an experimental working draft for myself, and it only does anything with memset. Also, for the moment, it makes things about 3x-4x slower, because it doesn't yet recognize and replace the loop itself, only (part of) its body, which most likely works against the JIT, in addition to the overhead of the intrinsic.

@torokati44 (Author) commented Dec 24, 2021

Although, the same issue could also be raised with the Rust Wasm WG, because the included memset and memcpy implementations (with the bulk-memory codegen option not enabled) process one byte at a time (partially unrolled into batches of 8), with i32 intermediates. They could easily copy/set 4 or even 8 bytes at a time, not to mention a v128 at a time if the SIMD extension is enabled.
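Loosely sketched (memset_wordwise is a made-up name, and the alignment head-loop a real implementation would want is omitted):

```rust
// Illustrative word-wise memset: write 4 bytes per store instead of 1.
unsafe fn memset_wordwise(dst: *mut u8, val: u8, n: usize) {
    let word = u32::from_ne_bytes([val; 4]); // splat the byte into a word
    let words = n / 4;
    for i in 0..words {
        // Wasm stores tolerate any alignment, so an unaligned store
        // is valid here (though aligning first would be faster).
        (dst.add(i * 4) as *mut u32).write_unaligned(word);
    }
    // Tail: the remaining 0..=3 bytes, one at a time.
    for i in (words * 4)..n {
        *dst.add(i) = val;
    }
}
```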

@torokati44 (Author) commented:
@nickbabcock The current Rust 1.58.0 beta is already performing somewhat better in this regard, and it should come out as stable in just under two weeks. Maybe you could give that a try too.

@nickbabcock commented:
Thanks for the heads up. I spent a few minutes profiling the 1.58 beta. There does seem to be a small improvement (low single-digit percentage). My workload is characterized by small allocations, so the word-wise copies introduced in 1.58 shouldn't have too big of an impact (and, I don't know, maybe memory.fill won't either).

@torokati44 (Author) commented Jan 1, 2022

I see. You can try that too with RUSTFLAGS="-C target-feature=+bulk-memory" (or similar), but anything under 10-100 bytes likely won't benefit from it due to the overhead it has. And the built module won't work in Safari 14.x.

@kripken (Member) commented Jan 5, 2022

Nice idea!

What size memcpy/memset are measured here? It definitely makes sense that big ones would get faster with this, but I worry about small ones. The VM may do a call instead of emitting the code inline, which can potentially be a lot slower. I vaguely remember measurements about that from a few years ago that were not good, but perhaps things have improved.

If we measure that and it's an issue then we could perhaps avoid doing this when the size is constant and small enough.

@torokati44 (Author) commented Jan 5, 2022

> What size memcpy/memset are measured here?

In the linked/attached profiles? I don't know, really. And since the memsets are mostly compiler-generated, I also don't know of any straightforward ways to tell.

Anyway, a fellow developer did this benchmark:
[benchmark chart]
(Ignore the wasm-opt in the legend labels; it didn't make much of a difference as of right now. The line styles correspond to Rust stable (1.57), nightly (1.59), and, I think, nightly again but with bulk-memory enabled.)

This is how he summarized it: "main point being, indeed under 8 bytes a memcpy can be 2-3x cheaper than memory.copy"

I'd say 8 bytes is a really low turnover point; I would have guessed/expected something closer to, say, 100.

This was the code he ran:

```rust
use wasm_bindgen::prelude::*;

// copy_from_slice compiles down to a memcpy call
// (or to memory.copy when bulk-memory is enabled).
#[inline(never)]
pub fn bench(dst: &mut Vec<u8>, src: &Vec<u8>) {
    if dst.len() == src.len() {
        dst[..].copy_from_slice(&src);
    }
}

#[wasm_bindgen]
pub fn work(n: usize, runs: usize) {
    let mut a = vec![0; n];
    let b = vec![0; n];
    for _ in 0..runs {
        bench(&mut a, &b);
    }
}
```

Maybe the optimization could, if the size is not a known constant, even add a branch, to do it one way or the other based on the number of bytes to be processed...
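In Rust terms, the emitted code could behave like this hypothetical sketch (the pass would emit the equivalent Binaryen IR; the threshold of 8 is taken from the benchmark above and would need proper tuning):

```rust
const THRESHOLD: usize = 8; // turnover point from the benchmark; needs tuning

/// Hypothetical dispatch: below the threshold, keep the plain byte
/// loop; at or above it, take the bulk-memory path. Here
/// copy_nonoverlapping stands in for memory.copy, which it lowers
/// to when the bulk-memory target feature is enabled.
unsafe fn copy_dispatch(dst: *mut u8, src: *const u8, n: usize) {
    if n < THRESHOLD {
        for i in 0..n {
            *dst.add(i) = *src.add(i);
        }
    } else {
        core::ptr::copy_nonoverlapping(src, dst, n);
    }
}
```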

@kripken (Member) commented Jan 5, 2022

Thanks @torokati44! 8 is indeed fairly small.

There is some data on typical values of memcpy/memset in the wild, like this paper (raw data). The common values are usually very small, so 8 is in the relevant range. (Those links are for LLVM-libc work; it may make sense to learn from them regarding possible solutions at the toolchain level.)

> Although, the same issue could also be raised with the Rust Wasm WG, because the included memset and memcpy implementations (with the bulk-memory codegen option not enabled) process one byte at a time (partially unrolled into batches of 8), with i32 intermediates. They could easily copy/set 4 or even 8 bytes at a time, not to mention a v128 at a time if the SIMD extension is enabled.

I would recommend doing that. Copying 4 bytes at a time is what emscripten does atm, for example.

And yes, SIMD can do even better. Potentially it could compete with the browser's internal copying.

> Maybe the optimization could, if the size is not a known constant, even add a branch, to do it one way or the other based on the number of bytes to be processed...

Also reasonable. For comparison, in emscripten, if the size is large enough we call out to JS and use typed array .set(), which is very fast. I assume that uses the same memcpy that bulk memory would, but it has better backwards compatibility.

@torokati44 (Author) commented Jan 13, 2022

Just a heads-up: With ruffle-rs/ruffle#5834 just merged, and Rust 1.58 (with its improved built-in memory loops) also being right around the corner, this got a lot less important for me personally, so I don't think I'll pursue this any further in the near future. While it does sound really interesting, I only have so much free time, and so many things I want to do.
If anyone wants to pick it up, feel free to look at the branch linked above in my fork; or take some ideas from here, which looks like it does a really similar thing, albeit for a different IR: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
