
Replacing memset and memcpy calls with memory.fill and memory.copy? #4403

Open · torokati44 opened this issue Dec 20, 2021 · 12 comments

@torokati44 commented Dec 20, 2021

If I were to try working on a wasm-opt pass doing what's written in the title, would it be accepted? Gated behind the bulk-memory extension/feature of course.

The rationale I have for this is that when compiling Rust, even with the bulk-memory target feature (codegen option) enabled, a lot of naive (slow) memcpy and memset calls are left in the result, because they are called from std/core library functions, such as __rust_alloc_zeroed and __rust_realloc. These can be avoided by passing the build-std option to Cargo, but that is still an unstable, nightly-only feature.
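For concreteness, the invocation meant here is something like the following (an untested sketch; the exact flags may vary by toolchain version):

```sh
# Rebuild std/core with bulk-memory enabled (requires the
# nightly-only -Z build-std Cargo feature).
RUSTFLAGS="-C target-feature=+bulk-memory" \
  cargo +nightly build --release \
    -Z build-std=std,panic_abort \
    --target wasm32-unknown-unknown
```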

Of course, this makes some assumptions about what a function that happens to be named "memset" or "memcpy" is supposed to do, but in a lot of compiler toolchains, these are pretty much treated as intrinsics already anyway.

One alternative would be to detect some forms of memory copying or filling loops in the control flow graph instead, and only replace those with intrinsics.
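Roughly, the shape of loop to look for would be something like this, sketched as Rust for readability (the matching itself would of course happen on Binaryen IR, and naive_memset is just an illustrative name):

```rust
// The naive byte-at-a-time fill pattern a pass would try to recognize:
// a pointer walked one byte per iteration, storing a constant value.
// In Wasm this shows up as a small loop around an i32.store8 plus
// pointer/counter updates, replaceable wholesale by one memory.fill.
unsafe fn naive_memset(mut dst: *mut u8, val: u8, mut n: usize) {
    while n > 0 {
        *dst = val;
        dst = dst.add(1);
        n -= 1;
    }
}
```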

What do you think?

@tlively (Member) commented Dec 20, 2021

This is an interesting idea!

> The rationale I have for this is that when compiling Rust, even with the bulk-memory target feature (codegen option) enabled, a lot of naive (slow) memcpy and memset calls are left in the result...

I would definitely want to see some measurements showing that using memory.copy and memory.fill would be faster than these Wasm implementations. I'm not sure that the difference would be that large, but it would be interesting to find out.

> Of course, this makes some assumptions about what a function that happens to be named "memset" or "memcpy" is supposed to do, but in a lot of compiler toolchains, these are pretty much treated as intrinsics already anyway.

Since Binaryen generally avoids making assumptions about function naming conventions like these and in general might not get to see the function names, I don't think we would want to turn on a pass like this by default, e.g. I wouldn't want to include it in the passes run by -O2. It still might be interesting to add to Binaryen as a pass that users (or toolchains) can explicitly opt into, though.

> One alternative would be to detect some forms of memory copying or filling loops in the control flow graph instead, and only replace those with intrinsics.

This would be really interesting as a code size optimization even if it didn't turn out to have much performance benefit.

@torokati44 (Author) commented Dec 20, 2021

> I would definitely want to see some measurements showing that using memory.copy and memory.fill would be faster than these Wasm implementations. I'm not sure that the difference would be that large, but it would be interesting to find out.

Oh, they can barely even be compared ... given large enough chunks of memory to work with, perhaps.
Here is a few-seconds-long profile of a locally deployed https://ruffle.rs/demo running https://z0r.de/L/z0r-de_7617.swf, without bulk-memory:
https://share.firefox.dev/3qbU7Ss

And here is the same thing with bulk-memory:
https://share.firefox.dev/3EfzycL

The removed memset calls alone accounted for almost 10% of the original total runtime, with another 3.3% remaining.

You can see how much faster vp_idct is, for example; the memset all but disappeared from it. And it's not just inlined or something, it is actually much quicker. Meanwhile, __rust_alloc_zeroed unfortunately stays the same, as mentioned in the original description.

> Binaryen generally avoids making assumptions about function naming conventions like these and in general might not get to see the function names

Alright then, CFG matching and surgery it should be.
Finding the sweet spot, in terms of how large the copied/set chunk of memory has to be for the intrinsic to be faster, will be an interesting experiment (my prediction is that the intrinsic will start to win really quickly)...

@nickbabcock commented:
I'd also like to echo this: I'm observing memset and memcpy show up as performance culprits in my Rust Wasm applications as well (though to a less extreme degree).

[profiler screenshot]

So it would seem that if wasm-opt gained the proposed functionality, it could have a wide benefit.

@torokati44 (Author) commented Dec 24, 2021

I have started working on something...
https://github.com/torokati44/binaryen/commits/bulkmemory-intrinsics
Please excuse the quality; it's just an experimental working draft for myself, and it only does anything with memset. Also, for the moment, it makes things about 3x-4x slower, because it doesn't yet recognize and replace the loop itself, only (part of) its body, which most likely works against the JIT, in addition to the overhead of the intrinsic.

@torokati44 (Author) commented Dec 24, 2021

Although, the same issue could also be raised with the Rust Wasm WG, because the included memset and memcpy implementations (with the bulk-memory codegen option not enabled) process one byte at a time (partially unrolled into batches of 8), with i32 intermediates. They could easily copy/set 4 or even 8 bytes at a time, not to mention a v128 at a time if the SIMD extension is enabled.
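Loosely sketched (memset_wordwise is a made-up name, and the alignment head-loop a real implementation would want is omitted):

```rust
// Illustrative word-wise memset: write 4 bytes per store instead of 1.
unsafe fn memset_wordwise(dst: *mut u8, val: u8, n: usize) {
    let word = u32::from_ne_bytes([val; 4]); // splat the byte into a word
    let words = n / 4;
    for i in 0..words {
        // Wasm stores tolerate any alignment, so an unaligned store
        // is valid here (though aligning first would be faster).
        (dst.add(i * 4) as *mut u32).write_unaligned(word);
    }
    // Tail: the remaining 0..=3 bytes, one at a time.
    for i in (words * 4)..n {
        *dst.add(i) = val;
    }
}
```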

@torokati44 (Author) commented:
@nickbabcock The current Rust 1.58.0 beta is already performing somewhat better in this regard, and it should come out as stable in just under two weeks. Maybe you could give that a try too.

@nickbabcock commented:
Thanks for the heads up. I spent a few minutes profiling the 1.58 beta. There does seem to be a small improvement (low single-digit percentage). My workload is characterized by small allocations, so the word-wise copies introduced in 1.58 shouldn't have too big of an impact (and, I don't know, maybe memory.fill won't either).

@torokati44 (Author) commented Jan 1, 2022

I see. You can try that too with RUSTFLAGS="-C target-feature=+bulk-memory" (or similar), but anything under 10-100 bytes likely won't benefit from it due to the overhead it has. And the built module won't work in Safari 14.x.

@kripken (Member) commented Jan 5, 2022

Nice idea!

What size memcpy/memset are measured here? It definitely makes sense that big ones would get faster with this, but I worry about small ones. The VM may do a call instead of emitting the code inline, which can potentially be a lot slower. I vaguely remember measurements about that from a few years ago that were not good, but perhaps things have improved.

If we measure that and it's an issue then we could perhaps avoid doing this when the size is constant and small enough.

@torokati44 (Author) commented Jan 5, 2022

> What size memcpy/memset are measured here?

In the linked/attached profiles? I don't know, really. And since the memsets are mostly compiler-generated, I also don't know of any straightforward ways to tell.

Anyway, a fellow developer did this benchmark:
[benchmark chart]
(Ignore the wasm-opt in the legend labels; it didn't make much of a difference as of right now. The line styles correspond to Rust stable (1.57), nightly (1.59), and, I think, nightly again but with bulk-memory enabled.)

This is how he summarized it: "main point being, indeed under 8 bytes a memcpy can be 2-3x cheaper than memory.copy"

I'd say 8 bytes is a really low turnover point; I would have guessed/expected something closer to, say, 100.

This was the code he ran:

```rust
use wasm_bindgen::prelude::*;

// copy_from_slice compiles down to a memcpy call
// (or to memory.copy when bulk-memory is enabled).
#[inline(never)]
pub fn bench(dst: &mut Vec<u8>, src: &Vec<u8>) {
    if dst.len() == src.len() {
        dst[..].copy_from_slice(&src);
    }
}

#[wasm_bindgen]
pub fn work(n: usize, runs: usize) {
    let mut a = vec![0; n];
    let b = vec![0; n];
    for _ in 0..runs {
        bench(&mut a, &b);
    }
}
```

Maybe the optimization could, if the size is not a known constant, even add a branch, to do it one way or the other based on the number of bytes to be processed...
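In Rust terms, the emitted code could behave like this hypothetical sketch (the pass would emit the equivalent Binaryen IR; the threshold of 8 is taken from the benchmark above and would need proper tuning):

```rust
const THRESHOLD: usize = 8; // turnover point from the benchmark; needs tuning

/// Hypothetical dispatch: below the threshold, keep the plain byte
/// loop; at or above it, take the bulk-memory path. Here
/// copy_nonoverlapping stands in for memory.copy, which it lowers
/// to when the bulk-memory target feature is enabled.
unsafe fn copy_dispatch(dst: *mut u8, src: *const u8, n: usize) {
    if n < THRESHOLD {
        for i in 0..n {
            *dst.add(i) = *src.add(i);
        }
    } else {
        core::ptr::copy_nonoverlapping(src, dst, n);
    }
}
```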

@kripken (Member) commented Jan 5, 2022

Thanks @torokati44! 8 is indeed fairly small.

There is some data on typical values of memcpy/memset in the wild, like this paper (raw data). The common values are usually very small, so 8 is in the relevant range. (Those links are for LLVM-libc work; it may make sense to learn from them regarding possible solutions at the toolchain level.)

> Although, the same issue could also be raised with the Rust Wasm WG, because the included memset and memcpy implementations (with the bulk-memory codegen option not enabled) process one byte at a time (partially unrolled into batches of 8), with i32 intermediates. They could easily copy/set 4 or even 8 bytes at a time, not to mention a v128 at a time if the SIMD extension is enabled.

I would recommend doing that. Copying 4 bytes at a time is what emscripten does atm, for example.

And yes, SIMD can do even better. Potentially it could compete with the browser's internal copying.

> Maybe the optimization could, if the size is not a known constant, even add a branch, to do it one way or the other based on the number of bytes to be processed...

Also reasonable. For comparison, in emscripten, if the size is large enough we call out to JS and use typed array .set(), which is very fast. I assume that uses the same memcpy that bulk memory would, but it has better backwards compatibility.

@torokati44 (Author) commented Jan 13, 2022

Just a heads-up: With ruffle-rs/ruffle#5834 just merged, and Rust 1.58 (with its improved built-in memory loops) also being right around the corner, this got a lot less important for me personally, so I don't think I'll pursue this any further in the near future. While it does sound really interesting, I only have so much free time, and so many things I want to do.
If anyone wants to pick it up, feel free to look at the branch linked above in my fork; or take some ideas from here, which looks like it does a really similar thing, albeit for a different IR: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
