Replacing `memset` and `memcpy` calls with `memory.fill` and `memory.copy`?
#4403
This is an interesting idea!
I would definitely want to see some measurements showing that using memory.copy and memory.fill would be faster than these Wasm implementations. I'm not sure that the difference would be that large, but it would be interesting to find out.
Since Binaryen generally avoids making assumptions about function naming conventions like these, and in general might not get to see the function names, I don't think we would want to turn on a pass like this by default; e.g. I wouldn't want to include it in the passes run by …
This would be really interesting as a code size optimization even if it didn't turn out to have much performance benefit.
Oh, they can barely even be compared, given large enough chunks of memory to work with, perhaps. … And here is the same thing with … The removed … You can see how much faster …
Alright then, CFG matching and surgery it should be.
I have started working on something...
Although, the same issue could also be raised for the Wasm Rust WG, because the included …
@nickbabcock The current Rust 1.58.0 beta already performs somewhat better in this regard, and it should be released as stable in just under two weeks. Maybe you could give that a try too.
Thanks for the heads up. Spent a few minutes profiling the 1.58 beta. There does seem to be a small improvement (a low single-digit percentage). My workload can be characterized by small allocations, so the word-wise copies introduced in 1.58 shouldn't have too big of an impact (and idk, maybe …
I see. You can try that too with …
Nice idea! What sizes of memcpy/memset are measured here? It definitely makes sense that big ones would get faster with this, but I worry about small ones. The VM may emit a call instead of inline code, which can potentially be a lot slower. I vaguely remember measurements about that from a few years ago that were not good, but perhaps things have improved. If we measure that and it's an issue, then we could perhaps avoid doing this when the size is constant and small enough.
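One way to act on the small-size concern is to decide per call site: if the length operand is a compile-time constant at or below some threshold, emit a few inline loads/stores instead of a bulk-memory instruction. This is a minimal sketch under assumptions: `choose_lowering` and `Lowering` are hypothetical names, and `INLINE_LIMIT = 16` is a placeholder, since the real cutoff would have to come from measurements. It splits a constant size greedily into 8-, 4-, 2- and 1-byte accesses, matching the widths of Wasm's `i64.store`, `i32.store`, `i32.store16` and `i32.store8`.

```rust
/// How a `memcpy`/`memset` call site might be lowered (hypothetical).
#[derive(Debug, PartialEq)]
enum Lowering {
    /// Inline loads/stores with these access widths in bytes, in order.
    Inline(Vec<u32>),
    /// A single `memory.copy` / `memory.fill` instruction.
    BulkInstruction,
}

/// Placeholder threshold (hypothetical); the right value would need measuring per engine.
const INLINE_LIMIT: u32 = 16;

/// `size` is `Some(n)` when the length operand is a known constant `n`.
fn choose_lowering(size: Option<u32>) -> Lowering {
    match size {
        Some(n) if n <= INLINE_LIMIT => {
            let mut widths = Vec::new();
            let mut remaining = n;
            // Greedily use the widest access that still fits in what is left.
            for width in [8, 4, 2, 1] {
                while remaining >= width {
                    widths.push(width);
                    remaining -= width;
                }
            }
            Lowering::Inline(widths)
        }
        // Unknown (runtime) or large sizes go to the bulk-memory instruction.
        _ => Lowering::BulkInstruction,
    }
}
```

For example, a constant 7-byte copy becomes a 4-, 2- and 1-byte access, while a copy whose length is only known at run time always uses the bulk instruction.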
Thanks @torokati44! 8 is indeed fairly small. There is some data on typical values of memcpy/memset in the wild, like this paper (raw data). The common values are usually very small. So 8 is in the relevant range. (Those links are for LLVM-libc work, it may make sense to learn from them regarding possible solutions at the toolchain level.)
I would recommend doing that. Copying 4 bytes at a time is what emscripten does atm, for example. And yes, SIMD can do even better. Potentially it could compete with the browser's internal copying.
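The 4-bytes-at-a-time strategy can be sketched in safe Rust as follows. This only illustrates the loop shape and is not emscripten's actual implementation; `copy_words` is a hypothetical name, and the explicit `u32` round trip is there just to make the word-sized accesses visible.

```rust
/// Copies `src` into `dst` one 4-byte word at a time, then copies the
/// remaining 0-3 tail bytes individually. Panics if the lengths differ.
fn copy_words(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len(), "dst and src must have the same length");
    let mut dst_words = dst.chunks_exact_mut(4);
    let mut src_words = src.chunks_exact(4);
    for (d, s) in (&mut dst_words).zip(&mut src_words) {
        // Read one word from the source chunk and write it to the destination chunk.
        let word = u32::from_ne_bytes([s[0], s[1], s[2], s[3]]);
        d.copy_from_slice(&word.to_ne_bytes());
    }
    // Tail: the bytes that did not fill a whole word.
    for (d, s) in dst_words.into_remainder().iter_mut().zip(src_words.remainder()) {
        *d = *s;
    }
}
```

A SIMD variant would follow the same structure with 16-byte chunks instead of 4-byte ones.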
Also reasonable. For comparison, in emscripten, if the size is large enough, we call out to JS and use typed array …
Just a heads-up: With ruffle-rs/ruffle#5834 just merged, and Rust 1.58 (with its improved built-in memory loops) also being right around the corner, this has become a lot less important for me personally, so I don't think I'll pursue this any further in the near future. While it does sound really interesting, I only have so much free time, and so many things I want to do.
If I were to try working on a `wasm-opt` pass doing what's written in the title, would it be accepted? Gated behind the `bulk-memory` extension/feature, of course.
The rationale I have for this is that when compiling Rust, even with the `bulk-memory` target feature (codegen option) enabled, a lot of naive (slow) `memcpy` and `memset` calls are left in the result, because they are called from `std`/`core` library functions, such as `__rust_alloc_zeroed` and `__rust_realloc`. These can be avoided by passing the `build-std` option to Cargo, but that is still an unstable, nightly-only feature.
Of course, this makes some assumptions about what a function that happens to be named "memset" or "memcpy" is supposed to be doing, but in a lot of compiler toolchains, these are pretty much already handled as intrinsics anyway.
One alternative would be to detect some forms of memory copying or filling loops in the control flow graph instead, and only replace those with intrinsics.
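The control-flow alternative is similar in spirit to LLVM's loop idiom recognition pass: once an analysis has reduced a loop to a simple byte-at-a-time pattern, the remaining step is to classify it. The sketch below assumes such an analysis already exists and has produced a hypothetical `ByteLoop` summary; `recognize`, `StoredByte` and `BulkOp` are likewise illustrative names, and only unit-stride loops are covered. Because `memory.copy` has `memmove` semantics, while a forward byte loop does not when the destination starts inside the source range, the copy case is conservatively accepted only when the two ranges are known to be disjoint.

```rust
/// What a (hypothetical) loop analysis found the loop body storing at `dst + i`.
#[derive(Debug, Clone, PartialEq)]
enum StoredByte {
    /// The same constant byte on every iteration.
    Constant(u8),
    /// The byte loaded from `src + i`, where `src` is a local index.
    LoadedFrom { src: u32 },
}

/// Hypothetical summary of `for i in 0..len { store8(dst + i, <byte>) }`,
/// with `dst` and `len` given as local indices.
#[derive(Debug, Clone)]
struct ByteLoop {
    dst: u32,
    len: u32,
    /// Increment of the induction variable `i` per iteration.
    step: i32,
    stored: StoredByte,
    /// Whether alias analysis proved the source and destination ranges disjoint.
    proven_disjoint: bool,
}

/// The bulk-memory instruction a recognized loop would be replaced with.
#[derive(Debug, PartialEq)]
enum BulkOp {
    Fill { dst: u32, value: u8, len: u32 },
    Copy { dst: u32, src: u32, len: u32 },
}

fn recognize(l: &ByteLoop) -> Option<BulkOp> {
    // Only unit-stride loops touch one contiguous byte range.
    if l.step != 1 {
        return None;
    }
    match l.stored {
        StoredByte::Constant(value) => Some(BulkOp::Fill { dst: l.dst, value, len: l.len }),
        // A forward byte loop only matches `memory.copy` if it never reads bytes it
        // already overwrote; requiring disjoint ranges rules that out.
        StoredByte::LoadedFrom { src } if l.proven_disjoint => {
            Some(BulkOp::Copy { dst: l.dst, src, len: l.len })
        }
        StoredByte::LoadedFrom { .. } => None,
    }
}
```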
What do you think?