prefetch instruction #1364
Comments
I wonder if this instruction could be similar to what we talked about with the instrumentation instruction. Maybe these are specialty instructions. I'm not sure we want to group them together, but effectively: if supported, it will execute; if not supported, it will be ignored. The instrumentation instruction takes a byte as an immediate. This would probably take an i32 as an immediate? |
Yes, this is another instance of a proposed nop instruction with unmodeled side effects, just like the ITT proposal. @fbarchard How would you expect prefetch instructions to be ordered with respect to other instructions in an optimizing WebAssembly engine? You wouldn't want it to be moved around too much, but if you don't let anything move past it, it might inhibit optimizations and no longer be beneficial. |
I can let Frank reply too. As you said, moving the prefetch by one or two instructions may not matter, similar to the ITT proposal. We just wouldn't want it to be moved inside a nested loop block. Before the loop is okay; after the loop would be bad, if that makes sense? |
There are 2 cases to test performance on
The location of the prefetch doesn't matter all that much in this case.
The location of the prefetch instruction matters more here. Here's the example of gauss (from libyuv) with prefetch on arm. I've scheduled them after math instructions
|
I think what Thomas is concerned with is that when you put in a prefetch instruction, it may skid past where you think it will be in the machine code? As long as it doesn't fall inside a code path that starts at a jump target, I think it would be okay, if this makes sense? If the prefetch falls below an address that is a jump target, it may end up in a loop where we don't want it to prefetch, e.g. a nested loop. |
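To make the placement concern concrete, here is a rough C sketch, not code from libyuv or any engine. It assumes GCC or Clang (for `__builtin_prefetch`); `prefetch_hint`, `sum_image`, `hints_issued` and the counter are hypothetical names added only so the number of issued hints can be observed. The hint sits in the outer loop and so runs once per row; if a compiler sank it below the inner loop's jump target, it would run once per byte instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter, only here so the hint count can be inspected. */
static size_t g_hint_count = 0;

/* Hypothetical wrapper: counts the hint, then issues a read prefetch
 * with high temporal locality (GCC/Clang builtin). */
static void prefetch_hint(const void *addr) {
    ++g_hint_count;
    __builtin_prefetch(addr, 0, 3);
}

/* Sums a rows x stride byte image, one hint per row in the outer loop. */
uint32_t sum_image(const uint8_t *src, size_t rows, size_t stride) {
    assert(src != NULL);
    uint32_t total = 0;
    for (size_t y = 0; y < rows; ++y) {
        const uint8_t *row = src + y * stride;
        /* Outer-loop placement: aim at the start of the next row.
         * The address is formed via uintptr_t so that pointing past
         * the buffer is not a pointer-arithmetic problem; a prefetch
         * is only a hint and does not fault. */
        prefetch_hint((const void *)((uintptr_t)row + stride));
        for (size_t x = 0; x < stride; ++x) {
            /* A hint moved down to here would execute rows * stride
             * times, which is the case the comment above warns about. */
            total += row[x];
        }
    }
    return total;
}

/* Returns how many hints have been issued so far. */
size_t hints_issued(void) { return g_hint_count; }
```

With a 4 x 16 image this issues 4 hints; the same hint inside the inner loop would be issued 64 times.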
Here's an assembly function where I add a proposed WASM prefetch instruction.
|
How did you determine that 7 is a good number of cache lines ahead to prefetch at? There are several ways that explicit prefetches can end up making programs slower. Can you discuss what guidance we might be able to give wasm producers and developers as to how to use prefetches to get reliable speedups? |
In libyuv I inserted 223 prfm instructions across all aarch64 functions, tried versions with each offset, and then ran them on all 64-bit CPUs. 448 was the fastest overall. For the initial test I used 720p, a realistic but larger size. Here's an example
I tested the other forms of prefetch for ARM and only got small wins on prefetching the destination, streaming, and L2. L1 KEEP was the useful variation on ARM. The location is not critical for when it is effective, just for when it is not. Much like a nop, you need to be able to schedule it somewhere that has no impact. After all loads, during math, is a good location. I'd prefer that the V8 runtime compiler not reorder it, so we can try the instruction in different locations. Complex functions will need several prefetches, and they should be scheduled based on memory order and when cycles are available. |
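As a rough illustration of prefetching a fixed distance ahead, here is a minimal C sketch. It is not the libyuv code: `sum_row_prefetch` and `PREFETCH_DISTANCE` are hypothetical names, the 64-byte cache line size is an assumption (under which 448 bytes is 7 lines ahead), and `__builtin_prefetch` requires GCC or Clang. The hint is issued once per 64-byte chunk, before the arithmetic on that chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, so 448 bytes = 7 lines ahead. */
enum { PREFETCH_DISTANCE = 448 };

/* Hypothetical kernel: sums a row of bytes, 64 bytes per outer step. */
uint32_t sum_row_prefetch(const uint8_t *src, size_t width) {
    assert(src != NULL || width == 0);
    uint32_t total = 0;
    size_t i = 0;
    for (; i + 64 <= width; i += 64) {
        /* Read hint, keep in all cache levels. The address may lie
         * past the end of the row; it is built with integer
         * arithmetic and is never dereferenced. */
        __builtin_prefetch(
            (const void *)((uintptr_t)src + i + PREFETCH_DISTANCE), 0, 3);
        for (size_t j = 0; j < 64; ++j) {
            total += src[i + j];
        }
    }
    /* Tail: remaining bytes that do not fill a 64-byte chunk. */
    for (; i < width; ++i) {
        total += src[i];
    }
    return total;
}
```

Making the distance a compile-time constant keeps it easy to rebuild and time each candidate offset, which is the kind of sweep described above.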
Could you add a prefetch instruction for WASM?
The syntax would be similar to a load
The load instruction uses a local.get for the base pointer and an immediate offset, and then puts the result back on the stack.
A prefetch has a base and an offset, but no result.
On arm64 this is implemented with
prfm pldl1keep, [src, 448]
On x64
prefetcht0 448(src)
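For comparison, the same base-plus-offset, no-result shape can be sketched in C with the GCC/Clang builtin `__builtin_prefetch`. The helper names below are hypothetical, and the offset is an ordinary argument here rather than an instruction immediate. With locality 3 and a read hint, these compilers typically emit `prfm pldl1keep` on arm64 and `prefetcht0` on x64, though the exact lowering is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the proposed shape: takes a base and
 * an offset, produces no value. */
static inline void prefetch_l1_keep(const void *base, size_t offset) {
    /* rw = 0 (read), locality = 3 (keep in all cache levels). */
    __builtin_prefetch((const void *)((uintptr_t)base + offset), 0, 3);
}

/* Hypothetical contrast with a load: the load yields a result, the
 * prefetch only hints and cannot fault, even past the buffer. */
uint8_t load_u8_then_prefetch(const uint8_t *src, size_t offset) {
    assert(src != NULL);
    uint8_t value = src[0];        /* load: base + 0, result kept */
    prefetch_l1_keep(src, offset); /* prefetch: base + offset, no result */
    return value;
}
```

Calling `load_u8_then_prefetch(src, 448)` corresponds to the 448-byte distance used in the instructions above.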
Performance improvement varies by CPU and function. On Cortex A53 a typical speedup is 20%.
On Atom Cedarview, an SSSE3 function to convert RGB to grayscale speeds up by 9%.