-
Notifications
You must be signed in to change notification settings - Fork 5.2k
[RISC-V] Use getUnrollThreshold() to get unroll limits #98798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Btw, I was wondering about this: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Buffer.cs#L5
It enables block store/copy on Loongarch but doesn't on RISC-V - is it expected?
Or in other words, what is the RISC-V codegen for: void Foo(ref byte dst, ref byte src)
{
Unsafe.CopyBlockUnaligned(ref dst, ref src, 64);
} ? |
d5af89a
to
433f711
Compare
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
hi,@clamp03 |
@monstercatss Thank you so much! |
Diff results for #98798Throughput diffsThroughput diffs for osx/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Details here |
Yes, the ISA we're currently targeting is RV64GC, no "V" (vector ops) extension. |
SIMD is not necessary here, void Foo(ref byte dst, ref byte src)
{
Unsafe.CopyBlockUnaligned(ref dst, ref src, 64);
} is inlined in codegen whether to a set of simd or GPR stores (supports such blocks) or MEMSET call (doesn't support). + // | riscv64 | 64 | 32 | no SIMD support change in this PR, it should support them. |
2fa1e40
to
321b134
Compare
In that case, I think you're right. The only caveat I can think of is the one discussed here. Currently Unsafe.CopyBlockUnaligned generates misaligned GPR loads/stores; in theory the ISA we're compiling against permits not supporting them (that's RV32, and I don't see RV64 overriding that). In practice I don't know of a general-purpose RISC-V 64-bit implementation that doesn't support misaligned load/stores for non-atomic instructions, RVA20U64 profile even mandates it. We're currently focusing on bugfixing and missing features, codegen optimizations will come later. We'll have to think what to do about it, maybe .NET should simply require it and compile with |
Totally makes sense!
I presume it can emit a set of byte stores (for that, you will need to decrease the threshold in |
If RV64GC then there are really only two sensible things to do for larger blocks:
The profile mandates that non-aligned load/store don't crash the program in User-mode programs. Implementations are free to trap to M mode and have SBI emulate it. The necessary code is always present, just in case, so for example accesses that cross a cache like or VM page might trap even if others don't. It seems VisionFive 2 does support misaligned, but not very quickly.
With -O1 this takes 11.4s vs 2.0s if the amount added to s is a multiple of 8. (I checked the codegen and it's the expected "ld;add;addiw -1;bnez") On the Lichee Pi 4A (THead C910) it's 4.65s vs 1.1s On my x86 box (2990wx) it's 0.27s either way. As a comparison, using the following takes 3.36s on VF2, so it's a bit faster than doing misaligned in hardware:
On the Lichee Pi it takes 1.65s, almost three times faster than the misaligned access. On the AMD it takes 0.51s, almost twice as slow as the misaligned access. So ... avoid misaligned access if possible, but it's not a complete disaster if it happens sometimes, on the cores we have now, and future cores. If it's in a loop then better to call the system memcpy() I think. This will also automatically pick up a vectorised version in future CPU cores. |
btw,Is threshold just based on performance? Is it faster to use memset call when size is larger than 512 on x86 avx512? |
Native compilers prefer to unroll it too https://godbolt.org/z/vTYMP5dYc (without alignment hints, so I presume even 8 unaligned stores are better). Also, previously, memset call was non-interruptable, so GC had to wait it to finish because of cooperative mode. |
Can you please fix the formatting errors and conflicts? |
done |
#83274