[libc][SVE] add sve handling for memcpy with count less than 32b #167446
SchrodingerZhu merged 2 commits into llvm:main
Conversation
@llvm/pr-subscribers-libc

Author: Schrodinger ZHU Yifan (SchrodingerZhu)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/167446.diff

1 file affected:
diff --git a/libc/src/string/memory_utils/aarch64/inline_memcpy.h b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
index 11cf022e12b1f..7af0963c7e11e 100644
--- a/libc/src/string/memory_utils/aarch64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/aarch64/inline_memcpy.h
@@ -9,15 +9,40 @@
#define LLVM_LIBC_SRC_STRING_MEMORY_UTILS_AARCH64_INLINE_MEMCPY_H
#include "src/__support/macros/attributes.h" // LIBC_INLINE
+#include "src/__support/macros/properties/cpu_features.h"
#include "src/string/memory_utils/op_builtin.h"
#include "src/string/memory_utils/utils.h"
#include <stddef.h> // size_t
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+#include <arm_sve.h>
+#endif
namespace LIBC_NAMESPACE_DECL {
-
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+[[maybe_unused, gnu::always_inline]] LIBC_INLINE void
+inline_memcpy_aarch64_sve_32_bytes(Ptr __restrict dst, CPtr __restrict src,
+ size_t count) {
+ auto src_ptr = reinterpret_cast<const uint8_t *>(src);
+ auto dst_ptr = reinterpret_cast<uint8_t *>(dst);
+ const size_t vlen = svcntb();
+ svbool_t less_than_count_fst = svwhilelt_b8_u64(0, count);
+ svbool_t less_than_count_snd = svwhilelt_b8_u64(vlen, count);
+ svuint8_t fst = svld1_u8(less_than_count_fst, &src_ptr[0]);
+ svuint8_t snd = svld1_u8(less_than_count_snd, &src_ptr[vlen]);
+ svst1_u8(less_than_count_fst, &dst_ptr[0], fst);
+ svst1_u8(less_than_count_snd, &dst_ptr[vlen], snd);
+}
+#endif
[[maybe_unused]] LIBC_INLINE void
inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
+#if defined(LIBC_TARGET_CPU_HAS_SVE)
+ // SVE register is at least 16 bytes, we can use it to avoid branching in
+ // small cases. Here we use 2 SVE registers to always cover cases where count
+ // <= 32.
+ if (count <= 32)
+ return inline_memcpy_aarch64_sve_32_bytes(dst, src, count);
+#else
if (count == 0)
return;
if (count == 1)
@@ -34,6 +59,7 @@ inline_memcpy_aarch64(Ptr __restrict dst, CPtr __restrict src, size_t count) {
return builtin::Memcpy<8>::head_tail(dst, src, count);
if (count < 32)
return builtin::Memcpy<16>::head_tail(dst, src, count);
+#endif
if (count < 64)
return builtin::Memcpy<32>::head_tail(dst, src, count);
if (count < 128)
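For reference, below is a minimal standalone sketch (not part of the patch) of the predicated two-register copy the diff introduces: it repeats the same svcntb/svwhilelt/svld1/svst1 sequence and checks it against memcpy for every count in [0, 32]. The sve_copy_le_32 helper and the test harness are illustrative assumptions; it needs an SVE-capable CPU and a toolchain targeting SVE (e.g. -march=armv8-a+sve).

```
// Illustrative test, assuming an SVE-capable CPU and -march=armv8-a+sve.
#include <arm_sve.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Same predicated copy as the patch: two vectors cover any count <= 32
// because an SVE vector is at least 16 bytes wide.
static void sve_copy_le_32(uint8_t *dst, const uint8_t *src, size_t count) {
  const size_t vlen = svcntb();                      // bytes per SVE vector, >= 16
  const svbool_t p0 = svwhilelt_b8_u64(0, count);    // active for lanes [0, count)
  const svbool_t p1 = svwhilelt_b8_u64(vlen, count); // active for lanes [vlen, count)
  svst1_u8(p0, dst, svld1_u8(p0, src));
  svst1_u8(p1, dst + vlen, svld1_u8(p1, src + vlen));
}

int main() {
  // Buffers are oversized so dst + vlen stays a valid pointer for any
  // realistic vector length (SVE allows up to 256 bytes).
  uint8_t src[256], dst[256], ref[256];
  for (int i = 0; i < 256; ++i)
    src[i] = static_cast<uint8_t>(i + 1);
  for (size_t count = 0; count <= 32; ++count) {
    std::memset(dst, 0, sizeof(dst));
    std::memset(ref, 0, sizeof(ref));
    sve_copy_le_32(dst, src, count);
    std::memcpy(ref, src, count);
    // Bytes beyond count must remain zero: the predicated stores may not
    // touch them.
    assert(std::memcmp(dst, ref, sizeof(dst)) == 0);
  }
  return 0;
}
```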
Adding @gchatelet as the expert on memory optimizations.
// SVE register is at least 16 bytes, we can use it to avoid branching in
// small cases. Here we use 2 SVE registers to always cover cases where count
// <= 32.
if (count <= 32)
Not sure how common the zero-size case is, but this introduces four memory-subsystem accesses for it. I know they won't fault, but I believe something does go out and do some work outside the processor. I would need to review how this affects cache behavior. Programmatically it should be (and is) a no-op, and obviously won't affect the abstract machine's state, but I'm less confident about the memory subsystem.
Not sure it is worth it to change.
@Sterling-Augustine However, since much longer vectors are possible, do you think it may be useful to make this threshold scale with the vector length?
I'm in favor. Makes for a free performance bonus when new hardware appears.

I'm just seeing this patch, please give me one more day so I can review it properly 🙏
SVE vectors are at least 128-bit wide (16B). In case the CPU implementation offers wider vectors, the code becomes suboptimal, as it presumably goes through two (or more) predicated loads and stores, which would extend the access to 64B (or more). As @Sterling-Augustine mentioned, it seems wasteful to "load"/"store" 32B if we copy only a few bytes. To me it seems that SVE becomes increasingly interesting for large sizes, not for small ones.

Now an interesting experiment would be to fully write memcpy with SVE:

```
#include <arm_sve.h>
#include <cstddef>
#include <cstdint>
void memcpy(void* dest, const void* src, size_t n) {
  uint8_t* dest_b = (uint8_t*)dest;
  const uint8_t* src_b = (const uint8_t*)src;
  size_t i = 0;
  const uint64_t vl = svcntb();
  svbool_t p;
  do {
    p = svwhilelt_b8(i, n);
    const svuint8_t vec_data = svld1_u8(p, src_b + i);
    svst1_u8(p, dest_b + i, vec_data);
    i += vl;
  } while (svptest_first(svptrue_b8(), p));
}
```

which compiles to:

```
memcpy(void*, void const*, unsigned long):
        mov     x8, xzr
.LBB0_1:
        whilelo p0.b, x8, x2
        ld1b    { z0.b }, p0/z, [x1, x8]
        st1b    { z0.b }, p0, [x0, x8]
        incb    x8
        b.mi    .LBB0_1
        ret
```

My gut feeling is that it would still be beneficial to handle small sizes (up to 16B) through GPRs as it is done today and only then to switch to SVE. This way we have the best of both worlds.

I think this deserves a few experiments on different ARM platforms (Neoverse N1 and N2) before landing anyway. I've seen implementations that perform great on microbenchmarks but end up being neutral or negative in production, especially on different micro-architectures.
For the SVE long loop, according to my observation in #167624, we still need to parallelize the loop with two or more whilelo streams, each incremented by vlen, to get optimal performance in the microbench. Would it be hard to find a sweet spot for the amount of parallelism? For now, it seems that 16B vectors favor more explicit parallelism (maybe 4 streams). However, I would expect longer vectors to impose longer latency, and hence the number would change.
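A hedged sketch of what such a two-way parallel loop could look like (an illustration of the idea above, not the actual code from #167624; the function name is made up):

```
// Illustrative only: two independent whilelt/ld1/st1 streams per iteration,
// each offset by one vector length, to expose more parallelism than a single
// predicated stream.
#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

void sve_memcpy_2x(uint8_t *dst, const uint8_t *src, size_t n) {
  const uint64_t vl = svcntb(); // vector length in bytes
  size_t i = 0;
  while (i < n) {
    const svbool_t p0 = svwhilelt_b8_u64(i, n);      // lanes [i, n)
    const svbool_t p1 = svwhilelt_b8_u64(i + vl, n); // lanes [i + vl, n)
    const svuint8_t v0 = svld1_u8(p0, src + i);
    const svuint8_t v1 = svld1_u8(p1, src + i + vl);
    svst1_u8(p0, dst + i, v0);
    svst1_u8(p1, dst + i + vl, v1);
    i += 2 * vl;
  }
}
```

Extending this to four streams follows the same pattern, at the cost of more predicate registers and a larger predicated tail.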
I agree, I think a better scheme is to keep the if-cascade for small sizes. Another track to explore is the performance of MOPS for architectures that support it (I believe ARMv8.8).
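For the MOPS track, a speculative sketch (purely an assumption about how it could be wired up, not an existing libc path; the cpyfp/cpyfm/cpyfe sequence needs an Armv8.8-A toolchain and a CPU with FEAT_MOPS):

```
#include <cstddef>

// Hypothetical illustration; compile with a toolchain that accepts
// -march=armv8.8-a (or +mops) and run only on hardware with FEAT_MOPS.
void memcpy_mops(void *dst, const void *src, size_t count) {
  // CPYFP/CPYFM/CPYFE form the architected forward-copy sequence
  // (prologue/main/epilogue); the CPU chooses the copy granule, so the same
  // three instructions cover every size.
  asm volatile("cpyfp [%0]!, [%1]!, %2!\n\t"
               "cpyfm [%0]!, [%1]!, %2!\n\t"
               "cpyfe [%0]!, [%1]!, %2!"
               : "+r"(dst), "+r"(src), "+r"(count)
               :
               : "memory");
}
```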
There are internal discussions going on about mem-functions for aarch64; can we postpone this patch until we reach a conclusion?

Sure
@gchatelet Hi, I wonder if there is any update on this?
Yes, the patch seems promising but we need to run more experiments internally.
I'll report back ASAP.

Sounds interesting. Looking forward to hearing more information on this.

Another ping on this~
We finally have good evidence that the patch is positive in production. Something along these lines:

```
if (count == 0) return;
#ifdef LIBC_TARGET_CPU_HAS_SVE
auto src_ptr = reinterpret_cast<const uint8_t*>(src);
auto dst_ptr = reinterpret_cast<uint8_t*>(dst);
if (count <= 16) {
  const svbool_t mask = svwhilelt_b8_u64(0, count);
  svst1_u8(mask, dst_ptr, svld1_u8(mask, src_ptr));
  return;
}
if (count <= 32) {
  const size_t vlen = svcntb();
  svbool_t m0 = svwhilelt_b8_u64(0, count);
  svbool_t m1 = svwhilelt_b8_u64(vlen, count);
  svst1_u8(m0, dst_ptr, svld1_u8(m0, src_ptr));
  svst1_u8(m1, dst_ptr + vlen, svld1_u8(m1, src_ptr + vlen));
  return;
}
#else
if (count == 1) return builtin::Memcpy<1>::block(dst, src);
...
```

I'm still a bit bothered by how this patch will age when
Co-authored-by: Guillaume Chatelet <chatelet.guillaume@gmail.com>
Branch updated from a383cc9 to fd51618
gchatelet left a comment:
Thx for bearing with me @SchrodingerZhu 🙏
Add SVE optimization for AArch64 architectures. The idea is to use predicate registers to avoid branching.

Microbench in repo shows considerable improvements on NV GB10 (locked on largest X925):

```
======================================================================
BENCHMARK STATISTICS (time in nanoseconds)
======================================================================
memcpy_Google_A:
  Old - Mean: 3.1257 ns, Median: 3.1162 ns
  New - Mean: 2.8402 ns, Median: 2.8265 ns
  Improvement: +9.14% (mean), +9.30% (median)

memcpy_Google_B:
  Old - Mean: 2.3171 ns, Median: 2.3159 ns
  New - Mean: 1.6589 ns, Median: 1.6593 ns
  Improvement: +28.40% (mean), +28.35% (median)

memcpy_Google_D:
  Old - Mean: 8.7602 ns, Median: 8.7645 ns
  New - Mean: 8.4307 ns, Median: 8.4308 ns
  Improvement: +3.76% (mean), +3.81% (median)

memcpy_Google_L:
  Old - Mean: 1.7137 ns, Median: 1.7091 ns
  New - Mean: 1.4530 ns, Median: 1.4553 ns
  Improvement: +15.22% (mean), +14.85% (median)

memcpy_Google_M:
  Old - Mean: 1.9823 ns, Median: 1.9825 ns
  New - Mean: 1.4826 ns, Median: 1.4840 ns
  Improvement: +25.20% (mean), +25.15% (median)

memcpy_Google_Q:
  Old - Mean: 1.6812 ns, Median: 1.6784 ns
  New - Mean: 1.1538 ns, Median: 1.1517 ns
  Improvement: +31.37% (mean), +31.38% (median)

memcpy_Google_S:
  Old - Mean: 2.1816 ns, Median: 2.1786 ns
  New - Mean: 1.6297 ns, Median: 1.6287 ns
  Improvement: +25.29% (mean), +25.24% (median)

memcpy_Google_U:
  Old - Mean: 2.2851 ns, Median: 2.2825 ns
  New - Mean: 1.7219 ns, Median: 1.7187 ns
  Improvement: +24.65% (mean), +24.70% (median)

memcpy_Google_W:
  Old - Mean: 2.0408 ns, Median: 2.0361 ns
  New - Mean: 1.5260 ns, Median: 1.5252 ns
  Improvement: +25.23% (mean), +25.09% (median)

uniform_384_to_4096:
  Old - Mean: 26.9067 ns, Median: 26.8845 ns
  New - Mean: 26.8083 ns, Median: 26.8149 ns
  Improvement: +0.37% (mean), +0.26% (median)
```

The beginning of the memcpy function looks like the following:

```
Dump of assembler code for function _ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm:
   0x0000000000001340 <+0>:  cbz  x2, 0x143c <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+252>
   0x0000000000001344 <+4>:  cbz  x0, 0x1440 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+256>
   0x0000000000001348 <+8>:  cbz  x1, 0x1444 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+260>
   0x000000000000134c <+12>: subs x8, x2, #0x20
   0x0000000000001350 <+16>: b.hi 0x1374 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+52>  // b.pmore
   0x0000000000001354 <+20>: rdvl x8, #1
   0x0000000000001358 <+24>: whilelo p0.b, xzr, x2
   0x000000000000135c <+28>: ld1b {z0.b}, p0/z, [x1]
   0x0000000000001360 <+32>: whilelo p1.b, x8, x2
   0x0000000000001364 <+36>: ld1b {z1.b}, p1/z, [x1, #1, mul vl]
   0x0000000000001368 <+40>: st1b {z0.b}, p0, [x0]
   0x000000000000136c <+44>: st1b {z1.b}, p1, [x0, #1, mul vl]
   0x0000000000001370 <+48>: ret
```