Skip to content

Releases: ashvardanian/SimSIMD

New `bf16` & faster `i8` kernels ❀️ AMD Genoa

20 Jun 03:56
Compare
Choose a tag to compare

New bf16 kernels

The "brain-float-16" is a popular machine learning format. It's broadly supported in hardware and is very machine-friendly, but software support is still lagging behind - numpy/numpy#19808. Most importantly, low-precision bf16 dot-products are supported by the most recent Zen4-based AMD Genoa CPUs. Those have up-to 96 cores, and just one of those cores is capable of computing 86 GB/s worth of such dot-products.

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
dot_bf16_haswell_1536d/min_time:10.000/threads:1         203 ns          203 ns     68785823 abs_delta=29.879n bytes=30.1978G/s pairs=4.91501M/s relative_error=39.8289n
dot_bf16_haswell_1536b/min_time:10.000/threads:1        93.0 ns         93.0 ns    150582910 abs_delta=24.8365n bytes=33.0344G/s pairs=10.7534M/s relative_error=33.1108n
dot_bf16_genoa_1536d/min_time:10.000/threads:1          71.0 ns         71.0 ns    197340105 abs_delta=23.6042n bytes=86.5917G/s pairs=14.0937M/s relative_error=31.4977n
dot_bf16_genoa_1536b/min_time:10.000/threads:1          36.1 ns         36.1 ns    387637713 abs_delta=22.3063n bytes=85.0019G/s pairs=27.6699M/s relative_error=29.7341n
dot_bf16_serial_1536d/min_time:10.000/threads:1        15992 ns        15991 ns       874491 abs_delta=311.896n bytes=384.216M/s pairs=62.5352k/s relative_error=415.887n
dot_bf16_serial_1536b/min_time:10.000/threads:1         7979 ns         7978 ns      1754703 abs_delta=193.719n bytes=385.045M/s pairs=125.34k/s relative_error=258.429n
dot_bf16c_serial_1536d/min_time:10.000/threads:1       16430 ns        16429 ns       852438 abs_delta=251.692n bytes=373.964M/s pairs=60.8665k/s relative_error=336.065n
dot_bf16c_serial_1536b/min_time:10.000/threads:1        8207 ns         8202 ns      1707289 abs_delta=165.209n bytes=374.54M/s pairs=121.92k/s relative_error=220.35n
vdot_bf16c_serial_1536d/min_time:10.000/threads:1      16489 ns        16488 ns       849194 abs_delta=247.646n bytes=372.639M/s pairs=60.6509k/s relative_error=330.485n
vdot_bf16c_serial_1536b/min_time:10.000/threads:1       8224 ns         8217 ns      1704397 abs_delta=162.036n bytes=373.839M/s pairs=121.693k/s relative_error=216.042n

That's a steep 3x improvement over single-precision FMA throughput we can obtain by simply shifting bf16 left by 16 bits and using _mm256_fmadd_ps intrinsic / vfmadd instruction available since Intel Haswell.

Faster i8 kernels

We can't directly use _mm512_dpbusd_epi32 every time we want to compute a low-precision integer dot-product, as it's asymmetric with respect to the sign of the input arguments:

Signed(ZeroExtend16(a.byte[4j]) * SignExtend16(b.byte[4j]))

In the past we would just upcast to 16-bit integers and resort to _mm512_dpwssds_epi32. It is a much more costly multiplication circuit, and, assuming that I avoid loop unrolling, also implies 2x fewer scalars per loop. But for cosine distances there is something simple we can do. Assuming that we multiply the vector by itself, even if a certain vector component is negative, its square will always be positive. So we can avoid the expensive 16-bit operation at least where we compute the vector norms:

    a_abs_vec = _mm512_abs_epi8(a_vec);
    b_abs_vec = _mm512_abs_epi8(b_vec);
    a2_i32s_vec = _mm512_dpbusds_epi32(a2_i32s_vec, a_abs_vec, a_abs_vec);
    b2_i32s_vec = _mm512_dpbusds_epi32(b2_i32s_vec, b_abs_vec, b_abs_vec);

On Intel Sapphire Rapids it resulted in a higher single-thread utilization, but didn't lead to improvements on other platforms.

---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
cos_i8_haswell_1536d/min_time:10.000/threads:1       92.4 ns         92.4 ns    151487077 abs_delta=105.739u bytes=33.2344G/s pairs=10.8185M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1       92.4 ns         92.4 ns    151478714 abs_delta=0 bytes=33.2383G/s pairs=10.8198M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1           61.6 ns         61.6 ns    227445214 abs_delta=0 bytes=49.898G/s pairs=16.2428M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1           61.5 ns         61.5 ns    227609621 abs_delta=0 bytes=49.9167G/s pairs=16.2489M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1         299 ns          299 ns     46788061 abs_delta=0 bytes=10.2666G/s pairs=3.34198M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1         299 ns          299 ns     46787275 abs_delta=0 bytes=10.2663G/s pairs=3.34191M/s relative_error=0

New timings:

---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
cos_i8_haswell_1536d/min_time:10.000/threads:1       92.4 ns         92.4 ns    151463294 abs_delta=105.739u bytes=33.2359G/s pairs=10.819M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1       92.4 ns         92.4 ns    151470519 abs_delta=0 bytes=33.2392G/s pairs=10.82M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1           48.1 ns         48.1 ns    292087642 abs_delta=0 bytes=63.8408G/s pairs=20.7815M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1           48.2 ns         48.2 ns    291716009 abs_delta=0 bytes=63.7662G/s pairs=20.7572M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1         299 ns          299 ns     46784120 abs_delta=0 bytes=10.2647G/s pairs=3.34139M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1         299 ns          299 ns     46781350 abs_delta=0 bytes=10.2654G/s pairs=3.3416M/s relative_error=0

v4.3.1

10 Apr 18:59
Compare
Choose a tag to compare

4.3.1 (2024-04-10)

Make

v4.3.0

08 Apr 18:11
Compare
Choose a tag to compare

4.3.0 (2024-04-08)

Add

  • toBinary for JavaScript (1f1fd3a)

Improve

v4.2.2

31 Mar 21:51
Compare
Choose a tag to compare

4.2.2 (2024-03-31)

Fix

  • f16 casting on Arm (dda096b)
  • Dot-products compilation on SVE (f5fe36d)

Make

v4.2.1

26 Mar 05:56
Compare
Choose a tag to compare

4.2.1 (2024-03-26)

Fix

Make

  • Avoid native f16 by default (bd02af2)
  • Shorter Rust compilation (db9a9d2)

v4.2.0

25 Mar 05:52
Compare
Choose a tag to compare

4.2.0 (2024-03-25)

Add

Fix

Make

v4.1.1

24 Mar 01:25
Compare
Choose a tag to compare

4.1.1 (2024-03-24)

Fix

Make

  • Unused function attributes (c85e098)

SIMD on Windows πŸ€— πŸͺŸ

23 Mar 19:32
Compare
Choose a tag to compare

This release refactors compiler attributes and intrinsics usage to make it compatible with MSVC. Most noticeably, function defined like this:

__attribute__((target("+simd")))
inline static void simsimd_cos_f32_neon(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) { }

Now look like this:

#pragma GCC push_options
#pragma GCC target("+simd")
#pragma clang attribute push(__attribute__((target("+simd"))), apply_to = function)

inline static void simsimd_cos_f32_neon(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) { }

#pragma clang attribute pop
#pragma GCC pop_options

Thanks to that SimSIMD on Windows is gonna be just as fast as on Linux and MacOS πŸ€— πŸͺŸ


4.1.0 vs 4.0.0 Change Log (2024-03-23)

Add

  • Bench against TF, PyTorch, JAX (5e64152)
  • Complex dot-products for Rust (936fe73)
  • Double-precision interfaces for JS (07f2aca)

Docs

Fix

  • Avoid vld2_f16 on MSVC (ce9800e)
  • Complex dispatch in Rust & C (d349bcd)
  • Missing _mm_rsqrt14_ps in MSVC (21e30fe)
  • Missing float16_t in MSVC Arm64 builds (94442c3)

Improve

Make

v4.0.0

13 Mar 19:08
Compare
Choose a tag to compare

This is a packed redesign! Let's start with what's cool about it and later cover the mechanics.

  1. Extends dot products covering the entire matrix:
    • all IEEE 754 floating-point formats (f16, f32, f64)
    • real, complex, complex-conjugate dot-products
    • Arm NEON & SVE, x86 Haswell, Skylake, Ice Lake, Sapphire Rapids
  2. Add support for complex32 Python type ... that:

SimSIMD is now the fastest and most popular library for computing half-precision products/similarities for Fourier Series and other complex data πŸ₯³


What breaks:

  • Return types are now 64-bit floats, up from 32.
  • Inner products are now defined as AB, instead of 1 - AB for broader applicability.

v3.9.0

04 Mar 01:59
Compare
Choose a tag to compare

3.9.0 (2024-03-04)

Add

  • Complex numbers support (0a0665a)
  • Hamming & Jaccard for pre-AVX512 CPUs (4f1eba1), closes #69
  • Rust binary distances (960af05), closes #84

Fix

Improve

Make