Releases: ashvardanian/SimSIMD

v3.5.0

3.5.0 (2023-10-31)

Fix

  • Avoid ifnan compilation issues for GBench (fe3286f)
  • Normalize vectors for JS tests (7f9c6df)
  • SciPy JS uses square root (d8b9762)

Improve

  • Parameterize epsilon for different types (2a49d9e)
  • Randomize NumPy seed on every run (a49b866)
  • Same epsilon for JS/KL backends (a70479f)
  • Use GlibC in CPython bindings (65fe343)

Test

  • Compare our f16 to SciPy f64 (dd655c1)

v3.4.0

3.4.0 (2023-10-31)

v3.3.0

3.3.0 (2023-10-27)

Fix

  • AVX2 int8 angular distance (143aa34)
  • Use rtol for L2sq and atol for others (89a61b3)

Improve

  • Use goto to avoid extra conditionals (ef71253)
  • Run benchmarks on 1 thread (2e1a714)
  • Use BMI2 and AVX-512VNNI for masks & fma (161eee9)

Test

  • Distances for int8 arrays (c1c06ba)
  • Normalize bitwise distances (3edfe0b)

Beating GCC 12 - 118x Speedup for Jensen Shannon Divergence via AVX-512FP16

23 Oct 23:48

Divergence functions are a bit more complex than Cosine Similarity, primarily because they have to compute logarithms, which are relatively slow with LibC's logf.

So, aside from minor patches, this PR rewrites the Jensen Shannon distances with several optimizations, mainly targeting the AVX-512 and AVX-512FP16 extensions. The result is a 4.6x improvement over the auto-vectorized single-precision variant and a whopping 118x improvement over the half-precision code produced by GCC 12.

Optimizations

  • Logarithm Computation. Instead of multiple bitwise operations, _mm512_getexp_ph and _mm512_getmant_ph are now used to extract the exponent and the mantissa of the floating-point number, streamlining the process. I've also used Horner's method for the polynomial approximation.
  • Division Avoidance. To avoid expensive division operations, reciprocal approximations are utilized - _mm512_rcp_ph for half-precision and _mm512_rcp14_ps for single-precision. The _mm512_rcp28_ps was found to be unnecessary for this implementation.
  • Handling Zeros. The _mm512_cmp_ph_mask is used to compute a mask for close-to-zero values, avoiding the addition of an "epsilon" to every component, which is both cleaner and more accurate.
  • Parallel Accumulation. The accumulations of $KL(P || M)$ and $KL(Q || M)$ are now handled in separate registers, and the merge-masked _mm512_mask3_fmadd_ph replaces distinct addition and multiplication operations while preserving the accumulator in masked-out lanes, optimizing the calculation further.

Implementation

As a reminder, the Jensen Shannon divergence is the symmetrized version of the Kullback-Leibler divergence:

$$JSD(P, Q) = \frac{1}{2} D(P || M) + \frac{1}{2} D(Q || M)$$

$$M = \frac{1}{2}(P + Q), \qquad D(P || Q) = \sum_i P(i) \cdot \log \left( \frac{P(i)}{Q(i)} \right)$$
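
In scalar code, that is just a few lines. Here is a minimal C reference sketch (a hypothetical serial_js helper, not SimSIMD's dispatched serial kernel) that the vectorized variant below has to match, skipping near-zero components the same way the masked AVX-512 code does:

#include <math.h>
#include <stddef.h>

float serial_js(float const* p, float const* q, size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; ++i) {
        float m = 0.5f * (p[i] + q[i]);
        // Skip near-zero components to avoid division by zero and log of zero,
        // mirroring the masking in the vectorized kernel below.
        if (p[i] > 1e-6f && q[i] > 1e-6f)
            sum += p[i] * logf(p[i] / m) + q[i] * logf(q[i] / m);
    }
    return 0.5f * sum;
}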

For AVX-512FP16, the current implementation looks like this:

__attribute__((target("avx512f,avx512vl,avx512fp16")))
inline __m512h simsimd_avx512_f16_log2(__m512h x) {
    // Extract the exponent and mantissa
    __m512h one = _mm512_set1_ph((_Float16)1);
    __m512h e = _mm512_getexp_ph(x);
    __m512h m = _mm512_getmant_ph(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);

    // Compute the polynomial using Horner's method
    __m512h p = _mm512_set1_ph((_Float16)-3.4436006e-2f);
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)3.1821337e-1f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)-1.2315303f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)2.5988452f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)-3.3241990f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)3.1157899f));

    return _mm512_add_ph(_mm512_mul_ph(p, _mm512_sub_ph(m, one)), e);
}

__attribute__((target("avx512f,avx512vl,avx512fp16")))
inline static simsimd_f32_t simsimd_avx512_f16_js(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t n) {
    __m512h sum_a_vec = _mm512_set1_ph((_Float16)0);
    __m512h sum_b_vec = _mm512_set1_ph((_Float16)0);
    __m512h epsilon_vec = _mm512_set1_ph((_Float16)1e-6f);
    for (simsimd_size_t i = 0; i < n; i += 32) {
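        // Build a tail mask: all 32 lanes for full chunks, otherwise just the first
        // `n - i` lanes, so the final partial load can't read past the inputs.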
        __mmask32 mask = n - i >= 32 ? 0xFFFFFFFF : ((1u << (n - i)) - 1u);
        __m512h a_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, a + i));
        __m512h b_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, b + i));
        __m512h m_vec = _mm512_mul_ph(_mm512_add_ph(a_vec, b_vec), _mm512_set1_ph((_Float16)0.5f));

        // Avoid division-by-zero and log-of-zero issues from near-zero probabilities down the road.
        // Masking is a nicer way to do this than adding an `epsilon` to every component.
        __mmask32 nonzero_mask_a = _mm512_cmp_ph_mask(a_vec, epsilon_vec, _CMP_GE_OQ);
        __mmask32 nonzero_mask_b = _mm512_cmp_ph_mask(b_vec, epsilon_vec, _CMP_GE_OQ);
        __mmask32 nonzero_mask = nonzero_mask_a & nonzero_mask_b & mask;

        // Division is an expensive operation. Instead of doing it twice,
        // we can approximate the reciprocal of `m` and multiply instead.
        __m512h m_recip_approx = _mm512_rcp_ph(m_vec);
        __m512h ratio_a_vec = _mm512_mul_ph(a_vec, m_recip_approx);
        __m512h ratio_b_vec = _mm512_mul_ph(b_vec, m_recip_approx);

        // The natural logarithm equals `log2` multiplied by `ln(2)`, so we can
        // accumulate in base 2 and convert once at the end.
        __m512h log_ratio_a_vec = simsimd_avx512_f16_log2(ratio_a_vec);
        __m512h log_ratio_b_vec = simsimd_avx512_f16_log2(ratio_b_vec);

        // Instead of separate multiplication and addition, invoke the FMA.
        // The merge-masking `_mm512_mask3_fmadd_ph` keeps the accumulator intact in
        // masked-out lanes; a zero-masking FMA would wipe the sums gathered there.
        sum_a_vec = _mm512_mask3_fmadd_ph(a_vec, log_ratio_a_vec, sum_a_vec, nonzero_mask);
        sum_b_vec = _mm512_mask3_fmadd_ph(b_vec, log_ratio_b_vec, sum_b_vec, nonzero_mask);
    }
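    // Convert the base-2 sums to natural log and apply the 1/2 factor from the JSD definition.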
    simsimd_f32_t log2_normalizer = 0.693147181f;
    return _mm512_reduce_add_ph(_mm512_add_ph(sum_a_vec, sum_b_vec)) * 0.5f * log2_normalizer;
}
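
To experiment with the kernel above in isolation, a minimal driver like the following will do. The simsimd_* typedefs here are assumptions mirroring the library's headers, and the two test distributions are made up:

#include <immintrin.h>
#include <stdio.h>

// Assumed aliases, mirroring the `simsimd_*` types from the library's headers.
typedef _Float16 simsimd_f16_t;
typedef float simsimd_f32_t;
typedef unsigned long long simsimd_size_t;

// ... paste `simsimd_avx512_f16_log2` and `simsimd_avx512_f16_js` from above here ...

int main(void) {
    enum { n = 1536 };
    static simsimd_f16_t p[n], q[n];
    // Two made-up normalized histograms: `p` is uniform,
    // `q` shifts some mass to the second half of the bins.
    for (int i = 0; i < n; ++i) {
        p[i] = (_Float16)(1.0f / n);
        q[i] = (_Float16)((i < n / 2 ? 0.8f : 1.2f) / n);
    }
    printf("JSD(p, q) ~= %f\n", simsimd_avx512_f16_js(p, q, n));
    return 0;
}

Compile with GCC 12+ or a recent Clang on an AVX-512FP16-capable machine, e.g. gcc -O3 -march=sapphirerapids driver.c -lm.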

Benchmarks

I conducted benchmarks at both the higher-level Python and lower-level C++ layers, comparing the auto-vectorization on GCC 12 to our new implementation on an Intel Sapphire Rapids CPU on AWS.

The program was compiled with -O3 and -ffast-math and was running on all cores of the 4-core instance, potentially favoring the non-vectorized solution. When normalized and tabulated, the results are as follows:

Benchmark            | Pairs/s   | Gigabytes/s | Absolute Error | Relative Error
serial_f32_js_1536d  | 0.243 M/s | 2.98 G/s    | 0              | 0
serial_f16_js_1536d  | 0.018 M/s | 0.11 G/s    | 0.123          | 0.035
avx512_f32_js_1536d  | 1.127 M/s | 13.84 G/s   | 0.001          | 345u
avx512_f16_js_1536d  | 2.139 M/s | 13.14 G/s   | 0.070          | 0.020
avx2_f16_js_1536d    | 0.547 M/s | 3.36 G/s    | 0.011          | 0.003

Of course, the results will vary depending on the vector size. I generally use 1536 dimensions, matching the size of OpenAI Ada embeddings, standard in NLP workloads. The Jensen Shannon divergence, however, is used broadly in statistics, bioinformatics, and cheminformatics, so I'm adding it as a new out-of-the-box supported metric in USearch today 🥳

This further accelerates approximate k-Nearest Neighbors search and the clustering of billions of protein sequences without alignment procedures. Expect one more "Less Slow" post soon! 🤗

v3.1.1

3.1.1 (2023-10-22)

Fix

  • Prefer AVX512 to AVX2 and SVE to NEON (2bb2c57)

v3.0.0

3.0.0 (2023-10-21)

Fix

  • Dispatching serial implementation (4026fbf)

Improve

  • Kullback-Leibler benchmarks (cde40cb)

Revert

  • asm/hwcap.h for Arm caps on Linux (7b2603c)

v2.4.0

2.4.0 (2023-10-17)

Fix

  • Check _Float16 availability on x86 (93aab8d)
  • Disable f16 for NodeJS (1587375)

v2.3.8

2.3.8 (2023-10-10)

Improve

  • Use fast calling convention (8e736a9)

v2.3.7

2.3.7 (2023-10-09)

Fix

  • Simplify control flow (ceea3c8)
  • Type-conversion in default distance functions (af45f87)

v2.3.6

2.3.6 (2023-10-09)

Fix