Option to enable SSE2 optimization #4667

bassmang · 2023-11-17T16:33:49Z

1.f / std::sqrt(x) has become the default for inverse square root in gd.
Tracking issue to add flag to enable sse2 optimization here: #4666

Min repro:
#include <x86intrin.h>

#include
#include
#include
#include

inline float inv_sqrt(const float x) { return 1.f / std::sqrt(x); }

inline float rsqrt(const float f) {
__m128 temp = _mm_set_ss(f);
temp = _mm_rsqrt_ss(temp);
return _mm_cvtss_f32(temp);
}

int main(int argc, char* argv[]) {
constexpr size_t num_repeats = 100000000;
volatile float in = 4.f;
volatile float out;

{
const auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < num_repeats; ++i) {
out = inv_sqrt(in);
}
const auto stop = std::chrono::high_resolution_clock::now();

const auto diff =
    std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "inv_sqrt: " << diff.count() << " us\n";
std::cout << out << std::endl;

}

{
const auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < num_repeats; ++i) {
out = rsqrt(in);
}
const auto stop = std::chrono::high_resolution_clock::now();

const auto diff =
    std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "rsqrt: " << diff.count() << " us\n";
std::cout << out << std::endl;

}

return EXIT_SUCCESS;
}

on my laptop, the results are:
zwd@zwd-msft:$ g++ rsqrt.cc -O2
zwd@zwd-msft:$ ./a.out
inv_sqrt: 209601 us
0.5
rsqrt: 34340 us
0.499878

Google benchmarks:
https://github.com/VowpalWabbit/vowpal_wabbit/actions/runs/6894788120/job/18757367848

The text was updated successfully, but these errors were encountered:

zwd-ms · 2023-11-27T15:21:38Z

google benchmark should also show similar results, but it needs to prevent optimizations carefully.

however, in general we could just rely on compiler auto-vectorization to optimize this. Using -O3 -ffast-math -march=native can get much closer to the hand-written intrinsic performance.

#include <benchmark/benchmark.h>

static void BM_inv_sqrt(benchmark::State& state)
{
  volatile float x = 4.f;

  for (auto _ : state)
  {
    benchmark::DoNotOptimize(inv_sqrt(x));
    benchmark::ClobberMemory();
  }
}

static void BM_rsqrt(benchmark::State& state)
{
  volatile float x = 4.f;

  for (auto _ : state)
  {
    benchmark::DoNotOptimize(rsqrt(x));
    benchmark::ClobberMemory();
  }
}

BENCHMARK(BM_inv_sqrt);
BENCHMARK(BM_rsqrt);

BENCHMARK_MAIN();

zwd-ms · 2023-11-30T18:18:00Z

from GCC doc: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options

-mrecip

This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -ffinite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).

Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.

Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.

bassmang added the Feature Request New feature requested in system label Nov 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Option to enable SSE2 optimization #4667

Option to enable SSE2 optimization #4667

bassmang commented Nov 17, 2023 •

edited

Loading

zwd-ms commented Nov 27, 2023 •

edited

Loading

zwd-ms commented Nov 30, 2023

Option to enable SSE2 optimization #4667

Option to enable SSE2 optimization #4667

Comments

bassmang commented Nov 17, 2023 • edited Loading

zwd-ms commented Nov 27, 2023 • edited Loading

zwd-ms commented Nov 30, 2023

bassmang commented Nov 17, 2023 •

edited

Loading

zwd-ms commented Nov 27, 2023 •

edited

Loading