Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to enable SSE2 optimization #4667

Open
bassmang opened this issue Nov 17, 2023 · 2 comments
Open

Option to enable SSE2 optimization #4667

bassmang opened this issue Nov 17, 2023 · 2 comments
Labels
Feature Request New feature requested in system

Comments

@bassmang
Copy link
Member

bassmang commented Nov 17, 2023

1.f / std::sqrt(x) has become the default for inverse square root in gd.
Tracking issue to add flag to enable sse2 optimization here: #4666

Min repro:
#include <x86intrin.h>

#include
#include
#include
#include

inline float inv_sqrt(const float x) { return 1.f / std::sqrt(x); }

inline float rsqrt(const float f) {
__m128 temp = _mm_set_ss(f);
temp = _mm_rsqrt_ss(temp);
return _mm_cvtss_f32(temp);
}

int main(int argc, char* argv[]) {
constexpr size_t num_repeats = 100000000;
volatile float in = 4.f;
volatile float out;

{
const auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < num_repeats; ++i) {
out = inv_sqrt(in);
}
const auto stop = std::chrono::high_resolution_clock::now();

const auto diff =
    std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "inv_sqrt: " << diff.count() << " us\n";
std::cout << out << std::endl;

}

{
const auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < num_repeats; ++i) {
out = rsqrt(in);
}
const auto stop = std::chrono::high_resolution_clock::now();

const auto diff =
    std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "rsqrt: " << diff.count() << " us\n";
std::cout << out << std::endl;

}

return EXIT_SUCCESS;
}

on my laptop, the results are:
zwd@zwd-msft:$ g++ rsqrt.cc -O2
zwd@zwd-msft:
$ ./a.out
inv_sqrt: 209601 us
0.5
rsqrt: 34340 us
0.499878

Google benchmarks:
https://github.com/VowpalWabbit/vowpal_wabbit/actions/runs/6894788120/job/18757367848

@bassmang bassmang added the Feature Request New feature requested in system label Nov 17, 2023
@zwd-ms
Copy link
Collaborator

zwd-ms commented Nov 27, 2023

google benchmark should also show similar results, but it needs to prevent optimizations carefully.

however, in general we could just rely on compiler auto-vectorization to optimize this. Using -O3 -ffast-math -march=native can get much closer to the hand-written intrinsic performance.

#include <benchmark/benchmark.h>

static void BM_inv_sqrt(benchmark::State& state)
{
  volatile float x = 4.f;

  for (auto _ : state)
  {
    benchmark::DoNotOptimize(inv_sqrt(x));
    benchmark::ClobberMemory();
  }
}

static void BM_rsqrt(benchmark::State& state)
{
  volatile float x = 4.f;

  for (auto _ : state)
  {
    benchmark::DoNotOptimize(rsqrt(x));
    benchmark::ClobberMemory();
  }
}

BENCHMARK(BM_inv_sqrt);
BENCHMARK(BM_rsqrt);

BENCHMARK_MAIN();

@zwd-ms
Copy link
Collaborator

zwd-ms commented Nov 30, 2023

from GCC doc: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options

-mrecip

This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -ffinite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).

Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.

Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Request New feature requested in system
Projects
None yet
Development

No branches or pull requests

2 participants