Google Benchmark should show similar results, but the benchmark has to be written carefully so that the compiler does not optimize the measured call away.
In general, though, we could simply rely on compiler auto-vectorization here: building with -O3 -ffast-math -march=native gets much closer to the performance of the hand-written intrinsic.
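As a rough illustration of the auto-vectorization point, the sketch below applies the plain `1.f / std::sqrt(x)` expression across an array. The helper name `inv_sqrt_array` is hypothetical and not taken from the codebase. The loop is written so that a compiler invoked with `-O3 -ffast-math -march=native` is free to vectorize it; whether it actually does, and which instructions it picks, depends on the compiler version and target.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the codebase): element-wise inverse square
// root over a contiguous array. The loop body is the same expression as
// inv_sqrt() in the repro, so the compiler decides how to lower it.
void inv_sqrt_array(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = 1.f / std::sqrt(in[i]);
  }
}
```

Inspecting the generated assembly (for example with `g++ -O3 -ffast-math -march=native -S`) is the most direct way to check what the compiler emitted for this loop.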
#include <benchmark/benchmark.h>

static void BM_inv_sqrt(benchmark::State& state)
{
volatile float x = 4.f;
for (auto _ : state)
{
benchmark::DoNotOptimize(inv_sqrt(x));
benchmark::ClobberMemory();
}
}
static void BM_rsqrt(benchmark::State& state)
{
volatile float x = 4.f;
for (auto _ : state)
{
benchmark::DoNotOptimize(rsqrt(x));
benchmark::ClobberMemory();
}
}
BENCHMARK(BM_inv_sqrt);
BENCHMARK(BM_rsqrt);
BENCHMARK_MAIN();
From the GCC documentation for -mrecip:
This option enables use of RCPSS and RSQRTSS instructions (and their vectorized variants RCPPS and RSQRTPS) with an additional Newton-Raphson step to increase precision instead of DIVSS and SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -ffinite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).
Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.
Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized sqrtf(x) already with -ffast-math (or the above option combination), and doesn’t need -mrecip.
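To make the "additional Newton-Raphson step" concrete, here is a minimal sketch of refining the RSQRTSS estimate by hand. The function names `rsqrt_estimate` and `rsqrt_nr` are made up for illustration, and this is not claimed to be the exact sequence GCC emits. The non-x86 fallback is an assumption added only so that the sketch compiles everywhere; it fakes a coarse estimate rather than using a real instruction.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
// Hardware estimate of 1/sqrt(x) via RSQRTSS (roughly 12 bits of precision).
inline float rsqrt_estimate(float x) {
  return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
#else
// Assumption for non-x86 builds only: perturb the exact value by about
// 2^-12 to stand in for a coarse hardware estimate.
inline float rsqrt_estimate(float x) {
  return (1.f / std::sqrt(x)) * (1.f + 1.f / 4096.f);
}
#endif

// One Newton-Raphson step on f(y) = 1/y^2 - x:
//   y1 = y0 * (1.5 - 0.5 * x * y0^2)
// Each step roughly doubles the number of correct bits in the estimate.
inline float rsqrt_nr(float x) {
  const float y0 = rsqrt_estimate(x);
  return y0 * (1.5f - 0.5f * x * y0 * y0);
}
```

The refinement costs a few multiplies and one subtraction on top of the estimate, which is the trade-off the quoted documentation describes: higher throughput than DIVSS plus SQRTSS, at slightly lower precision than the exact result.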
1.f / std::sqrt(x) has become the default for inverse square root in gd.
Tracking issue to add flag to enable sse2 optimization here: #4666
Min repro:
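#include <x86intrin.h>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iostream>

// Exact form: scalar divide plus square root.
inline float inv_sqrt(const float x) { return 1.f / std::sqrt(x); }

// Approximate form: the SSE RSQRTSS estimate, with no refinement step.
inline float rsqrt(const float f) {
  __m128 temp = _mm_set_ss(f);
  temp = _mm_rsqrt_ss(temp);
  return _mm_cvtss_f32(temp);
}

int main(int argc, char* argv[]) {
  constexpr size_t num_repeats = 100000000;
  // volatile forces a real load and store on every iteration, so the
  // compiler cannot hoist or eliminate the call being timed.
  volatile float in = 4.f;
  volatile float out;
  {
    const auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i) {
      out = inv_sqrt(in);
    }
    const auto stop = std::chrono::high_resolution_clock::now();
    // Print the elapsed time in microseconds, then the last result.
    std::cout << "inv_sqrt: "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us\n" << out << "\n";
  }
  {
    const auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i) {
      out = rsqrt(in);
    }
    const auto stop = std::chrono::high_resolution_clock::now();
    std::cout << "rsqrt: "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us\n" << out << "\n";
  }
  return EXIT_SUCCESS;
}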
On my laptop, the results are:
zwd@zwd-msft:
$ g++ rsqrt.cc -O2
zwd@zwd-msft:
$ ./a.out
inv_sqrt: 209601 us
0.5
rsqrt: 34340 us
0.499878
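The two printed values also show the accuracy cost of the raw estimate. The small helper below (a hypothetical name, added only for illustration) computes the relative error between an approximation and a reference value.

```cpp
#include <cmath>

// Hypothetical helper: relative error of an approximation against a
// reference value. The reference must be nonzero.
inline double relative_error(double approx, double exact) {
  return std::fabs(approx - exact) / std::fabs(exact);
}
```

Applied to the output above, `relative_error(0.499878, 0.5)` is about 2.44e-4, which is close to 2^-12. That is in line with the roughly 12 bits of precision that RSQRTSS provides without a refinement step.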
Google benchmarks:
https://github.com/VowpalWabbit/vowpal_wabbit/actions/runs/6894788120/job/18757367848