This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Double-precision conversion instructions #383

Merged 1 commit into WebAssembly:master on Feb 1, 2021

Conversation

@Maratyszcza (Contributor) commented Oct 13, 2020

Introduction

As discussed in #348, the WebAssembly SIMD specification is functionally incomplete w.r.t. double-precision computations: it doesn't cover many basic C/C++ operations, particularly conversions between the double type and other types. Currently, any code that uses these operations is de-vectorized (see examples in the emmintrin.h header in Emscripten). Double-precision computations are extensively used in scientific computing, and providing a complete set of double-precision operations is a must for bringing scientific computing applications to the Web. In the foreseeable future, WebAssembly SIMD will be the only avenue for high-performance scientific computing on the Web, as neither WebGL nor WebGPU supports double-precision computations.

This PR introduces the minimal set of conversion operations needed to fill this gap.

Applications

  • Conversions from double to int are used when storing data in quantized representation and in table-based algorithms (real-valued indices need to be converted to integer indices for table lookups).
  • Conversions from int to double are used when loading data in quantized representation and in random number generation (the raw output of an RNG is typically int, but numerical algorithms need real values); a minimal sketch follows this list.
  • Conversions between double and float are used in mixed-precision algorithms, where the less sensitive parts of the application are computed in single precision and the more sensitive parts in double precision. E.g., an iterative algorithm may first run in single precision until convergence, and then run double-precision iterations starting from the single-precision solution.
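As a minimal illustration of the int-to-double case (not taken from this PR), here is a hedged sketch of dequantizing two int32 lanes; the intrinsic name is assumed to be the spelling that later landed in wasm_simd128.h, since at the time of this proposal only compiler builtins were available:

#include <wasm_simd128.h>

/* Hedged sketch: dequantize the two low int32 lanes of q to doubles using the
   proposed f64x2.convert_low_i32x4_s, then apply a quantization scale.
   wasm_f64x2_convert_low_i32x4 is assumed to be the eventual intrinsic name. */
static v128_t dequantize_low_2(v128_t q_i32x4, double scale) {
  const v128_t as_f64 = wasm_f64x2_convert_low_i32x4(q_i32x4);
  return wasm_f64x2_mul(as_f64, wasm_f64x2_splat(scale));
}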

Use-cases in OpenCV

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F + AVX512VL instruction sets

  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to VCVTUDQ2PD xmm_y, xmm_x
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) (y is NOT x) is lowered to:
      • VXORPD xmm_y, xmm_y, xmm_y
      • VMAXPD xmm_y, xmm_x, xmm_y
      • VCVTTPD2UDQ xmm_y, xmm_y

x86/x86-64 processors with AVX instruction set

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) is lowered to VCVTDQ2PD xmm_y, xmm_x
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to:
      • VUNPCKLPS xmm_y, xmm_x, [wasm_i32x4_splat(0x43300000)]
      • VSUBPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) (y is NOT x) is lowered to:
      • VCMPEQPD xmm_y, xmm_x, xmm_x
      • VANDPD xmm_y, xmm_y, [wasm_f64x2_splat(2147483647.0)]
      • VMINPD xmm_y, xmm_x, xmm_y
      • VCVTTPD2DQ xmm_y, xmm_y
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to the sequence below (a scalar sketch of this trick follows the list):
      • VXORPD xmm_tmp, xmm_tmp, xmm_tmp
      • VMAXPD xmm_y, xmm_x, xmm_tmp
      • VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
      • VROUNDPD xmm_y, xmm_y, 0x0B
      • VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
      • VSHUFPS xmm_y, xmm_y, xmm_tmp, 0x88
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to VCVTPD2PS xmm_y, xmm_x
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to VCVTPS2PD xmm_y, xmm_x
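The i32x4.trunc_sat_f64x2_u_zero sequence above (and its SSE4.1 variant below) relies on a bias trick. A hedged per-lane scalar sketch of my reading of what it computes, not a normative description:

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hedged sketch: clamp to [0, UINT32_MAX], truncate toward zero (ROUNDPD imm
   0x0B), then add 0x1.0p+52 so the integer lands in the low 32 bits of the
   double's bit pattern, which the final SHUFPS gathers into i32 lanes. */
static uint32_t trunc_sat_u_lane(double v) {
  if (!(v > 0.0)) v = 0.0;                 /* MAXPD with 0: NaN and negatives become 0 */
  if (v > 4294967295.0) v = 4294967295.0;  /* MINPD with UINT32_MAX */
  v = trunc(v);                            /* ROUNDPD toward zero */
  v += 0x1.0p+52;                          /* integer now occupies the low mantissa bits */
  uint64_t bits;
  memcpy(&bits, &v, sizeof(bits));
  return (uint32_t) bits;                  /* SHUFPS 0x88 keeps the low 32-bit word of each lane */
}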

x86/x86-64 processors with SSE4.1 instruction set

  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
      • MOVAPD xmm_y, xmm_x
      • XORPD xmm_tmp, xmm_tmp
      • MAXPD xmm_y, xmm_tmp
      • MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]
      • ROUNDPD xmm_y, xmm_y, 0x0B
      • ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
      • SHUFPS xmm_y, xmm_tmp, 0x88

x86/x86-64 processors with SSE2 instruction set

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) is lowered to CVTDQ2PD xmm_y, xmm_x
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to:
      • MOVAPS xmm_y, xmm_x
      • UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
      • SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
      • XORPS xmm_tmp, xmm_tmp
      • CMPEQPD xmm_tmp, xmm_x
      • MOVAPS xmm_y, xmm_x
      • ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]
      • MINPD xmm_y, xmm_tmp
      • CVTTPD2DQ xmm_y, xmm_y
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to CVTPD2PS xmm_y, xmm_x
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to CVTPS2PD xmm_y, xmm_x

ARM64 processors

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) is lowered to:
      • SSHLL Vy.2D, Vx.2S, 0
      • SCVTF Vy.2D, Vy.2D
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to:
      • USHLL Vy.2D, Vx.2S, 0
      • UCVTF Vy.2D, Vy.2D
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
      • FCVTZS Vy.2D, Vx.2D
      • SQXTN Vy.2S, Vy.2D
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
      • FCVTZU Vy.2D, Vx.2D
      • UQXTN Vy.2S, Vy.2D
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to FCVTN Vy.2S, Vx.2D
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to FCVTL Vy.2D, Vx.2S

ARMv7 processors with NEON

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) (x in q0-q7) is lowered to:
      • VCVT.F64.S32 Dy_lo, Sx_ll
      • VCVT.F64.S32 Dy_hi, Sx_lh
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) (x in q0-q7) is lowered to:
      • VCVT.F64.U32 Dy_lo, Sx_ll
      • VCVT.F64.U32 Dy_hi, Sx_lh
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) (y in q0-q7) is lowered to:
      • VCVT.S32.F64 Sy_ll, Dx_lo
      • VCVT.S32.F64 Sy_lh, Dx_hi
      • VMOV.I32 Dy_hi, 0
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) (y in q0-q7) is lowered to:
      • VCVT.U32.F64 Sy_ll, Dx_lo
      • VCVT.U32.F64 Sy_lh, Dx_hi
      • VMOV.I32 Dy_hi, 0
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to:
      • VCVT.F32.F64 Sy_ll, Dx_lo
      • VCVT.F32.F64 Sy_lh, Dx_hi
      • VMOV.I32 Dy_hi, 0
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to:
      • VCVT.F64.F32 Dy_lo, Sx_ll
      • VCVT.F64.F32 Dy_hi, Sx_lh

@ngzhian (Member) commented Dec 14, 2020

What does the ARM codegen look like? Similar to ARM64, or complicated enough that we will likely do runtime calls?

@Maratyszcza (Contributor Author)

Similar to ARM64, but with additional constraints on register allocation (only the first 8 NEON registers are accessible as S FPU registers).

@ngzhian (Member) commented Dec 23, 2020

Do you have more specific pointers to code that would benefit from this? And do you have plans for benchmarking if this is prototyped?
I would also like to hear from more people about whether this is useful or not.

@Maratyszcza (Contributor Author)

These instructions are commonly used in scientific computing, but it is hard to provide code pointers because explicit vectorization in scientific computing codes is rare; they usually rely on compiler auto-vectorization of C/C++ and Fortran code. My plan for evaluating these is to implement a XorShift RNG with mapping of the output to [0.0, 1.0]-range double-precision numbers.
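For reference, a minimal sketch of that mapping (an illustration only, not the eventual benchmark); the intrinsic name is assumed to be the unsigned-conversion spelling that later landed in wasm_simd128.h:

#include <wasm_simd128.h>

/* Hedged sketch: map the two low u32 lanes of an RNG output vector to doubles
   in [0.0, 1.0) via f64x2.convert_low_i32x4_u followed by a scale by 2^-32. */
static v128_t rng_bits_to_unit_doubles(v128_t rng_u32x4) {
  const v128_t as_f64 = wasm_f64x2_convert_low_u32x4(rng_u32x4);
  return wasm_f64x2_mul(as_f64, wasm_f64x2_splat(0x1.0p-32));
}

The unsigned conversion is the natural fit here because the raw RNG bits span the full 32-bit range.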

@jan-wassenberg

@Maratyszcza

implement XorShift RNG with mapping of output to [0.0, 1.0]-range double-precision numbers.

FYI there is code for the integer portion in https://gitlab.com/wg1/jpeg-xl/-/blob/master/lib/jxl/xorshift128plus-inl.h :)

@dtig (Member) commented Jan 5, 2021

Compiler auto-vectorization performance is somewhat nebulous, especially cross-platform performance for i64x2 operations. The last time we tried this, IIRC we didn't see any benefit from auto-vectorizing. Perhaps one way to make a decision on many of these i64x2 operations is to evaluate a couple of sample scientific applications that might use these operations and see whether the generated code would actually map to them.

@Maratyszcza (or anyone else) Are there any open source applications that we can evaluate in addition to the jpeg-xl link above?

@jan-wassenberg

I'd like to express support for this. JPEG XL promotes floats to double and the Highway math library requires f64 -> i32 conversions.

@ngzhian (Member) commented Jan 7, 2021

Promote and demote map to native instructions nicely.
The others, especially f64x2->i32x4, look bad. We shouldn't encourage the use of these.

@jan-wassenberg

f64x2->i32x4 looks bad. We shouldn't encourage the use of these.

Can you help me understand your concern? Math libraries need floatN -> intM to do bitwise ops (shifting exponent into place for LoadExp). For N=64 (double), x86 only provides M=32. This works out OK for ARM as well. double->int64 would be much more expensive on x86. Do you see a better alternative?
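For context, a minimal scalar sketch of the LoadExp-style operation referred to above (my formulation with illustrative names, not Highway's actual API), showing why a lane-wise f64 -> i32 path is wanted:

#include <stdint.h>
#include <string.h>

/* Hedged sketch: scale x by 2^k by constructing the double 2^k directly from
   its biased exponent field. Assumes k+1023 stays within the normal range. */
static double load_exp(double x, int32_t k) {
  const uint64_t bits = (uint64_t)(uint32_t)(k + 1023) << 52;
  double pow2k;
  memcpy(&pow2k, &bits, sizeof(pow2k));
  return x * pow2k;
}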

@ngzhian (Member) commented Jan 7, 2021

I'm only pointing out that the codegen for i32x4.trunc_sat_f64x2_s_zero looks bad; it doesn't map well to native instructions at all.
I don't know what LoadExp is. For math libraries, would they choose to use this instruction given that it has such a poor mapping? Or would they come up with workarounds for the specific use case?
The obvious alternative is to scalarize; it would be slower, but these SIMD instructions are slow too.

@Maratyszcza (Contributor Author)

i32x4.trunc_sat_f64x2_s_zero lowers to multiple instructions on x86 because it has to simulate WAsm semantics for out-of-bounds values. Scalar instructions would have to simulate it too, one lane at a time.
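For reference, here is a per-lane scalar sketch of the semantics being simulated (a paraphrase for illustration, not the normative spec text):

#include <stdint.h>
#include <math.h>

/* Hedged sketch of i32x4.trunc_sat_f64x2_s_zero semantics for the two low
   lanes; the two high i32 lanes of the result are zero. */
static int32_t trunc_sat_s_lane(double v) {
  if (isnan(v)) return 0;                         /* NaN lanes become 0 */
  if (v <= (double) INT32_MIN) return INT32_MIN;  /* saturate below */
  if (v >= (double) INT32_MAX) return INT32_MAX;  /* saturate above */
  return (int32_t) v;                             /* otherwise truncate toward zero */
}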

tlively added a commit to tlively/binaryen that referenced this pull request Jan 19, 2021
As proposed in WebAssembly/simd#383, with opcodes
coordinated with the WIP V8 prototype.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Jan 20, 2021
As proposed in WebAssembly/simd#383, with opcodes
coordinated with the WIP V8 prototype.
tlively added a commit to llvm/llvm-project that referenced this pull request Jan 20, 2021
@ngzhian (Member) commented Jan 20, 2021

For i32x4.trunc_sat_f64x2_s_zero SSE2 lowering, I think the first instruction is wrong:

XORPS xmm_tmp, xmm_tmp
CMPEQPD xmm_tmp, xmm_x
...

This is a compare with 0; should it be

MOVAPS xmm_tmp, xmm_x
CMPEQPD xmm_tmp, xmm_tmp
...

?

@ngzhian (Member) commented Jan 20, 2021

Any suggested lowering for SSE2 i32x4.trunc_sat_f64x2_u_zero? Translate the AVX sequence to non-AVX?

@dtig (Member) commented Jan 25, 2021

Adding a preliminary vote below for the inclusion of double-precision conversion operations in the SIMD proposal. Please vote with:

👍 For including double-precision conversion operations
👎 Against including double-precision conversion operations

@penzn (Contributor) commented Jan 25, 2021

I don't think using a poor SIMD sequence instead of a poor scalar sequence is really a solution, plus we have no performance data on this. I don't think this instruction is ready to be included.

@Maratyszcza (Contributor Author)

@penzn I'm working on performance data. Expect it later today.

pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jan 25, 2021
Prototype these 6 instructions on x64:

- f64x2.convert_low_i32x4_s
- f64x2.convert_low_i32x4_u
- i32x4.trunc_sat_f64x2_s_zero
- i32x4.trunc_sat_f64x2_u_zero
- f32x4.demote_f64x2_zero
- f64x2.promote_low_f32x4

Some of these code sequences make use of special masks; we keep them in
external references.

Code sequence based on suggestions at:
WebAssembly/simd#383

Bug: v8:11265
Change-Id: Ied67d7b5b6beaaccac7c179ec13504482cb9c915
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2643562
Reviewed-by: Deepti Gandluri <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#72297}
@ngzhian added the "2021-01-29 Agenda for sync meeting 1/29/21" label on Jan 26, 2021
@Maratyszcza (Contributor Author) commented Jan 26, 2021

To evaluate this proposal I implemented SIMD and semi-SIMD (mostly SIMD, but using scalar conversions) versions of a mixed-precision dot product, where the inputs are single-precision floats but the accumulators and outputs are double-precision.

Here's the SIMD version:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    v128_t vx = wasm_v128_load(x);
    x += 4;

    v128_t vy = wasm_v128_load(y);
    y += 4;

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));
    vx = wasm_v32x4_shuffle(vx, vx, 2, 3, 0, 1);
    vy = wasm_v32x4_shuffle(vy, vy, 2, 3, 0, 1);
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

And here's the semi-SIMD version:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const v128_t vx = wasm_v128_load(x);
    x += 4;

    const v128_t vy = wasm_v128_load(y);
    y += 4;

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 0),
          (double) wasm_f32x4_extract_lane(vx, 1)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 0),
          (double) wasm_f32x4_extract_lane(vy, 1))));
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 2),
          (double) wasm_f32x4_extract_lane(vx, 3)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 2),
          (double) wasm_f32x4_extract_lane(vy, 3))));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

So far V8 implements the double-precision conversion instructions only on x86-64 (ARM64 implementation is in review), and the results on x86-64 systems are presented below:

Implementation       Performance, scalar conversions   Performance, SIMD conversions   Speedup
AMD PRO A10-8700B    4.55 GB/s                          9.65 GB/s                       2.1X
AMD A4-7210          1.37 GB/s                          4.94 GB/s                       3.6X
Intel Xeon W-2135    5.01 GB/s                          21.6 GB/s                       4.3X
Intel Celeron N3060  1.29 GB/s                          4.03 GB/s                       3.1X

@penzn (Contributor) commented Jan 26, 2021

The version using conversions looks great, but for the semi-SIMD version, wouldn't it be faster to construct the f64x2 from individual array elements (instead of first loading an array chunk into an f32x4, then breaking it down and recreating it as f64x2)?

@Maratyszcza (Contributor Author)

@penzn I tried a version with scalar loads:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const float vx0 = x[0];
    const float vx1 = x[1];
    const float vy0 = y[0];
    const float vy1 = y[1];

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx0, (double) vx1),
        wasm_f64x2_make((double) vy0, (double) vy1)));

    const float vx2 = x[2];
    const float vx3 = x[3];
    x += 4;

    const float vy2 = y[2];
    const float vy3 = y[3];
    y += 4;

    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx2, (double) vx3),
        wasm_f64x2_make((double) vy2, (double) vy3)));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

Performance is presented below:

Implementation       Performance, scalar loads & conversions   Performance, SIMD conversions   Speedup
AMD PRO A10-8700B    4.77 GB/s                                  9.65 GB/s                       2.0X
AMD A4-7210          1.52 GB/s                                  4.94 GB/s                       3.3X
Intel Xeon W-2135    6.52 GB/s                                  21.6 GB/s                       3.3X
Intel Celeron N3060  1.31 GB/s                                  4.03 GB/s                       3.1X

@ngzhian (Member) commented Jan 26, 2021

Some good speedups with f64x2.promote_low_f32x4, and we also have a use case in OpenCV. The lowering is fantastic as well, so I am supportive of this instruction.
However, this PR is a set of 6 different instructions. Some of them look very different in terms of lowering and don't have use cases listed.
Would it be useful to split some of these instructions out (maybe promote and demote)?

@jan-wassenberg commented Jan 29, 2021

For later relaxed semantics of floating-point to integer, I think it would suffice to have, on x86:

truncated = cvttps(original)
return truncated ^ BroadcastSign(AndNot(original, truncated))

(We're fixing up 0x80..00 to 0x7F..FF if the sign changed)
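Here is how I read that suggestion, sketched with SSE2 intrinsics (the function name and structure are my own, not from the comment):

#include <emmintrin.h>

/* Hedged sketch: CVTTPS2DQ returns 0x80000000 for out-of-range or NaN inputs;
   when the input was non-negative, flip that to 0x7FFFFFFF by XORing with a
   broadcast of the "sign changed" bit. */
static __m128i truncate_with_fixup(__m128 original) {
  const __m128i truncated = _mm_cvttps_epi32(original);
  const __m128i sign_changed = _mm_andnot_si128(_mm_castps_si128(original), truncated);
  const __m128i fixup = _mm_srai_epi32(sign_changed, 31);  /* broadcast sign bit to all 32 bits */
  return _mm_xor_si128(truncated, fixup);
}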

@Maratyszcza (Contributor Author)

For relaxed semantics my preference would be to directly map to CVTTPD2DQ. In relaxed semantics the out-of-bounds behavior doesn't need to be the same across architectures.

@ngzhian (Member) commented Jan 29, 2021

At the risk of making you do more work @Maratyszcza, what do you think about splitting promote/demote out of this PR? From the meeting earlier, I got the sense that most of us feel pretty comfortable about promote/demote. We can probably get those in without much discussion.

f64x2.convert_i32x4_{s,u} should maybe be split out as well; these require less discussion than trunc_sat (better lowering), and also already have use cases and benchmarks (at least the signed ones).

Having trunc_sat out on its own would let us have a more focused discussion. The risk here, I guess, is orthogonality. But grouping all these instructions together risks losing some useful instructions because of objections to the poor lowering of others.

@Maratyszcza (Contributor Author) commented Jan 29, 2021

@ngzhian These instructions were voted on together, so I'd rather have them merged together. f64x2.convert_i32x4_{s,u} don't really have an alternative, aside from scalarization, which has the same lowering issues; and if we have these instructions in the spec, we can at least fix inefficient lowerings in Fast SIMD.

Besides, I evaluated performance on this code for evaluating exp(x) (the remez5_0_log2 kernel, which is reused across several GitHub projects). Here's the full-SIMD version (with SIMD conversions):

void remez5_0_log2_simd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);

  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);

  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;

    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;

    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);

    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);

    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    k1 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a1);
    k2 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a2);
    p1 = __builtin_wasm_convert_low_s_i32x4_f64x2(k1);
    p2 = __builtin_wasm_convert_low_s_i32x4_f64x2(k2);

    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);

    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 2, 0, 3, 1);
    k2 = wasm_v32x4_shuffle(k2, k2, 2, 0, 3, 1);
    p1 = k1;
    p2 = k2;

    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);

    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;

    num -= 4;
  } while (num != 0);
}

Here's the semi-SIMD version (scalar conversions, rest is SIMD):

void remez5_0_log2_semisimd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);

  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);

  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;

    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;

    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);

    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);

    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    const int k1_lo = (int) wasm_f64x2_extract_lane(a1, 0);
    const int k1_hi = (int) wasm_f64x2_extract_lane(a1, 1);
    const int k2_lo = (int) wasm_f64x2_extract_lane(a2, 0);
    const int k2_hi = (int) wasm_f64x2_extract_lane(a2, 1);
    k1 = wasm_i32x4_make(k1_lo, k1_hi, 0, 0);
    k2 = wasm_i32x4_make(k2_lo, k2_hi, 0, 0);
    p1 = wasm_f64x2_make((double) k1_lo, (double) k1_hi);
    p2 = wasm_f64x2_make((double) k2_lo, (double) k2_hi);

    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);

    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 2, 0, 3, 1);
    k2 = wasm_v32x4_shuffle(k2, k2, 2, 0, 3, 1);
    p1 = k1;
    p2 = k2;

    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);

    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;

    num -= 4;
  } while (num != 0);
}

Here are the performance results:

Processor (Device)                  Performance, scalar conversions   Performance, SIMD conversions   Speedup
AMD PRO A10-8700B                   604 MB/s                          1428 MB/s                       2.4X
AMD A4-7210                         76.4 MB/s                         89.7 MB/s                       1.2X
Intel Xeon W-2135                   291 MB/s                          355 MB/s                        1.2X
Intel Celeron N3060                 102 MB/s                          130 MB/s                        1.3X
Qualcomm Snapdragon 670 (Pixel 3a)  414 MB/s                          640 MB/s                        1.5X
Samsung Exynos 8895 (Galaxy S8)     389 MB/s                          487 MB/s                        2.3X

@dtig (Member) commented Feb 1, 2021

Merging as per #429 and discussion above.

@dtig merged commit dae9f6c into WebAssembly:master on Feb 1, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 12, 2021
This converts 2 f64 to 2 i32, then zeroes the top 2 lanes.

These 2 instructions were merged as part of WebAssembly#383.

This change also refactors some test cases from simd_conversions out
into a script that generates both i32x4.trunc_sat_f32x4_{s,u} and
i32x4.trunc_sat_f64x2_{s,u}_zero.
ngzhian added a commit that referenced this pull request Feb 17, 2021
* [interpreter] Implement i32x4.trunc_sat_f64x2_{s,u}_zero

This converts 2 f64 to 2 i32, then zeroes the top 2 lanes.

These 2 instructions were merged as part of #383.

This change also refactors some test cases from simd_conversions out
into a script that generates both i32x4.trunc_sat_f32x4_{s,u} and
i32x4.trunc_sat_f64x2_{s,u}_zero.

* Add encode/decode

* Update interpreter/exec/simd.ml

Co-authored-by: Andreas Rossberg <[email protected]>

Co-authored-by: Andreas Rossberg <[email protected]>
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 17, 2021
These 2 instructions were merged as part of WebAssembly#383.

The test cases are added manually to simd_conversions.wast, using
constants from test/core/conversions.wast.
ngzhian added a commit that referenced this pull request Feb 18, 2021
These 2 instructions were merged as part of #383.

The test cases are added manually to simd_conversions.wast, using
constants from test/core/conversions.wast.
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 18, 2021
These 2 instructions were merged as part of WebAssembly#383.
ngzhian added a commit that referenced this pull request Feb 23, 2021
These 2 instructions were merged as part of #383.
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 23, 2021
Instructions were added in WebAssembly#383.

Consolidate conversion operations (vcvtop) more, merging int-int
widening operations.

Drive-by fix extmul definition in syntax (shouldn't include the shape).
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 24, 2021
Instructions were added in WebAssembly#383.

Consolidate conversion operations (vcvtop) more, merging int-int
widening operations.

Drive-by fix extmul definition in syntax (shouldn't include the shape).
ngzhian added a commit that referenced this pull request Feb 24, 2021
Instructions were added in #383.

Consolidate conversion operations (vcvtop) more, merging int-int
widening operations.

Drive-by fix extmul definition in syntax (shouldn't include the shape).
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 30, 2021
@jlb6740 commented Jun 10, 2021

@Maratyszcza @ngzhian Hi, I am lowering f64x2.convert_low_i32x4_u for SSE2 in a backend. I see this conversion here:

f64x2.convert_low_i32x4_u
y = f64x2.convert_low_i32x4_u(x) is lowered to:
MOVAPS xmm_y, xmm_x
UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]

I am trying to understand the magic here. Do you have a reference with more detail, or an explanation? I checked out V8's implementation here: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=2406;drc=92a46414765f4873251cf9ecefbef130648e2af8 where it is mentioned that

 // dst = [ src_low, 0x43300000, src_high, 0x43300000 ];
 // 0x43300000'00000000 is a special double where the significand bits
 // precisely represents all uint32 numbers.

But I am not really sure what this implies about how the conversion works. Any comments?

@Maratyszcza (Contributor Author)

@jlb6740 0x43300000 is the high 32 bits of 0x1.0p+52. ULP(0x1.0p+52) == 1.0, so the whole value after the unpack is 0x1.0p+52 + (double) x, and the final subtract recovers (double) x.
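A scalar sketch of the same trick, for illustration only (mirroring what the UNPCKLPS/SUBPD pair does per lane):

#include <stdint.h>
#include <string.h>

/* Hedged sketch: pairing a u32 with the high word 0x43300000 yields the bit
   pattern of 0x1.0p+52 + (double) x exactly, because ULP(0x1.0p+52) == 1.0;
   the subtract then recovers (double) x. */
static double u32_to_double(uint32_t x) {
  const uint64_t bits = 0x4330000000000000ull | (uint64_t) x;
  double biased;
  memcpy(&biased, &bits, sizeof(biased));
  return biased - 0x1.0p+52;
}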

@jlb6740 commented Jun 14, 2021

@Maratyszcza Hi, thanks so much. I got it now! The "magic" I was missing was in the subtract.
