This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Double-precision conversion instructions #383

Merged 1 commit into WebAssembly:master on Feb 1, 2021

Conversation

@Maratyszcza (Contributor) commented Oct 13, 2020

Introduction

As discussed in #348, the WebAssembly SIMD specification is functionally incomplete w.r.t. double-precision computations: it doesn't cover many basic C/C++ operations, particularly conversions between the double type and other types. Currently, any code that uses these operations is de-vectorized (see examples in the emmintrin.h header in Emscripten). Double-precision computations are extensively used in scientific computing, and providing a complete set of double-precision operations is a must for bringing scientific computing applications to the Web. In the foreseeable future, WebAssembly SIMD will be the only avenue for high-performance scientific computing on the Web, as neither WebGL nor WebGPU supports double-precision computations.

This PR introduces the minimal set of conversion operations needed to fill this gap.

Applications

  • Conversions from double to int are used when storing data in quantized representation and in table-based algorithms (real-valued indices need to be converted to integer indices for table lookups).
  • Conversions from int to double are used when loading data in quantized representation and in random number generation (the raw output of an RNG is typically int, but numerical algorithms need real values); a minimal sketch follows this list.
  • Conversions between double and float are used in mixed-precision algorithms, where the less sensitive parts of the application are computed in single precision and the more sensitive parts in double precision. E.g., an iterative algorithm may first run in single precision until convergence, and then run double-precision iterations starting from the single-precision solution.
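As a minimal illustration of the int-to-double case (not taken from this PR), here is a hedged sketch of dequantizing two int32 lanes; the intrinsic name is assumed to be the spelling that later landed in wasm_simd128.h, since at the time of this proposal only compiler builtins were available:

#include <wasm_simd128.h>

/* Hedged sketch: dequantize the two low int32 lanes of q to doubles using the
   proposed f64x2.convert_low_i32x4_s, then apply a quantization scale.
   wasm_f64x2_convert_low_i32x4 is assumed to be the eventual intrinsic name. */
static v128_t dequantize_low_2(v128_t q_i32x4, double scale) {
  const v128_t as_f64 = wasm_f64x2_convert_low_i32x4(q_i32x4);
  return wasm_f64x2_mul(as_f64, wasm_f64x2_splat(scale));
}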

Use-cases in OpenCV

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F + AVX512VL instruction sets

  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to VCVTUDQ2PD xmm_y, xmm_x
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) (y is NOT x) is lowered to:
      • VXORPD xmm_y, xmm_y, xmm_y
      • VMAXPD xmm_y, xmm_x, xmm_y
      • VCVTTPD2UDQ xmm_y, xmm_y

x86/x86-64 processors with AVX instruction set

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) is lowered to VCVTDQ2PD xmm_y, xmm_x
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to:
      • VUNPCKLPS xmm_y, xmm_x, [wasm_i32x4_splat(0x43300000)]
      • VSUBPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) (y is NOT x) is lowered to:
      • VCMPEQPD xmm_y, xmm_x, xmm_x
      • VANDPD xmm_y, xmm_y, [wasm_f64x2_splat(2147483647.0)]
      • VMINPD xmm_y, xmm_x, xmm_y
      • VCVTTPD2DQ xmm_y, xmm_y
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to the sequence below (a scalar sketch of this trick follows the list):
      • VXORPD xmm_tmp, xmm_tmp, xmm_tmp
      • VMAXPD xmm_y, xmm_x, xmm_tmp
      • VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
      • VROUNDPD xmm_y, xmm_y, 0x0B
      • VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
      • VSHUFPS xmm_y, xmm_y, xmm_tmp, 0x88
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to VCVTPD2PS xmm_y, xmm_x
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to VCVTPS2PD xmm_y, xmm_x
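The i32x4.trunc_sat_f64x2_u_zero sequence above (and its SSE4.1 variant below) relies on a bias trick. A hedged per-lane scalar sketch of my reading of what it computes, not a normative description:

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hedged sketch: clamp to [0, UINT32_MAX], truncate toward zero (ROUNDPD imm
   0x0B), then add 0x1.0p+52 so the integer lands in the low 32 bits of the
   double's bit pattern, which the final SHUFPS gathers into i32 lanes. */
static uint32_t trunc_sat_u_lane(double v) {
  if (!(v > 0.0)) v = 0.0;                 /* MAXPD with 0: NaN and negatives become 0 */
  if (v > 4294967295.0) v = 4294967295.0;  /* MINPD with UINT32_MAX */
  v = trunc(v);                            /* ROUNDPD toward zero */
  v += 0x1.0p+52;                          /* integer now occupies the low mantissa bits */
  uint64_t bits;
  memcpy(&bits, &v, sizeof(bits));
  return (uint32_t) bits;                  /* SHUFPS 0x88 keeps the low 32-bit word of each lane */
}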

x86/x86-64 processors with SSE4.1 instruction set

  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
      • MOVAPD xmm_y, xmm_x
      • XORPD xmm_tmp, xmm_tmp
      • MAXPD xmm_y, xmm_tmp
      • MINPD xmm_y, [wasm_f64x2_splat(4294967295.0)]
      • ROUNDPD xmm_y, xmm_y, 0x0B
      • ADDPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
      • SHUFPS xmm_y, xmm_tmp, 0x88

x86/x86-64 processors with SSE2 instruction set

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) is lowered to CVTDQ2PD xmm_y, xmm_x
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to:
      • MOVAPS xmm_y, xmm_x
      • UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
      • SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
      • XORPS xmm_tmp, xmm_tmp
      • CMPEQPD xmm_tmp, xmm_x
      • MOVAPS xmm_y, xmm_x
      • ANDPS xmm_tmp, [wasm_f64x2_splat(2147483647.0)]
      • MINPD xmm_y, xmm_tmp
      • CVTTPD2DQ xmm_y, xmm_y
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to CVTPD2PS xmm_y, xmm_x
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to CVTPS2PD xmm_y, xmm_x

ARM64 processors

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) is lowered to:
      • SSHLL Vy.2D, Vx.2S, 0
      • SCVTF Vy.2D, Vy.2D
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) is lowered to:
      • USHLL Vy.2D, Vx.2S, 0
      • UCVTF Vy.2D, Vy.2D
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) is lowered to:
      • FCVTZS Vy.2D, Vx.2D
      • SQXTN Vy.2S, Vy.2D
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) is lowered to:
      • FCVTZU Vy.2D, Vx.2D
      • UQXTN Vy.2S, Vy.2D
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to FCVTN Vy.2S, Vx.2D
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to FCVTL Vy.2D, Vx.2S

ARMv7 processors with NEON

  • f64x2.convert_low_i32x4_s
    • y = f64x2.convert_low_i32x4_s(x) (x in q0-q7) is lowered to:
      • VCVT.F64.S32 Dy_lo, Sx_ll
      • VCVT.F64.S32 Dy_hi, Sx_lh
  • f64x2.convert_low_i32x4_u
    • y = f64x2.convert_low_i32x4_u(x) (x in q0-q7) is lowered to:
      • VCVT.F64.U32 Dy_lo, Sx_ll
      • VCVT.F64.U32 Dy_hi, Sx_lh
  • i32x4.trunc_sat_f64x2_s_zero
    • y = i32x4.trunc_sat_f64x2_s_zero(x) (y in q0-q7) is lowered to:
      • VCVT.S32.F64 Sy_ll, Dx_lo
      • VCVT.S32.F64 Sy_lh, Dx_hi
      • VMOV.I32 Dy_hi, 0
  • i32x4.trunc_sat_f64x2_u_zero
    • y = i32x4.trunc_sat_f64x2_u_zero(x) (y in q0-q7) is lowered to:
      • VCVT.U32.F64 Sy_ll, Dx_lo
      • VCVT.U32.F64 Sy_lh, Dx_hi
      • VMOV.I32 Dy_hi, 0
  • f32x4.demote_f64x2_zero
    • y = f32x4.demote_f64x2_zero(x) is lowered to:
      • VCVT.F32.F64 Sy_ll, Dx_lo
      • VCVT.F32.F64 Sy_lh, Dx_hi
      • VMOV.I32 Dy_hi, 0
  • f64x2.promote_low_f32x4
    • y = f64x2.promote_low_f32x4(x) is lowered to:
      • VCVT.F64.F32 Dy_lo, Sx_ll
      • VCVT.F64.F32 Dy_hi, Sx_lh

@ngzhian (Member) commented Dec 14, 2020

What does the ARM codegen look like? Similar to ARM64, or complicated enough that we will likely do runtime calls?

@Maratyszcza (Contributor Author)

Similar to ARM64, but with additional constraints on register allocation (only the first 8 NEON registers are accessible as S FPU registers).

@ngzhian (Member) commented Dec 23, 2020

Do you have more specific pointers to code that would benefit from this? And do you have plans for benchmarking if this is prototyped?
I would also like to hear from more people about whether this is useful or not.

@Maratyszcza (Contributor Author)

These instructions are commonly used in scientific computing, but it is hard to provide code pointers because explicit vectorization in scientific computing codes is rare; they usually rely on compiler auto-vectorization of C/C++ and Fortran code. My plan for evaluating these is to implement a XorShift RNG with mapping of the output to [0.0, 1.0]-range double-precision numbers.
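For reference, a minimal sketch of that mapping (an illustration only, not the eventual benchmark); the intrinsic name is assumed to be the unsigned-conversion spelling that later landed in wasm_simd128.h:

#include <wasm_simd128.h>

/* Hedged sketch: map the two low u32 lanes of an RNG output vector to doubles
   in [0.0, 1.0) via f64x2.convert_low_i32x4_u followed by a scale by 2^-32. */
static v128_t rng_bits_to_unit_doubles(v128_t rng_u32x4) {
  const v128_t as_f64 = wasm_f64x2_convert_low_u32x4(rng_u32x4);
  return wasm_f64x2_mul(as_f64, wasm_f64x2_splat(0x1.0p-32));
}

The unsigned conversion is the natural fit here because the raw RNG bits span the full 32-bit range.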

@jan-wassenberg

@Maratyszcza

implement XorShift RNG with mapping of output to [0.0, 1.0]-range double-precision numbers.

FYI there is code for the integer portion in https://gitlab.com/wg1/jpeg-xl/-/blob/master/lib/jxl/xorshift128plus-inl.h :)

@dtig (Member) commented Jan 5, 2021

Compiler auto-vectorization performance is somewhat nebulous, especially cross-platform performance for i64x2 operations. The last time we tried this, IIRC we didn't see any benefit from auto-vectorizing. Perhaps one way to make a decision on many of these i64x2 operations is to evaluate a couple of sample scientific applications that might use these operations and see whether the generated code would actually map to them.

@Maratyszcza (or anyone else) Are there any open source applications that we can evaluate in addition to the jpeg-xl link above?

@jan-wassenberg

I'd like to express support for this. JPEG XL promotes floats to double and the Highway math library requires f64 -> i32 conversions.

@ngzhian (Member) commented Jan 7, 2021

Promote and demote map to native instructions nicely.
The others, especially f64x2->i32x4, look bad. We shouldn't encourage the use of these.

@jan-wassenberg

f64x2->i32x4 looks bad. We shouldn't encourage the use of these.

Can you help me understand your concern? Math libraries need floatN -> intM to do bitwise ops (shifting exponent into place for LoadExp). For N=64 (double), x86 only provides M=32. This works out OK for ARM as well. double->int64 would be much more expensive on x86. Do you see a better alternative?
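For context, a minimal scalar sketch of the LoadExp-style operation referred to above (my formulation with illustrative names, not Highway's actual API), showing why a lane-wise f64 -> i32 path is wanted:

#include <stdint.h>
#include <string.h>

/* Hedged sketch: scale x by 2^k by constructing the double 2^k directly from
   its biased exponent field. Assumes k+1023 stays within the normal range. */
static double load_exp(double x, int32_t k) {
  const uint64_t bits = (uint64_t)(uint32_t)(k + 1023) << 52;
  double pow2k;
  memcpy(&pow2k, &bits, sizeof(pow2k));
  return x * pow2k;
}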

@ngzhian (Member) commented Jan 7, 2021

I'm only pointing out that the codegen for i32x4.trunc_sat_f64x2_s_zero looks bad; it doesn't map well to native instructions at all.
I don't know what LoadExp is. For math libraries, would they choose to use this instruction given that it has such a poor mapping? Or would they come up with workarounds for the specific use case?
The obvious alternative is to scalarize; it would be slower, but these SIMD instructions are slow too.

@Maratyszcza (Contributor Author)

i32x4.trunc_sat_f64x2_s_zero lowers to multiple instructions on x86 because it has to simulate WAsm semantics for out-of-bounds values. Scalar instructions would have to simulate it too, one lane at a time.
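For reference, here is a per-lane scalar sketch of the semantics being simulated (a paraphrase for illustration, not the normative spec text):

#include <stdint.h>
#include <math.h>

/* Hedged sketch of i32x4.trunc_sat_f64x2_s_zero semantics for the two low
   lanes; the two high i32 lanes of the result are zero. */
static int32_t trunc_sat_s_lane(double v) {
  if (isnan(v)) return 0;                         /* NaN lanes become 0 */
  if (v <= (double) INT32_MIN) return INT32_MIN;  /* saturate below */
  if (v >= (double) INT32_MAX) return INT32_MAX;  /* saturate above */
  return (int32_t) v;                             /* otherwise truncate toward zero */
}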

tlively added a commit to tlively/binaryen that referenced this pull request Jan 19, 2021
As proposed in WebAssembly/simd#383, with opcodes
coordinated with the WIP V8 prototype.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Jan 20, 2021
As proposed in WebAssembly/simd#383, with opcodes
coordinated with the WIP V8 prototype.
tlively added a commit to llvm/llvm-project that referenced this pull request Jan 20, 2021
@ngzhian (Member) commented Jan 20, 2021

For i32x4.trunc_sat_f64x2_s_zero SSE2 lowering, I think the first instruction is wrong:

XORPS xmm_tmp, xmm_tmp
CMPEQPD xmm_tmp, xmm_x
...

This is a compare with 0; should it be

MOVAPS xmm_tmp, xmm_x
CMPEQPD xmm_tmp, xmm_tmp
...

?

@ngzhian (Member) commented Jan 20, 2021

Any suggested lowering for SSE2 i32x4.trunc_sat_f64x2_u_zero? Translate the AVX sequence to non-AVX?

@dtig (Member) commented Jan 25, 2021

Adding a preliminary vote below for the inclusion of double-precision conversion operations in the SIMD proposal. Please vote with:

👍 For including double-precision conversion operations
👎 Against including double-precision conversion operations

@penzn (Contributor) commented Jan 25, 2021

I don't think using a poor SIMD sequence instead of a poor scalar sequence is really a solution, plus we have no performance data on this. I don't think this instruction is ready to be included.

@Maratyszcza (Contributor Author)

@penzn I'm working on performance data. Expect it later today.

pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jan 25, 2021
Prototype these 6 instructions on x64:

- f64x2.convert_low_i32x4_s
- f64x2.convert_low_i32x4_u
- i32x4.trunc_sat_f64x2_s_zero
- i32x4.trunc_sat_f64x2_u_zero
- f32x4.demote_f64x2_zero
- f64x2.promote_low_f32x4

Some of these code sequences make use of special masks; we keep them in
external references.

Code sequence based on suggestions at:
WebAssembly/simd#383

Bug: v8:11265
Change-Id: Ied67d7b5b6beaaccac7c179ec13504482cb9c915
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2643562
Reviewed-by: Deepti Gandluri <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#72297}
@ngzhian added the "2021-01-29 Agenda for sync meeting 1/29/21" label on Jan 26, 2021
@Maratyszcza (Contributor Author) commented Jan 26, 2021

To evaluate this proposal I implemented SIMD and semi-SIMD (mostly SIMD, but using scalar conversions) versions of a mixed-precision dot product, where the inputs are single-precision floats but the accumulators and outputs are double-precision.

Here's the SIMD version:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    v128_t vx = wasm_v128_load(x);
    x += 4;

    v128_t vy = wasm_v128_load(y);
    y += 4;

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));
    vx = wasm_v32x4_shuffle(vx, vx, 2, 3, 0, 1);
    vy = wasm_v32x4_shuffle(vy, vy, 2, 3, 0, 1);
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        __builtin_wasm_promote_low_f32x4_f64x2(vx),
        __builtin_wasm_promote_low_f32x4_f64x2(vy)));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

And here's the semi-SIMD version:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const v128_t vx = wasm_v128_load(x);
    x += 4;

    const v128_t vy = wasm_v128_load(y);
    y += 4;

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 0),
          (double) wasm_f32x4_extract_lane(vx, 1)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 0),
          (double) wasm_f32x4_extract_lane(vy, 1))));
    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vx, 2),
          (double) wasm_f32x4_extract_lane(vx, 3)),
        wasm_f64x2_make(
          (double) wasm_f32x4_extract_lane(vy, 2),
          (double) wasm_f32x4_extract_lane(vy, 3))));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

So far V8 implements the double-precision conversion instructions only on x86-64 (ARM64 implementation is in review), and the results on x86-64 systems are presented below:

Implementation       Performance, scalar conversions   Performance, SIMD conversions   Speedup
AMD PRO A10-8700B    4.55 GB/s                          9.65 GB/s                       2.1X
AMD A4-7210          1.37 GB/s                          4.94 GB/s                       3.6X
Intel Xeon W-2135    5.01 GB/s                          21.6 GB/s                       4.3X
Intel Celeron N3060  1.29 GB/s                          4.03 GB/s                       3.1X

@penzn (Contributor) commented Jan 26, 2021

The version using conversions looks great, but for the semi-SIMD version, wouldn't it be faster to construct the f64x2 from individual array elements (instead of first loading an array chunk into an f32x4, then breaking it down and recreating it as f64x2)?

@Maratyszcza (Contributor Author)

@penzn I tried a version with scalar loads:

double DotProd(size_t n, const float* x, const float* y) {
  assert(n != 0);
  assert(n % 4 == 0);

  v128_t vacc_lo = wasm_f64x2_const(0.0, 0.0);
  v128_t vacc_hi = wasm_f64x2_const(0.0, 0.0);
  do {
    const float vx0 = x[0];
    const float vx1 = x[1];
    const float vy0 = y[0];
    const float vy1 = y[1];

    vacc_lo = wasm_f64x2_add(vacc_lo,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx0, (double) vx1),
        wasm_f64x2_make((double) vy0, (double) vy1)));

    const float vx2 = x[2];
    const float vx3 = x[3];
    x += 4;

    const float vy2 = y[2];
    const float vy3 = y[3];
    y += 4;

    vacc_hi = wasm_f64x2_add(vacc_hi,
      wasm_f64x2_mul(
        wasm_f64x2_make((double) vx2, (double) vx3),
        wasm_f64x2_make((double) vy2, (double) vy3)));

    n -= 4;
  } while (n != 0);
  const v128_t vacc = wasm_f64x2_add(vacc_lo, vacc_hi);
  return wasm_f64x2_extract_lane(vacc, 0) + wasm_f64x2_extract_lane(vacc, 1);
}

Performance is presented below:

Implementation       Performance, scalar loads & conversions   Performance, SIMD conversions   Speedup
AMD PRO A10-8700B    4.77 GB/s                                  9.65 GB/s                       2.0X
AMD A4-7210          1.52 GB/s                                  4.94 GB/s                       3.3X
Intel Xeon W-2135    6.52 GB/s                                  21.6 GB/s                       3.3X
Intel Celeron N3060  1.31 GB/s                                  4.03 GB/s                       3.1X

@ngzhian (Member) commented Jan 26, 2021

Some good speedups with f64x2.promote_low_f32x4, and we also have a use case in OpenCV. The lowering is fantastic as well, so I am supportive of this instruction.
However, this PR is a set of 6 different instructions. Some of them look very different in terms of lowering and don't have use cases listed.
Would it be useful to split some of these instructions out (maybe promote and demote)?

@jan-wassenberg commented Jan 29, 2021

For later relaxed semantics of floating-point to integer, I think it would suffice to have, on x86:

truncated = cvttps(original)
return truncated ^ BroadcastSign(AndNot(original, truncated))

(We're fixing up 0x80..00 to 0x7F..FF if the sign changed)
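Here is how I read that suggestion, sketched with SSE2 intrinsics (the function name and structure are my own, not from the comment):

#include <emmintrin.h>

/* Hedged sketch: CVTTPS2DQ returns 0x80000000 for out-of-range or NaN inputs;
   when the input was non-negative, flip that to 0x7FFFFFFF by XORing with a
   broadcast of the "sign changed" bit. */
static __m128i truncate_with_fixup(__m128 original) {
  const __m128i truncated = _mm_cvttps_epi32(original);
  const __m128i sign_changed = _mm_andnot_si128(_mm_castps_si128(original), truncated);
  const __m128i fixup = _mm_srai_epi32(sign_changed, 31);  /* broadcast sign bit to all 32 bits */
  return _mm_xor_si128(truncated, fixup);
}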

@Maratyszcza (Contributor Author)

For relaxed semantics my preference would be to directly map to CVTTPD2DQ. In relaxed semantics the out-of-bounds behavior doesn't need to be the same across architectures.

@ngzhian (Member) commented Jan 29, 2021

At the risk of making you do more work @Maratyszcza, what do you think about splitting promote/demote out of this PR? From the meeting earlier, I got the sense that most of us feel pretty comfortable about promote/demote. We can probably get those in without much discussion.

f64x2.convert_i32x4_{s,u} should maybe be split out as well; these require less discussion than trunc_sat (better lowering), and also already have use cases and benchmarks (at least the signed ones).

Having trunc_sat out on its own would let us have a more focused discussion. The risk here, I guess, is orthogonality. But grouping all these instructions together risks losing some useful instructions because of objections to the poor lowering of others.

@Maratyszcza (Contributor Author) commented Jan 29, 2021

@ngzhian These instructions were voted on together, so I'd rather have them merged together. f64x2.convert_i32x4_{s,u} don't really have an alternative, aside from scalarization, which has the same lowering issues; and if we have these instructions in the spec, we can at least fix inefficient lowerings in Fast SIMD.

Besides, I evaluated performance on this code for evaluating exp(x) (the remez5_0_log2 kernel, which is reused across several GitHub projects). Here's the full-SIMD version (with SIMD conversions):

void remez5_0_log2_simd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);

  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);

  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;

    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;

    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);

    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);

    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    k1 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a1);
    k2 = __builtin_wasm_trunc_saturate_zero_s_f64x2_i32x4(a2);
    p1 = __builtin_wasm_convert_low_s_i32x4_f64x2(k1);
    p2 = __builtin_wasm_convert_low_s_i32x4_f64x2(k2);

    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);

    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 2, 0, 3, 1);
    k2 = wasm_v32x4_shuffle(k2, k2, 2, 0, 3, 1);
    p1 = k1;
    p2 = k2;

    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);

    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;

    num -= 4;
  } while (num != 0);
}

Here's the semi-SIMD version (scalar conversions, rest is SIMD):

void remez5_0_log2_semisimd(const double *inputs, double* outputs, size_t num) {
  assert(num != 0);
  assert(num % 4 == 0);

  const v128_t offset = wasm_i32x4_const(1023, 1023, 0, 0);

  do {
    v128_t k1, k2;
    v128_t p1, p2;
    v128_t a1, a2;
    v128_t xmm0, xmm1;
    v128_t x1, x2;

    /* Load four double values. */
    xmm0 = wasm_f64x2_const(7.09782712893383996843e2, 7.09782712893383996843e2);
    xmm1 = wasm_f64x2_const(-7.08396418532264106224e2, -7.08396418532264106224e2);
    x1 = wasm_v128_load(inputs);
    x2 = wasm_v128_load(inputs+2);
    inputs += 4;

    x1 = wasm_f64x2_pmin(xmm0, x1);
    x2 = wasm_f64x2_pmin(xmm0, x2);
    x1 = wasm_f64x2_pmax(xmm1, x1);
    x2 = wasm_f64x2_pmax(xmm1, x2);

    /* a = x / log2; */
    xmm0 = wasm_f64x2_const(1.4426950408889634073599, 1.4426950408889634073599);
    xmm1 = wasm_f64x2_const(0.0, 0.0);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);

    /* k = (int)floor(a); p = (float)k; */
    p1 = wasm_f64x2_lt(a1, xmm1);
    p2 = wasm_f64x2_lt(a2, xmm1);
    xmm0 = wasm_f64x2_const(1.0, 1.0);
    p1 = wasm_v128_and(p1, xmm0);
    p2 = wasm_v128_and(p2, xmm0);
    a1 = wasm_f64x2_sub(a1, p1);
    a2 = wasm_f64x2_sub(a2, p2);
    const int k1_lo = (int) wasm_f64x2_extract_lane(a1, 0);
    const int k1_hi = (int) wasm_f64x2_extract_lane(a1, 1);
    const int k2_lo = (int) wasm_f64x2_extract_lane(a2, 0);
    const int k2_hi = (int) wasm_f64x2_extract_lane(a2, 1);
    k1 = wasm_i32x4_make(k1_lo, k1_hi, 0, 0);
    k2 = wasm_i32x4_make(k2_lo, k2_hi, 0, 0);
    p1 = wasm_f64x2_make((double) k1_lo, (double) k1_hi);
    p2 = wasm_f64x2_make((double) k2_lo, (double) k2_hi);

    /* x -= p * log2; */
    xmm0 = wasm_f64x2_const(6.93145751953125E-1, 6.93145751953125E-1);
    xmm1 = wasm_f64x2_const(1.42860682030941723212E-6, 1.42860682030941723212E-6);
    a1 = wasm_f64x2_mul(p1, xmm0);
    a2 = wasm_f64x2_mul(p2, xmm0);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);
    a1 = wasm_f64x2_mul(p1, xmm1);
    a2 = wasm_f64x2_mul(p2, xmm1);
    x1 = wasm_f64x2_sub(x1, a1);
    x2 = wasm_f64x2_sub(x2, a2);

    /* Compute e^x using a polynomial approximation. */
    xmm0 = wasm_f64x2_const(1.185268231308989403584147407056378360798378534739e-2, 1.185268231308989403584147407056378360798378534739e-2);
    xmm1 = wasm_f64x2_const(3.87412011356070379615759057344100690905653320886699e-2, 3.87412011356070379615759057344100690905653320886699e-2);
    a1 = wasm_f64x2_mul(x1, xmm0);
    a2 = wasm_f64x2_mul(x2, xmm0);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(0.16775408658617866431779970932853611481292418818223, 0.16775408658617866431779970932853611481292418818223);
    xmm1 = wasm_f64x2_const(0.49981934577169208735732248650232562589934399402426, 0.49981934577169208735732248650232562589934399402426);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    xmm0 = wasm_f64x2_const(1.00001092396453942157124178508842412412025643386873, 1.00001092396453942157124178508842412412025643386873);
    xmm1 = wasm_f64x2_const(0.99999989311082729779536722205742989232069120354073, 0.99999989311082729779536722205742989232069120354073);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm0);
    a2 = wasm_f64x2_add(a2, xmm0);
    a1 = wasm_f64x2_mul(a1, x1);
    a2 = wasm_f64x2_mul(a2, x2);
    a1 = wasm_f64x2_add(a1, xmm1);
    a2 = wasm_f64x2_add(a2, xmm1);

    /* p = 2^k; */
    k1 = wasm_i32x4_add(k1, offset);
    k2 = wasm_i32x4_add(k2, offset);
    k1 = wasm_i32x4_shl(k1, 20);
    k2 = wasm_i32x4_shl(k2, 20);
    k1 = wasm_v32x4_shuffle(k1, k1, 2, 0, 3, 1);
    k2 = wasm_v32x4_shuffle(k2, k2, 2, 0, 3, 1);
    p1 = k1;
    p2 = k2;

    /* a *= 2^k. */
    a1 = wasm_f64x2_mul(a1, p1);
    a2 = wasm_f64x2_mul(a2, p2);

    /* Store the results. */
    wasm_v128_store(outputs, a1);
    wasm_v128_store(outputs+2, a2);
    outputs += 4;

    num -= 4;
  } while (num != 0);
}

Here are the performance results:

Processor (Device)                  Performance, scalar conversions   Performance, SIMD conversions   Speedup
AMD PRO A10-8700B                   604 MB/s                          1428 MB/s                       2.4X
AMD A4-7210                         76.4 MB/s                         89.7 MB/s                       1.2X
Intel Xeon W-2135                   291 MB/s                          355 MB/s                        1.2X
Intel Celeron N3060                 102 MB/s                          130 MB/s                        1.3X
Qualcomm Snapdragon 670 (Pixel 3a)  414 MB/s                          640 MB/s                        1.5X
Samsung Exynos 8895 (Galaxy S8)     389 MB/s                          487 MB/s                        2.3X

@dtig (Member) commented Feb 1, 2021

Merging as per #429 and discussion above.

@dtig merged commit dae9f6c into WebAssembly:master on Feb 1, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 12, 2021
This converts 2 f64 to 2 i32, then zeroes the top 2 lanes.

These 2 instructions were merged as part of WebAssembly#383.

This change also refactors some test cases from simd_conversions out
into a script that generates both i32x4.trunc_sat_f32x4_{s,u} and
i32x4.trunc_sat_f64x2_{s,u}_zero.
ngzhian added a commit that referenced this pull request Feb 17, 2021
* [interpreter] Implement i32x4.trunc_sat_f64x2_{s,u}_zero

This converts 2 f64 to 2 i32, then zeroes the top 2 lanes.

These 2 instructions were merged as part of #383.

This change also refactors some test cases from simd_conversions out
into a script that generates both i32x4.trunc_sat_f32x4_{s,u} and
i32x4.trunc_sat_f64x2_{s,u}_zero.

* Add encode/decode

* Update interpreter/exec/simd.ml

Co-authored-by: Andreas Rossberg <[email protected]>

Co-authored-by: Andreas Rossberg <[email protected]>
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 17, 2021
These 2 instructions were merged as part of WebAssembly#383.

The test cases are added manually to simd_conversions.wast, using
constants from test/core/conversions.wast.
ngzhian added a commit that referenced this pull request Feb 18, 2021
These 2 instructions were merged as part of #383.

The test cases are added manually to simd_conversions.wast, using
constants from test/core/conversions.wast.
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 18, 2021
These 2 instructions were merged as part of WebAssembly#383.
ngzhian added a commit that referenced this pull request Feb 23, 2021
These 2 instructions were merged as part of #383.
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 23, 2021
Instructions were added in WebAssembly#383.

Consolidate conversion operations (vcvtop) more, merging int-int
widening operations.

Drive-by fix extmul definition in syntax (shouldn't include the shape).
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 24, 2021
Instructions were added in WebAssembly#383.

Consolidate conversion operations (vcvtop) more, merging int-int
widening operations.

Drive-by fix extmul definition in syntax (shouldn't include the shape).
ngzhian added a commit that referenced this pull request Feb 24, 2021
Instructions were added in #383.

Consolidate conversion operations (vcvtop) more, merging int-int
widening operations.

Drive-by fix extmul definition in syntax (shouldn't include the shape).
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 30, 2021
@jlb6740 commented Jun 10, 2021

@Maratyszcza @ngzhian Hi, I am lowering f64x2.convert_low_i32x4_u for SSE2 in a backend. I see this conversion here:

f64x2.convert_low_i32x4_u
y = f64x2.convert_low_i32x4_u(x) is lowered to:
MOVAPS xmm_y, xmm_x
UNPCKLPS xmm_y, [wasm_i32x4_splat(0x43300000)]
SUBPD xmm_y, [wasm_f64x2_splat(0x1.0p+52)]

I am trying to understand the magic here. Do you have a reference with more detail, or an explanation? I checked out V8's implementation here: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=2406;drc=92a46414765f4873251cf9ecefbef130648e2af8 where it is mentioned that

 // dst = [ src_low, 0x43300000, src_high, 0x43300000 ];
 // 0x43300000'00000000 is a special double where the significand bits
 // precisely represents all uint32 numbers.

But I am not really sure what this implies about how the conversion works. Any comments?

@Maratyszcza (Contributor Author)

@jlb6740 0x43300000 is the high 32 bits of 0x1.0p+52. ULP(0x1.0p+52) == 1.0, so the whole value after the unpack is 0x1.0p+52 + (double) x, and the final subtract recovers (double) x.
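A scalar sketch of the same trick, for illustration only (mirroring what the UNPCKLPS/SUBPD pair does per lane):

#include <stdint.h>
#include <string.h>

/* Hedged sketch: pairing a u32 with the high word 0x43300000 yields the bit
   pattern of 0x1.0p+52 + (double) x exactly, because ULP(0x1.0p+52) == 1.0;
   the subtract then recovers (double) x. */
static double u32_to_double(uint32_t x) {
  const uint64_t bits = 0x4330000000000000ull | (uint64_t) x;
  double biased;
  memcpy(&biased, &bits, sizeof(biased));
  return biased - 0x1.0p+52;
}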

@jlb6740 commented Jun 14, 2021

@Maratyszcza Hi, thanks so much. I got it now! The "magic" I was missing was in the subtract.
