Alternative output for i32x4_relaxed_trunc.wast tests #140

yurydelendik · 2023-03-23T20:43:10Z

Additional variants for i32x4.relaxed_trunc_f32x4_u and i32x4.relaxed_trunc_f64x2_u_zero based on algorithms implemented by SpiderMonkey and v8.

ngzhian · 2023-03-23T20:58:18Z

test/core/relaxed-simd/i32x4_relaxed_trunc.wast


 (assert_return (invoke "i32x4.relaxed_trunc_f32x4_u"
                       (v128.const f32x4 nan -nan nan:0x444444 -nan:0x444444))
               ;; nans -> 0 or UINT32_MAX
               (either (v128.const i32x4 0 0 0 0)
-                       (v128.const i32x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff)))
+                       (v128.const i32x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff)
+                       (v128.const i32x4 0x80000000 0x80000000 0x80000000 0x80000000)))


0x800000 comes from VCVTTPS2UDQ?
Think will be nice to comment here to say which cases lead to which results.
I should have documented it more properly in the first place.

I thought about that but I cannot refer the algorithm/recipe by name, or issue that defines these. I'll just refer SM and V8 as source.

actually it is different on V8, it looks like c0000000, c0000000, c4444400, c4444400. SM has 0x8000000s

What is SM's implementation of this?
V8 is https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.h;l=769;drc=2450f2f5d0ce0da9b8cf493c533f9528ff17bab6

ngzhian · 2023-03-23T21:07:39Z

test/core/relaxed-simd/i32x4_relaxed_trunc.wast

@@ -73,7 +75,8 @@
                       (v128.const f64x2 -1.0 4294967296.0))
               ;; out of range -> saturate or UINT32_MAX
               (either (v128.const i32x4 0 0xffffffff 0 0)
-                       (v128.const i32x4 0xffffffff 0xffffffff 0 0)))
+                       (v128.const i32x4 0xffffffff 0xffffffff 0 0)
+                       (v128.const i32x4 0xfffffffe 0 0 0)))


How do we get 0xfffffffe? This isn't valid, it should only be UINT32_MAX or saturate (based on #21)

IIRC when I was stepping in the debugger: 0xfffffffe comes from addpd after -1.0 + 4503599627370496.0

Is the addpd from implementing this relaxed instruction using the simd128 one? (i32x4.trunc_sat_f64x2_s_zero)

V8 (turbofan?) generates:

0x9923509f796 16 c4e37909c00b vroundpd xmm0,xmm0,0xb 0x9923509f79c 1c 49ba90e7fa2001000000 REX.W movq r10,0x120fae790 0x9923509f7a6 26 c4c1795802 vaddpd xmm0,xmm0,[r10] 0x9923509f7ab 2b c4c178c6c788 vshufps xmm0,xmm0,xmm15,0x88

(notice that liftoff generates trunc_sat_f64x2_s_zero code, not relaxed)

I guess that implementation comes from WebAssembly/simd#383 "x86/x86-64 processors with AVX instruction set".
That codegen should return a 0. If it doesn't then maybe there's a bug in the generated code i think.

Using the trunc_sat instruction to implement the relaxed instruction is always correct (for all the relaxed instructions). This is so that in deterministic mode, we can fall back to SIMD128 instructions.

yurydelendik · 2023-03-23T21:23:51Z

I wonder if it will be easier to (formally?) define algorithms and then use that as a base.

yurydelendik · 2023-03-24T16:15:01Z

More analysis, V8 and SpiderMonkey somewhat similar algorithms for i32x4.relaxed_trunc_f32x4_u :

i32x4.relaxed_trunc_f32x4_u (v8):

0xed74b2397a0    20  c44178c23a01         vcmpps xmm15,xmm0,[r10], (lt)
0xed74b2397a8    28  c501dbf8             vpand xmm15,xmm15,xmm0
0xed74b2397ac    2c  c4c179efc7           vpxor xmm0,xmm0,xmm15
; xmm0 keeps all values that are NaNs or >= 2^31, xmm15 the rest
0xed74b2397b1    31  c4417a5bff           vcvttps2dq xmm15,xmm15
0xed74b2397b6    36  c5f858c0             vaddps xmm0,xmm0,xmm0
0xed74b2397ba    3a  c5f972f008           vpslld xmm0,xmm0,8
0xed74b2397bf    3f  c4c179fec7           vpaddd xmm0,xmm0,xmm15

i32x4.relaxed_trunc_f32x4_u (sm):

0000001A  44 0f 28 3d 2e 00 00 00               movapsx 0x0000000000000050, %xmm15
00000022  44 0f c2 f8 01                        cmpps $0x01, %xmm0, %xmm15
00000027  66 44 0f db f8                        pand %xmm0, %xmm15
0000002C  66 41 0f ef c7                        pxor %xmm15, %xmm0
; xmm15 keeps all values that are non-NaNs and >= 2^31, xmm0 the rest
00000031  c5 fa 5b c0                           vcvttps2dq %xmm0, %xmm0
00000035  45 0f 58 ff                           addps %xmm15, %xmm15
00000039  66 41 0f 72 f7 08                     pslld $0x08, %xmm15
0000003F  66 41 0f fe c7                        paddd %xmm15, %xmm0
00

The i32x4.relaxed_trunc_f64x2_u_zero pretty much identical:

i32x4.relaxed_trunc_f64x2_u_zero (v8):

0x301da9b00796    16  c4e37909c00b         vroundpd xmm0,xmm0,0xb
0x301da9b0079c    1c  49bab0b7d50e01000000 REX.W movq r10,0x10ed5b7b0
0x301da9b007a6    26  c4c1795802           vaddpd xmm0,xmm0,[r10]
0x301da9b007ab    2b  c4c178c6c788         vshufps xmm0,xmm0,xmm15,0x88

i32x4.relaxed_trunc_f64x2_u_zero (sm):

00000019  c4 e3 79 09 c0 0b                     vroundpd $0x0B, %xmm0, %xmm0
0000001F  44 0f 28 3d 29 00 00 00               movapsx 0x0000000000000050, %xmm15
00000027  66 41 0f 58 c7                        addpd %xmm15, %xmm0
0000002C  41 0f c6 c7 88                        shufps $0x88, %xmm15, %xmm0

The 0xFFFFFFFE comes from float64 add operation:

> new Float64Array([-1 + 4503599627370496]).buffer
ArrayBuffer {
  [Uint8Contents]: <fe ff ff ff ff ff 2f 43>,
  byteLength: 8
}

Additional variants for i32x4.relaxed_trunc_f32x4_u and i32x4.relaxed_trunc_f64x2_u_zero based on algorithms implemented by SpiderMonkey and v8.

ngzhian · 2023-03-24T17:09:50Z

For i32x4.relaxed_trunc_f64x2_u_zero, are we missing some instructions?

Marat's suggested codegen is:

VXORPD xmm_tmp, xmm_tmp, xmm_tmp
VMAXPD xmm_y, xmm_x, xmm_tmp
VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
VROUNDPD xmm_y, xmm_y, 0x0B
VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
VSHUFPS xmm_y, xmm_y, xmm_xmp, 0x88

I don't see xorpd, vmaxpd, vminpd, in your analysis.

The vmaxpd should get rid of the -1, so you won't get -1 + 4503599627370496

yurydelendik · 2023-03-24T17:34:07Z

I don't see xorpd, vmaxpd, vminpd, in your analysis.

I'm reading https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.h;l=750;drc=2450f2f5d0ce0da9b8cf493c533f9528ff17bab6

ngzhian · 2023-03-24T17:43:01Z

Oh yes, thanks for the pointer. Was looking at the saturated one.
I think that algorithm is wrong. The way we spec i32x4.trunc_sat_f64x2_u_zero, it should either be 0 or 0xFFFFFF for out of range or NaNs, this is the AVX512F instruction VCVTTPD2UDQ (and FCVTZU + UQXTN on AArch64). Otherwise, it should fallback to SIMD trunc+saturate semantics.
If we want to use that algorithm, we would have to change the allowed list of values.
@Maratyszcza wdyt?

ngzhian · 2023-06-02T17:14:06Z

The algorithm used in V8 is wrong, it should fall back to SIMD trunc + saturate semantics. So really, the implementation is the same pre AVX512F.
@dtig https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.h;l=750;drc=2450f2f5d0ce0da9b8cf493c533f9528ff17bab6 will need to use the SIMD trunc implementation.
@Maratyszcza fyi

Edit: had offline discussion with Marat, he prefers to add allow these constants, as long as they don't open up more fingerprinting.

ngzhian · 2023-06-05T20:37:46Z

Let's wait for #144 to land (spec changes) then we can merge this.

ngzhian reviewed Mar 23, 2023

View reviewed changes

ngzhian requested changes Mar 23, 2023

View reviewed changes

yurydelendik force-pushed the i32-trunc-tests branch from c8f02b4 to e7a2158 Compare March 23, 2023 21:08

Alternative output for i32x4_relaxed_trunc.wast tests

0f40696

Additional variants for i32x4.relaxed_trunc_f32x4_u and i32x4.relaxed_trunc_f64x2_u_zero based on algorithms implemented by SpiderMonkey and v8.

yurydelendik force-pushed the i32-trunc-tests branch from e7a2158 to 0f40696 Compare March 24, 2023 16:21

yurydelendik marked this pull request as draft March 24, 2023 20:36

ngzhian mentioned this pull request Jun 5, 2023

Add more allowed return values for relaxed trunc unsigned #144

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative output for i32x4_relaxed_trunc.wast tests #140

Alternative output for i32x4_relaxed_trunc.wast tests #140

yurydelendik commented Mar 23, 2023

ngzhian Mar 23, 2023

yurydelendik Mar 23, 2023

yurydelendik Mar 23, 2023

ngzhian Jun 2, 2023

ngzhian Mar 23, 2023

yurydelendik Mar 23, 2023

ngzhian Mar 23, 2023

yurydelendik Mar 23, 2023 •

edited

Loading

ngzhian Mar 23, 2023

yurydelendik commented Mar 23, 2023

yurydelendik commented Mar 24, 2023

ngzhian commented Mar 24, 2023

yurydelendik commented Mar 24, 2023 •

edited

Loading

ngzhian commented Mar 24, 2023 •

edited

Loading

ngzhian commented Jun 2, 2023 •

edited

Loading

ngzhian commented Jun 5, 2023

Alternative output for i32x4_relaxed_trunc.wast tests #140

Are you sure you want to change the base?

Alternative output for i32x4_relaxed_trunc.wast tests #140

Conversation

yurydelendik commented Mar 23, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yurydelendik Mar 23, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yurydelendik commented Mar 23, 2023

yurydelendik commented Mar 24, 2023

ngzhian commented Mar 24, 2023

yurydelendik commented Mar 24, 2023 • edited Loading

ngzhian commented Mar 24, 2023 • edited Loading

ngzhian commented Jun 2, 2023 • edited Loading

ngzhian commented Jun 5, 2023

yurydelendik Mar 23, 2023 •

edited

Loading

yurydelendik commented Mar 24, 2023 •

edited

Loading

ngzhian commented Mar 24, 2023 •

edited

Loading

ngzhian commented Jun 2, 2023 •

edited

Loading