Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative output for i32x4_relaxed_trunc.wast tests #140

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yurydelendik
Copy link
Contributor

Additional variants for i32x4.relaxed_trunc_f32x4_u and i32x4.relaxed_trunc_f64x2_u_zero based on algorithms implemented by SpiderMonkey and v8.


(assert_return (invoke "i32x4.relaxed_trunc_f32x4_u"
(v128.const f32x4 nan -nan nan:0x444444 -nan:0x444444))
;; nans -> 0 or UINT32_MAX
(either (v128.const i32x4 0 0 0 0)
(v128.const i32x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff)))
(v128.const i32x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff)
(v128.const i32x4 0x80000000 0x80000000 0x80000000 0x80000000)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x800000 comes from VCVTTPS2UDQ?
Think will be nice to comment here to say which cases lead to which results.
I should have documented it more properly in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that but I cannot refer the algorithm/recipe by name, or issue that defines these. I'll just refer SM and V8 as source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it is different on V8, it looks like c0000000, c0000000, c4444400, c4444400. SM has 0x8000000s

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -73,7 +75,8 @@
(v128.const f64x2 -1.0 4294967296.0))
;; out of range -> saturate or UINT32_MAX
(either (v128.const i32x4 0 0xffffffff 0 0)
(v128.const i32x4 0xffffffff 0xffffffff 0 0)))
(v128.const i32x4 0xffffffff 0xffffffff 0 0)
(v128.const i32x4 0xfffffffe 0 0 0)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we get 0xfffffffe? This isn't valid, it should only be UINT32_MAX or saturate (based on #21)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC when I was stepping in the debugger: 0xfffffffe comes from addpd after -1.0 + 4503599627370496.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the addpd from implementing this relaxed instruction using the simd128 one? (i32x4.trunc_sat_f64x2_s_zero)

Copy link
Contributor Author

@yurydelendik yurydelendik Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V8 (turbofan?) generates:

0x9923509f796    16  c4e37909c00b         vroundpd xmm0,xmm0,0xb
0x9923509f79c    1c  49ba90e7fa2001000000 REX.W movq r10,0x120fae790
0x9923509f7a6    26  c4c1795802           vaddpd xmm0,xmm0,[r10]
0x9923509f7ab    2b  c4c178c6c788         vshufps xmm0,xmm0,xmm15,0x88

(notice that liftoff generates trunc_sat_f64x2_s_zero code, not relaxed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that implementation comes from WebAssembly/simd#383 "x86/x86-64 processors with AVX instruction set".
That codegen should return a 0. If it doesn't then maybe there's a bug in the generated code i think.

Using the trunc_sat instruction to implement the relaxed instruction is always correct (for all the relaxed instructions). This is so that in deterministic mode, we can fall back to SIMD128 instructions.

@yurydelendik
Copy link
Contributor Author

I wonder if it will be easier to (formally?) define algorithms and then use that as a base.

@yurydelendik
Copy link
Contributor Author

More analysis, V8 and SpiderMonkey somewhat similar algorithms for i32x4.relaxed_trunc_f32x4_u :

i32x4.relaxed_trunc_f32x4_u (v8):

0xed74b2397a0    20  c44178c23a01         vcmpps xmm15,xmm0,[r10], (lt)
0xed74b2397a8    28  c501dbf8             vpand xmm15,xmm15,xmm0
0xed74b2397ac    2c  c4c179efc7           vpxor xmm0,xmm0,xmm15
; xmm0 keeps all values that are NaNs or >= 2^31, xmm15 the rest
0xed74b2397b1    31  c4417a5bff           vcvttps2dq xmm15,xmm15
0xed74b2397b6    36  c5f858c0             vaddps xmm0,xmm0,xmm0
0xed74b2397ba    3a  c5f972f008           vpslld xmm0,xmm0,8
0xed74b2397bf    3f  c4c179fec7           vpaddd xmm0,xmm0,xmm15

i32x4.relaxed_trunc_f32x4_u (sm):

0000001A  44 0f 28 3d 2e 00 00 00               movapsx 0x0000000000000050, %xmm15
00000022  44 0f c2 f8 01                        cmpps $0x01, %xmm0, %xmm15
00000027  66 44 0f db f8                        pand %xmm0, %xmm15
0000002C  66 41 0f ef c7                        pxor %xmm15, %xmm0
; xmm15 keeps all values that are non-NaNs and >= 2^31, xmm0 the rest
00000031  c5 fa 5b c0                           vcvttps2dq %xmm0, %xmm0
00000035  45 0f 58 ff                           addps %xmm15, %xmm15
00000039  66 41 0f 72 f7 08                     pslld $0x08, %xmm15
0000003F  66 41 0f fe c7                        paddd %xmm15, %xmm0
00

The i32x4.relaxed_trunc_f64x2_u_zero pretty much identical:

i32x4.relaxed_trunc_f64x2_u_zero (v8):

0x301da9b00796    16  c4e37909c00b         vroundpd xmm0,xmm0,0xb
0x301da9b0079c    1c  49bab0b7d50e01000000 REX.W movq r10,0x10ed5b7b0
0x301da9b007a6    26  c4c1795802           vaddpd xmm0,xmm0,[r10]
0x301da9b007ab    2b  c4c178c6c788         vshufps xmm0,xmm0,xmm15,0x88

i32x4.relaxed_trunc_f64x2_u_zero (sm):

00000019  c4 e3 79 09 c0 0b                     vroundpd $0x0B, %xmm0, %xmm0
0000001F  44 0f 28 3d 29 00 00 00               movapsx 0x0000000000000050, %xmm15
00000027  66 41 0f 58 c7                        addpd %xmm15, %xmm0
0000002C  41 0f c6 c7 88                        shufps $0x88, %xmm15, %xmm0

The 0xFFFFFFFE comes from float64 add operation:

> new Float64Array([-1 + 4503599627370496]).buffer
ArrayBuffer {
  [Uint8Contents]: <fe ff ff ff ff ff 2f 43>,
  byteLength: 8
}

Additional variants for i32x4.relaxed_trunc_f32x4_u and i32x4.relaxed_trunc_f64x2_u_zero
based on algorithms implemented by SpiderMonkey and v8.
@ngzhian
Copy link
Member

ngzhian commented Mar 24, 2023

For i32x4.relaxed_trunc_f64x2_u_zero, are we missing some instructions?

Marat's suggested codegen is:

VXORPD xmm_tmp, xmm_tmp, xmm_tmp
VMAXPD xmm_y, xmm_x, xmm_tmp
VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
VROUNDPD xmm_y, xmm_y, 0x0B
VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
VSHUFPS xmm_y, xmm_y, xmm_xmp, 0x88

I don't see xorpd, vmaxpd, vminpd, in your analysis.

The vmaxpd should get rid of the -1, so you won't get -1 + 4503599627370496

@yurydelendik
Copy link
Contributor Author

yurydelendik commented Mar 24, 2023

@ngzhian
Copy link
Member

ngzhian commented Mar 24, 2023

Oh yes, thanks for the pointer. Was looking at the saturated one.
I think that algorithm is wrong. The way we spec i32x4.trunc_sat_f64x2_u_zero, it should either be 0 or 0xFFFFFF for out of range or NaNs, this is the AVX512F instruction VCVTTPD2UDQ (and FCVTZU + UQXTN on AArch64). Otherwise, it should fallback to SIMD trunc+saturate semantics.
If we want to use that algorithm, we would have to change the allowed list of values.
@Maratyszcza wdyt?

@yurydelendik yurydelendik marked this pull request as draft March 24, 2023 20:36
@ngzhian
Copy link
Member

ngzhian commented Jun 2, 2023

The algorithm used in V8 is wrong, it should fall back to SIMD trunc + saturate semantics. So really, the implementation is the same pre AVX512F.
@dtig https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.h;l=750;drc=2450f2f5d0ce0da9b8cf493c533f9528ff17bab6 will need to use the SIMD trunc implementation.
@Maratyszcza fyi

Edit: had offline discussion with Marat, he prefers to add allow these constants, as long as they don't open up more fingerprinting.

@ngzhian
Copy link
Member

ngzhian commented Jun 5, 2023

Let's wait for #144 to land (spec changes) then we can merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants