
[Mono] Add amd64 intrinsics for Vector128 Abs #97024

Merged: 4 commits merged into dotnet:main on Jan 17, 2024
Conversation

jkurdek (Member) commented Jan 16, 2024

@@ -843,6 +843,7 @@ ssse3_shuffle: dest:x src1:x src2:x len:6 clob:1
sse41_dpps_imm: dest:x src1:x src2:x len:7 clob:1
sse41_dppd_imm: dest:x src1:x src2:x len:7 clob:1
vector_andnot: dest:x src1:x src2:x len:7 clob:1
vector_integer_abs: dest:x src1:x len:16
Member:
Why is this len:16? Shouldn't the worst case be len:8, since it's a 3-byte movups + the 5-byte pabs instruction?

Member:

Actually, since it's unary (and therefore not RMW), wouldn't it just be len:5? Does it need to specify clob:1?

Member Author:

I put 16 because I did not know what value to use, and was advised that a large number should be sufficient for now. Could you share how to calculate the max len correctly?

I think it doesn't need to specify clob:1, as it only writes to the destination register.

Contributor:

The max length is based on the instruction encoding; it's only used to calculate x86 short branches etc., so it's OK to overstate it, especially for rarely used opcodes. Here, the encoding is emit_sse_reg_reg_op4, which has a length of 6.

Member Author:

So in the case of emit_sse_reg_reg_op4, is it 4 bytes for the instruction plus 1 byte per register?

Member:

@vargaz Could you explain how it's 6?

For something like vector_andnot it's specified as 7 today:

  • You have a 3-byte opcode (0x66, 0x0F, 0xDF) + the ModR/M byte, giving it a 4-byte base encoding
  • If you use an extended register, that's +1 byte for the REX prefix
  • If you use a non-basic addressing mode, that's +1 byte for the SIB byte
  • If you use a displacement, that's +4 bytes

So the actual encoding is 4-10 bytes, depending on what can or can't be included. Since it's RMW (the destination is also an input operand), you may also need to emit a movups/movaps/movdqu, and that may or may not need to be included.

So, I'd expect vector_andnot to be specified as one of:

  • 4 (base encoding)
  • 5 (base encoding + extended register) -or- (base encoding + complex addressing mode)
  • 6 (base encoding + extended register + complex addressing mode)
  • 9 (base encoding + complex addressing mode + displacement)
  • 10 (base encoding + extended register + complex addressing mode + displacement)
    • potentially higher if it also factors in RMW handling
    • none of this factors in using the VEX encoding, which I don't believe is supported by Mono today

So I'd expect pabs* to follow a similar model, whatever that is above, and effectively be 1 more than vector_andnot (since it has a 5-byte base encoding), and only be smaller if RMW handling is factored into the length, since pabs doesn't need it.

The clob:1 setting looks to indicate that it clobbers the first source register, which I don't follow: dest == source for RMW instructions (when not using the VEX encoding), so you never have to clobber it. It should always have its original value preserved unless this is its last use, in which case clobbering doesn't matter.

Contributor:

It's emitted using this macro:

#define emit_sse_reg_reg_op4_size(inst,dreg,reg,op1,op2,op3,op4,size) do { \
        amd64_codegen_pre(inst); \
        x86_prefix((inst), (unsigned char)(op1)); \
        amd64_emit_rex ((inst), size, (dreg), 0, (reg)); \
        *(inst)++ = (unsigned char)(op2); \
        *(inst)++ = (unsigned char)(op3); \
        *(inst)++ = (unsigned char)(op4); \
        x86_reg_emit ((inst), (dreg), (reg)); \
        amd64_codegen_post(inst); \
    } while (0)

Which, as far as I can see, is 6 bytes.

clob:1 is used for the two-operand x86 binary instructions where dreg needs to be equal to sreg1. It tells the register allocator to allocate both to the same hardware register, so it doesn't really clobber anything.

Comment on lines 1483 to 1489
MonoInst *zero = emit_xzero (cfg, klass);
MonoInst *neg = emit_simd_ins (cfg, klass, OP_XBINOP, zero->dreg, args [0]->dreg);
neg->inst_c0 = OP_ISUB;
neg->inst_c1 = MONO_TYPE_I8;
MonoInst *ins = emit_simd_ins (cfg, klass, OP_XBINOP, args [0]->dreg, neg->dreg);
ins->inst_c0 = OP_IMAX;
ins->inst_c1 = MONO_TYPE_I8;
Member:

Is this being handled correctly on the backend?

There isn't a vector max for i64 until AVX-512F (at which point pabsq is also available), so this OP_IMAX has to have some fallback that emulates it with a CompareGreaterThan and then a ConditionalSelect.

pcmpgtq itself is only available on SSE4.2 or later, so it also has to be emulated on downlevel hardware, doing something like: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L21409-L21459

Member Author:

static void
emit_simd_max_op (MonoCompile *cfg, MonoBasicBlock *bb, MonoInst *ins, int type, int dreg, int sreg1, int sreg2)
{
	MonoInst *temp;
	gboolean is64BitNativeInt = FALSE;
#if TARGET_SIZEOF_VOID_P == 8
	is64BitNativeInt = ins->inst_c1 == MONO_TYPE_I || ins->inst_c1 == MONO_TYPE_U;
#endif
	if (type == MONO_TYPE_I2 || type == MONO_TYPE_U2) {
		// SSE2, so always available
		NEW_SIMD_INS (cfg, ins, temp, simd_type_to_max_op (type), dreg, sreg1, sreg2);
	} else if (!mono_hwcap_x86_has_sse41 || type == MONO_TYPE_I8 || type == MONO_TYPE_U8 || is64BitNativeInt) {
		// Decompose to t = (s1 > s2), d = (s1 & t) | (s2 & !t)
		int temp_t = mono_alloc_ireg (cfg);
		int temp_d1 = mono_alloc_ireg (cfg);
		int temp_d2 = mono_alloc_ireg (cfg);
		if (type == MONO_TYPE_U8 || type == MONO_TYPE_U4 || type == MONO_TYPE_U1)
			emit_simd_gt_un_op (cfg, bb, ins, type, temp_t, sreg1, sreg2);
		else
			emit_simd_gt_op (cfg, bb, ins, type, temp_t, sreg1, sreg2);
		NEW_SIMD_INS (cfg, ins, temp, OP_PAND, temp_d1, temp_t, sreg1);
		NEW_SIMD_INS (cfg, ins, temp, OP_PANDN, temp_d2, temp_t, sreg2);
		NEW_SIMD_INS (cfg, ins, temp, OP_POR, dreg, temp_d1, temp_d2);
	} else {
		// SSE 4.1 has byte- and dword- operations
		NEW_SIMD_INS (cfg, ins, temp, simd_type_to_max_op (type), dreg, sreg1, sreg2);
	}
}

I think this is handled here:

if (is_SIMD_feature_supported (cfg, MONO_CPU_X86_SSSE3))
return emit_simd_ins_for_sig (cfg, klass, OP_VECTOR_IABS, -1, arg0_type, fsig, args);

return NULL;
tannergooding (Member) commented Jan 16, 2024

Why return NULL instead of just sharing the emulation fallback?

That is, you could do the following, so it also works on SSE2-based hardware:

if ((arg0_type != MONO_TYPE_I8) && is_SIMD_feature_supported (cfg, MONO_CPU_X86_SSSE3)) 
    return emit_simd_ins_for_sig (cfg, klass, OP_VECTOR_IABS, -1, arg0_type, fsig, args);

MonoInst *zero = emit_xzero (cfg, klass);

MonoInst *neg = emit_simd_ins (cfg, klass, OP_XBINOP, zero->dreg, args [0]->dreg);
neg->inst_c0 = OP_ISUB;
neg->inst_c1 = arg0_type;

MonoInst *ins = emit_simd_ins (cfg, klass, OP_XBINOP, args [0]->dreg, neg->dreg);
ins->inst_c0 = OP_IMAX;
ins->inst_c1 = arg0_type;

return ins;

That is roughly how we do it in RyuJIT: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19876-L19896

matouskozak (Member):
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).

fanyang-mono (Member) left a comment:

Looks good overall! Please address the feedback from Tanner before merging this PR.

@@ -1194,6 +1194,11 @@ typedef union {
#define amd64_sse_phaddw_reg_reg(inst, dreg, sreg) emit_sse_reg_reg_op4((inst), (dreg), (sreg), 0x66, 0x0f, 0x38, 0x01)
#define amd64_sse_phaddd_reg_reg(inst, dreg, sreg) emit_sse_reg_reg_op4((inst), (dreg), (sreg), 0x66, 0x0f, 0x38, 0x02)
#define amd64_sse_blendpd_reg_reg(inst,dreg,sreg,imm) emit_sse_reg_reg_op4_imm((inst), (dreg), (sreg), 0x66, 0x0f, 0x3a, 0x0d, (imm))

#define amd64_sse_pabsb_reg_reg(inst, dreg, reg) emit_sse_reg_reg_op4((inst), (dreg), (reg), 0x66, 0x0f, 0x38, 0x1c)
Member:

I would prefer amd64_ssse3_pabsb_reg_reg, to provide more information about these 3 intrinsics.

jkurdek (Member Author) commented Jan 17, 2024

/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).

tannergooding (Member) left a comment:

LGTM.

The only thing I'm still not really sure about is vector_integer_abs: dest:x src1:x len:6, as I'm still not entirely sure why it's 6. I gave an explanation above on how the actual byte encoding for x86/x64 works, but it's not clear how that relates to what Mono is tracking, or why vector_integer_abs would be smaller than vector_andnot (considering its base encoding is 1 byte larger).

vargaz (Contributor) commented Jan 17, 2024

> LGTM.
>
> The only thing I'm still not really sure about is vector_integer_abs: dest:x src1:x len:6, as I'm still not entirely sure why it's 6. I gave an explanation above on how the actual byte encoding for x86/x64 works, but it's not clear how that relates to what Mono is tracking, or why vector_integer_abs would be smaller than vector_andnot (considering its base encoding is 1 byte larger).

The 'len' field is a max length used by the JIT to allocate memory, compute whether it can use short branches, etc. So it doesn't have to be precise; some instructions have lengths which are too big, but it's not a problem in practice.

@jkurdek jkurdek merged commit 5ff11b1 into dotnet:main Jan 17, 2024
153 of 164 checks passed
tmds pushed a commit to tmds/runtime that referenced this pull request Jan 23, 2024
* [Mono] Add amd64 intrinsics for Vector128 Abs

* added i64 pointer support

* applied review suggestions

* fixed types in non ssse3 fallback
@github-actions github-actions bot locked and limited conversation to collaborators Feb 17, 2024
Successfully merging this pull request may close these issues.

[Perf] Linux/x64: 1 Regression on 12/6/2023 10:36:59 AM