-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Mono] Add amd64 intrinsics for Vector128 Abs #97024
Conversation
src/mono/mono/mini/cpu-amd64.mdesc
Outdated
@@ -843,6 +843,7 @@ ssse3_shuffle: dest:x src1:x src2:x len:6 clob:1 | |||
sse41_dpps_imm: dest:x src1:x src2:x len:7 clob:1 | |||
sse41_dppd_imm: dest:x src1:x src2:x len:7 clob:1 | |||
vector_andnot: dest:x src1:x src2:x len:7 clob:1 | |||
vector_integer_abs: dest:x src1:x len:16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this len:16
? Shouldn't it in the worst case be len:8
since its a 3-byte movups
+ the 5-byte pabs
instruction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since its unary (and therefore not RMW
) wouldn't it just be len:5
? Does it need to specify clob:1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did put 16 as I did not know what value to put there and was advised that it should be sufficient to put a large number there for now. Could you share how to calculate the max len correctly?
I think it doesn't need to specify clob:1
as it only writes to the destination register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max length is based on the instruction encoding, its only used to calculate x86 short branches etc. so its ok to overstate it, esp. for rarely used opcodes. Here, the encoding is emit_sse_reg_reg_op4, which has a length of 6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in case of emit_sse_reg_reg_op4 it is 4 bytes for instruction and 1 byte per register?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vargaz Could you explain how it's 6?
For something like vector_andnot
it's specified as 7
today:
- You have a 3 byte opcode (0x66, 0x0F, 0xDF) + the modr/m byte, giving it a 4-byte base encoding
- If you use an extended register that's +1 additional for the prefix
- If you use an extended a non-basic addressing mode that's +1 additional for the SIB byte
- If you use a displacement that's +4 bytes.
So the actual encoding is 4-10 bytes, depending on what can or can't be included. Since its RMW (the destination is also an input operand) you may also need to emit a movups
/movaps
/movdqu
and that may or may not need to be included.
So, I'd expect vector_andnot
to be specified as one of:
4
(base encoding)5
(base encoding + extended register) -or- (base encoding + complex addressing mode)6
(base encoding + extended register + complex addressing mode)9
(base encoding + complex addressing mode + displacement)10
(base encoding + extended register + complex addressing mode + displacement)- potentially higher if it also factors in RMW handling
- none of this factors in using the VEX encoding, which I don't believe is supported by Mono today
So I'd expect pabs*
to follow a similar model, whatever that is above, and effectively be 1
more than vector_andnot
(since it has a 5 byte base encoding), and only be lesser if RMW handling is factored into the length, since pabs
doesn't need it.
The clob:1
setting looks to indicate it clobbers the first source register, which I don't follow how that works since the dest == source
for RMW instructions (when not using the VEX encoding) and you never have to clobber it, it should always have its original value preserved unless its last use, in which case clobbering doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its emitted using this macro:
#define emit_sse_reg_reg_op4_size(inst,dreg,reg,op1,op2,op3,op4,size) do {
amd64_codegen_pre(inst);
x86_prefix((inst), (unsigned char)(op1));
amd64_emit_rex ((inst), size, (dreg), 0, (reg));
*(inst)++ = (unsigned char)(op2);
*(inst)++ = (unsigned char)(op3);
*(inst)++ = (unsigned char)(op4);
x86_reg_emit ((inst), (dreg), (reg));
amd64_codegen_post(inst);
} while (0)
Which afaiks is 6 bytes.
clob:1 is used for the 2 operand x86 binary operation instructions where dreg needs to be equal to sreg1. It tells the register allocator to allocate both to the same hw register. So it doesn't really clobber anything.
src/mono/mono/mini/simd-intrinsics.c
Outdated
MonoInst *zero = emit_xzero (cfg, klass); | ||
MonoInst *neg = emit_simd_ins (cfg, klass, OP_XBINOP, zero->dreg, args [0]->dreg); | ||
neg->inst_c0 = OP_ISUB; | ||
neg->inst_c1 = MONO_TYPE_I8; | ||
MonoInst *ins = emit_simd_ins (cfg, klass, OP_XBINOP, args [0]->dreg, neg->dreg); | ||
ins->inst_c0 = OP_IMAX; | ||
ins->inst_c1 = MONO_TYPE_I8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being handled correctly on the backend?
There isn't a vector max for i64
until AVX512F (at which point there is also pabsq
available) so this OP_IMAX
has to have some fallback to emulate with a CompareGreaterThan
and then a ConditionalSelect
.
pcmpgtq
is itself only available on SSE4.2 or later, so it also has to be emulated as well on downlevel hardware, doing something like: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L21409-L21459
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runtime/src/mono/mono/mini/mini-amd64.c
Lines 3738 to 3767 in af87439
static void | |
emit_simd_max_op (MonoCompile *cfg, MonoBasicBlock *bb, MonoInst *ins, int type, int dreg, int sreg1, int sreg2) | |
{ | |
MonoInst *temp; | |
gboolean is64BitNativeInt = FALSE; | |
#if TARGET_SIZEOF_VOID_P == 8 | |
is64BitNativeInt = ins->inst_c1 == MONO_TYPE_I || ins->inst_c1 == MONO_TYPE_U; | |
#endif | |
if (type == MONO_TYPE_I2 || type == MONO_TYPE_U2) { | |
// SSE2, so always available | |
NEW_SIMD_INS (cfg, ins, temp, simd_type_to_max_op (type), dreg, sreg1, sreg2); | |
} else if (!mono_hwcap_x86_has_sse41 || type == MONO_TYPE_I8 || type == MONO_TYPE_U8 || is64BitNativeInt) { | |
// Decompose to t = (s1 > s2), d = (s1 & t) | (s2 & !t) | |
int temp_t = mono_alloc_ireg (cfg); | |
int temp_d1 = mono_alloc_ireg (cfg); | |
int temp_d2 = mono_alloc_ireg (cfg); | |
if (type == MONO_TYPE_U8 || type == MONO_TYPE_U4 || type == MONO_TYPE_U1) | |
emit_simd_gt_un_op (cfg, bb, ins, type, temp_t, sreg1, sreg2); | |
else | |
emit_simd_gt_op (cfg, bb, ins, type, temp_t, sreg1, sreg2); | |
NEW_SIMD_INS (cfg, ins, temp, OP_PAND, temp_d1, temp_t, sreg1); | |
NEW_SIMD_INS (cfg, ins, temp, OP_PANDN, temp_d2, temp_t, sreg2); | |
NEW_SIMD_INS (cfg, ins, temp, OP_POR, dreg, temp_d1, temp_d2); | |
} else { | |
// SSE 4.1 has byte- and dword- operations | |
NEW_SIMD_INS (cfg, ins, temp, simd_type_to_max_op (type), dreg, sreg1, sreg2); | |
} | |
} |
I think this is handled here
src/mono/mono/mini/simd-intrinsics.c
Outdated
if (is_SIMD_feature_supported (cfg, MONO_CPU_X86_SSSE3)) | ||
return emit_simd_ins_for_sig (cfg, klass, OP_VECTOR_IABS, -1, arg0_type, fsig, args); | ||
|
||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why return NULL
instead of just sharing the emulation fallback?
That is, you could do this as the following so it also works on SSE2 based hardware:
if ((arg0_type != MONO_TYPE_I8) && is_SIMD_feature_supported (cfg, MONO_CPU_X86_SSSE3))
return emit_simd_ins_for_sig (cfg, klass, OP_VECTOR_IABS, -1, arg0_type, fsig, args);
MonoInst *zero = emit_xzero (cfg, klass);
MonoInst *neg = emit_simd_ins (cfg, klass, OP_XBINOP, zero->dreg, args [0]->dreg);
neg->inst_c0 = OP_ISUB;
neg->inst_c1 = arg0_type;
MonoInst *ins = emit_simd_ins (cfg, klass, OP_XBINOP, args [0]->dreg, neg->dreg);
ins->inst_c0 = OP_IMAX;
ins->inst_c1 = arg0_type;
return ins;
That is roughly how we do it in RyuJIT: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19876-L19896
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall! Please address the feedback from Tanner before merging this PR.
@@ -1194,6 +1194,11 @@ typedef union { | |||
#define amd64_sse_phaddw_reg_reg(inst, dreg, sreg) emit_sse_reg_reg_op4((inst), (dreg), (sreg), 0x66, 0x0f, 0x38, 0x01) | |||
#define amd64_sse_phaddd_reg_reg(inst, dreg, sreg) emit_sse_reg_reg_op4((inst), (dreg), (sreg), 0x66, 0x0f, 0x38, 0x02) | |||
#define amd64_sse_blendpd_reg_reg(inst,dreg,sreg,imm) emit_sse_reg_reg_op4_imm((inst), (dreg), (sreg), 0x66, 0x0f, 0x3a, 0x0d, (imm)) | |||
|
|||
#define amd64_sse_pabsb_reg_reg(inst, dreg, reg) emit_sse_reg_reg_op4((inst), (dreg), (reg), 0x66, 0x0f, 0x38, 0x1c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I woud prefer amd64_ssse3_pabsb_reg_reg
, to provide more information about these 3 intrinsics.
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
The only thing I'm still not really sure about is vector_integer_abs: dest:x src1:x len:6
, as I'm still not entirely sure why its 6
. I gave an explanation above on how the actual byte encoding for x86/x64 works, but its not clear how that relates to what Mono is tracking and why vector_integer_abs
would be smaller than vector_andnot
(considering its base encoding size is 1-byte larger)
The 'len' field is a max length used by the JIT to allocate memory, compute whenever it can use short branches, etc. So it doesn't have to be precise, and some instructions have lengths which are too big, but its not a problem in practice. |
* [Mono] Add amd64 intrinsics for Vector128 Abs * added i64 pointer support * applied review suggestions * fixed types in non ssse3 fallback
Should fix dotnet/perf-autofiling-issues#25809.