
[Mono] Add amd64 intrinsics for Vector128 Abs #97024

Merged: 4 commits merged into dotnet:main on Jan 17, 2024
Conversation

jkurdek (Member) commented Jan 16, 2024

@@ -843,6 +843,7 @@ ssse3_shuffle: dest:x src1:x src2:x len:6 clob:1
sse41_dpps_imm: dest:x src1:x src2:x len:7 clob:1
sse41_dppd_imm: dest:x src1:x src2:x len:7 clob:1
vector_andnot: dest:x src1:x src2:x len:7 clob:1
vector_integer_abs: dest:x src1:x len:16
Member:
Why is this len:16? Shouldn't the worst case be len:8, since it's a 3-byte movups + the 5-byte pabs instruction?

Member:

Actually, since it's unary (and therefore not RMW), wouldn't it just be len:5? Does it need to specify clob:1?

Member Author:

I put 16 because I did not know what value to use, and was advised that a large number should be sufficient for now. Could you share how to calculate the max len correctly?

I think it doesn't need to specify clob:1, as it only writes to the destination register.

Contributor:

The max length is based on the instruction encoding; it's only used to calculate x86 short branches etc., so it's OK to overstate it, especially for rarely used opcodes. Here, the encoding is emit_sse_reg_reg_op4, which has a length of 6.

Member Author:

So in the case of emit_sse_reg_reg_op4, is it 4 bytes for the instruction plus 1 byte per register?

Member:

@vargaz Could you explain how it's 6?

For something like vector_andnot it's specified as 7 today:

  • You have a 3-byte opcode (0x66, 0x0F, 0xDF) + the ModR/M byte, giving it a 4-byte base encoding
  • If you use an extended register, that's +1 byte for the REX prefix
  • If you use a non-basic addressing mode, that's +1 byte for the SIB byte
  • If you use a displacement, that's +4 bytes

So the actual encoding is 4-10 bytes, depending on what can or can't be included. Since it's RMW (the destination is also an input operand), you may also need to emit a movups/movaps/movdqu, and that may or may not need to be included.

So, I'd expect vector_andnot to be specified as one of:

  • 4 (base encoding)
  • 5 (base encoding + extended register) -or- (base encoding + complex addressing mode)
  • 6 (base encoding + extended register + complex addressing mode)
  • 9 (base encoding + complex addressing mode + displacement)
  • 10 (base encoding + extended register + complex addressing mode + displacement)
    • potentially higher if it also factors in RMW handling
    • none of this factors in using the VEX encoding, which I don't believe is supported by Mono today

So I'd expect pabs* to follow a similar model, whatever that is above, and effectively be 1 more than vector_andnot (since it has a 5-byte base encoding), and only be smaller if RMW handling is factored into the length, since pabs doesn't need it.

The clob:1 setting looks to indicate that it clobbers the first source register, which I don't follow: dest == source for RMW instructions (when not using the VEX encoding), so you never have to clobber it. It should always have its original value preserved unless this is its last use, in which case clobbering doesn't matter.

Contributor:

It's emitted using this macro:

#define emit_sse_reg_reg_op4_size(inst,dreg,reg,op1,op2,op3,op4,size) do { \
        amd64_codegen_pre(inst); \
        x86_prefix((inst), (unsigned char)(op1)); \
        amd64_emit_rex ((inst), size, (dreg), 0, (reg)); \
        *(inst)++ = (unsigned char)(op2); \
        *(inst)++ = (unsigned char)(op3); \
        *(inst)++ = (unsigned char)(op4); \
        x86_reg_emit ((inst), (dreg), (reg)); \
        amd64_codegen_post(inst); \
    } while (0)

Which, as far as I can see, is 6 bytes.

clob:1 is used for the two-operand x86 binary instructions where dreg needs to be equal to sreg1. It tells the register allocator to allocate both to the same hardware register, so it doesn't really clobber anything.

Comment on lines 1483 to 1489
MonoInst *zero = emit_xzero (cfg, klass);
MonoInst *neg = emit_simd_ins (cfg, klass, OP_XBINOP, zero->dreg, args [0]->dreg);
neg->inst_c0 = OP_ISUB;
neg->inst_c1 = MONO_TYPE_I8;
MonoInst *ins = emit_simd_ins (cfg, klass, OP_XBINOP, args [0]->dreg, neg->dreg);
ins->inst_c0 = OP_IMAX;
ins->inst_c1 = MONO_TYPE_I8;
Member:

Is this being handled correctly on the backend?

There isn't a vector max for i64 until AVX-512F (at which point pabsq is also available), so this OP_IMAX has to have some fallback that emulates it with a CompareGreaterThan and then a ConditionalSelect.

pcmpgtq itself is only available on SSE4.2 or later, so it also has to be emulated on downlevel hardware, doing something like: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L21409-L21459

Member Author:

static void
emit_simd_max_op (MonoCompile *cfg, MonoBasicBlock *bb, MonoInst *ins, int type, int dreg, int sreg1, int sreg2)
{
	MonoInst *temp;
	gboolean is64BitNativeInt = FALSE;
#if TARGET_SIZEOF_VOID_P == 8
	is64BitNativeInt = ins->inst_c1 == MONO_TYPE_I || ins->inst_c1 == MONO_TYPE_U;
#endif
	if (type == MONO_TYPE_I2 || type == MONO_TYPE_U2) {
		// SSE2, so always available
		NEW_SIMD_INS (cfg, ins, temp, simd_type_to_max_op (type), dreg, sreg1, sreg2);
	} else if (!mono_hwcap_x86_has_sse41 || type == MONO_TYPE_I8 || type == MONO_TYPE_U8 || is64BitNativeInt) {
		// Decompose to t = (s1 > s2), d = (s1 & t) | (s2 & !t)
		int temp_t = mono_alloc_ireg (cfg);
		int temp_d1 = mono_alloc_ireg (cfg);
		int temp_d2 = mono_alloc_ireg (cfg);
		if (type == MONO_TYPE_U8 || type == MONO_TYPE_U4 || type == MONO_TYPE_U1)
			emit_simd_gt_un_op (cfg, bb, ins, type, temp_t, sreg1, sreg2);
		else
			emit_simd_gt_op (cfg, bb, ins, type, temp_t, sreg1, sreg2);
		NEW_SIMD_INS (cfg, ins, temp, OP_PAND, temp_d1, temp_t, sreg1);
		NEW_SIMD_INS (cfg, ins, temp, OP_PANDN, temp_d2, temp_t, sreg2);
		NEW_SIMD_INS (cfg, ins, temp, OP_POR, dreg, temp_d1, temp_d2);
	} else {
		// SSE 4.1 has byte- and dword- operations
		NEW_SIMD_INS (cfg, ins, temp, simd_type_to_max_op (type), dreg, sreg1, sreg2);
	}
}

I think this is handled here:

if (is_SIMD_feature_supported (cfg, MONO_CPU_X86_SSSE3))
return emit_simd_ins_for_sig (cfg, klass, OP_VECTOR_IABS, -1, arg0_type, fsig, args);

return NULL;
tannergooding (Member) commented Jan 16, 2024

Why return NULL instead of just sharing the emulation fallback?

That is, you could do the following, so it also works on SSE2-based hardware:

if ((arg0_type != MONO_TYPE_I8) && is_SIMD_feature_supported (cfg, MONO_CPU_X86_SSSE3)) 
    return emit_simd_ins_for_sig (cfg, klass, OP_VECTOR_IABS, -1, arg0_type, fsig, args);

MonoInst *zero = emit_xzero (cfg, klass);

MonoInst *neg = emit_simd_ins (cfg, klass, OP_XBINOP, zero->dreg, args [0]->dreg);
neg->inst_c0 = OP_ISUB;
neg->inst_c1 = arg0_type;

MonoInst *ins = emit_simd_ins (cfg, klass, OP_XBINOP, args [0]->dreg, neg->dreg);
ins->inst_c0 = OP_IMAX;
ins->inst_c1 = arg0_type;

return ins;

That is roughly how we do it in RyuJIT: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L19876-L19896

matouskozak (Member):
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).

fanyang-mono (Member) left a comment:

Looks good overall! Please address the feedback from Tanner before merging this PR.

@@ -1194,6 +1194,11 @@ typedef union {
#define amd64_sse_phaddw_reg_reg(inst, dreg, sreg) emit_sse_reg_reg_op4((inst), (dreg), (sreg), 0x66, 0x0f, 0x38, 0x01)
#define amd64_sse_phaddd_reg_reg(inst, dreg, sreg) emit_sse_reg_reg_op4((inst), (dreg), (sreg), 0x66, 0x0f, 0x38, 0x02)
#define amd64_sse_blendpd_reg_reg(inst,dreg,sreg,imm) emit_sse_reg_reg_op4_imm((inst), (dreg), (sreg), 0x66, 0x0f, 0x3a, 0x0d, (imm))

#define amd64_sse_pabsb_reg_reg(inst, dreg, reg) emit_sse_reg_reg_op4((inst), (dreg), (reg), 0x66, 0x0f, 0x38, 0x1c)
Member:

I would prefer amd64_ssse3_pabsb_reg_reg, to provide more information about these 3 intrinsics.

jkurdek (Member Author) commented Jan 17, 2024

/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).

tannergooding (Member) left a comment:

LGTM.

The only thing I'm still not really sure about is vector_integer_abs: dest:x src1:x len:6, as I'm still not entirely sure why it's 6. I gave an explanation above on how the actual byte encoding for x86/x64 works, but it's not clear how that relates to what Mono is tracking, or why vector_integer_abs would be smaller than vector_andnot (considering its base encoding is 1 byte larger).

vargaz (Contributor) commented Jan 17, 2024

> LGTM.
>
> The only thing I'm still not really sure about is vector_integer_abs: dest:x src1:x len:6, as I'm still not entirely sure why it's 6. I gave an explanation above on how the actual byte encoding for x86/x64 works, but it's not clear how that relates to what Mono is tracking, or why vector_integer_abs would be smaller than vector_andnot (considering its base encoding is 1 byte larger).

The 'len' field is a max length used by the JIT to allocate memory, compute whether it can use short branches, etc. So it doesn't have to be precise; some instructions have lengths which are too big, but it's not a problem in practice.

@jkurdek jkurdek merged commit 5ff11b1 into dotnet:main Jan 17, 2024
153 of 164 checks passed
tmds pushed a commit to tmds/runtime that referenced this pull request Jan 23, 2024
* [Mono] Add amd64 intrinsics for Vector128 Abs

* added i64 pointer support

* applied review suggestions

* fixed types in non ssse3 fallback
@github-actions github-actions bot locked and limited conversation to collaborators Feb 17, 2024
Successfully merging this pull request may close these issues.

[Perf] Linux/x64: 1 Regression on 12/6/2023 10:36:59 AM