-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM64-SVE: Implement IF_SVE_ED_1A, IF_SVE_EE_1A, IF_SVE_EB_1A, IF_SVE_EC_1A #97238
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsPart of #94549. These formats include some JitDisasm output:
cstool output:
Note that there are some diffs in the above outputs due to differences in how immediate values are printed:
cc @dotnet/arm64-contrib.
|
src/coreclr/jit/emitarm64.cpp
Outdated
case IF_SVE_EB_1A: // ........xx...... ..hiiiiiiiiddddd -- SVE broadcast integer immediate (unpredicated) | ||
switch (ins) | ||
{ | ||
// TODO-SVE: Why are these different? MOV is an alias for DUP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to mention this above, but should these PerfScore values be different? If so, if the instruction is INS_sve_dup
in emitIns_R_I
, we change it to INS_sve_mov
since that is the preferred disassembly. That means the INS_sve_dup
case is unreachable when determining PerfScores. I can change my logic so we set the instruction to INS_sve_dup
in emitIns_R_I
, and then print the instruction as a mov
in emitDispInsHelp
, but that will require introducing some edge case logic into the latter method, as we currently don't do anything special for printing aliased instructions.
If these PerfScores should be the same, then we can avoid that issue altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alias isn't changing the encoding at all. All the bits in the instruction are the same, it's just the way it's printed that changes.
Looking at the table here
Mov and dup exist in multiple rows:
Broadcast logical bitmask immediate to vector | DUPM, MOV | 2 | 2 | V | - |
---|---|---|---|---|---|
Duplicate, immediate and indexed form | DUP, MOV | 2 | 2 | V | - |
Duplicate, scalar form | DUP, MOV | 3 | 1 | M0 | - |
I think you should be using the first one (2,2) for all of IF_SVE_EB_1A.
Was this incorrect in the boiler plate autogenerated code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thank you for clarifying. Yes, the boilerplate code has different PerfScore values for mov
and dup
for the IF_SVE_EB_1A
format -- it looks like the generator tool mixed up the PerfScore values from different formats. I'll update this to use (2,2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like the generator tool mixed up the PerfScore values from different formats
@kunalspathak something to fix
Failures are unrelated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
case IF_SVE_EC_1A: // ........xx...... ..hiiiiiiiiddddd -- SVE integer add/subtract immediate (unpredicated) | ||
assert(insOptsScalableStandard(id->idInsOpt())); | ||
// Size specifier must be able to fit left-shifted immediate | ||
assert(insOptsScalableAtLeastHalf(id->idInsOpt()) || !id->idOptionalShift()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here.
case IF_SVE_EB_1A: // ........xx...... ..hiiiiiiiiddddd -- SVE broadcast integer immediate (unpredicated) | ||
assert(insOptsScalableStandard(id->idInsOpt())); | ||
// Size specifier must be able to fit left-shifted immediate | ||
assert(insOptsScalableAtLeastHalf(id->idInsOpt()) || !id->idOptionalShift()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if I follow why we need insOptsScalableAtLeastHalf(id->idInsOpt()
here? EB_1A
is either https://docsmirror.github.io/A64/2023-06/dup_z_i.html or https://docsmirror.github.io/A64/2023-06/mov_dup_z_i.html and they both take B
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct that B
is accepted. I believe if the immediate value is being left-shifted, the size specifier has to be at least 16 bits. The docs you linked say this: "The immediate operand is a signed value in the range -128 to +127, and for element widths of 16 bits or higher it may also be a signed multiple of 256 in the range -32768 to +32512 (excluding 0)."
cstool's behavior doesn't seem to match the documentation exactly, though. From my local testing, if the immediate is left-shifted but the specifier is B
, cstool refuses to parse the instruction, even if the immediate being shifted is 0. I wrote this assert to match the behavior of cstool, such that if id->idOptionalShift
is true, the size specifier cannot be B
. @a74nh does this sound correct to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should go with the arm specs. did you try validating it with LATE_DISASM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried it, and LATE_DISASM matches cstool's behavior, in that it refuses to decode the instruction if a left-shifted immediate (even 0) is used with the B
size specifier. I think I'm just misinterpreting the "(excluding 0)" part of the documentation linked above, and the assert is correct in disallowing the B
specifier to be used with shifted immediates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. Let's wait for @a74nh to comment on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct that B is accepted. I believe if the immediate value is being left-shifted, the size specifier has to be at least 16 bits. The docs you linked say this: "The immediate operand is a signed value in the range -128 to +127, and for element widths of 16 bits or higher it may also be a signed multiple of 256 in the range -32768 to +32512 (excluding 0)."
Yes. The optional shift is being used to specify the range -32768 to +32512. That range is only valid for widths of H
or wider. Therefore, it's not valid to specify a shift with size B
.
It also makes sense as a value shifted by 8 cannot fit into a B
, and so would always be 0.
if the immediate is left-shifted but the specifier is B, cstool refuses to parse the instruction, even if the immediate being shifted is 0.
That makes sense. It's still trying to left shift 0 by 8, and the documentation says left shift isn't valid for B
. At a microarchitecture level, it wouldn't to useful encode a special case just to allow that.
I'm happy with the asserts.
src/coreclr/jit/emitarm64.cpp
Outdated
@@ -6244,14 +6353,21 @@ void emitter::emitIns_R_I(instruction ins, | |||
assert(canEncode); | |||
assert(fmt != IF_NONE); | |||
|
|||
instrDesc* id = emitNewInstrSC(attr, imm); | |||
// Instructions with optional shifts need larger instrDesc to store state | |||
instrDesc* id = optionalShift ? emitNewInstrCns(attr, imm) : emitNewInstrSC(attr, imm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please double check the TP cost for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing. This is from an older run, but TP shouldn't have changed since my last push. There is a TP impact, but it's quite small, and I think it only comes from the additional branch from checking optionalShift
. The larger instrDesc
should only affect SVE instructions as of writing, as optionalShift
is only true for the new encodings added in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 compares (probably c++ eliminates redundant comparison) happening for non-sve and that is increasing the tp
. May be in future, we should have emitIns_SVE_R_I()
.
Can you change this to so the frequent branch is always true.
instrDesc* id = nullptr;
if (!optionalShift)
{
id = ...
...
}
else
{
...
id = emitNewInstrCns(attr, imm);
id->idOptionalShift(hasShift);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated diffs look the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we might eventually have to move to emitIns_sve_
, but lets hold off on that for now.
Assembly diffsThroughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (+0.01%)
MinOpts (+0.00% to +0.04%)
FullOpts (+0.01%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (+0.01%)
MinOpts (+0.00% to +0.04%)
FullOpts (+0.01%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (+0.01%)
MinOpts (+0.00% to +0.04%)
FullOpts (+0.01%)
Details here |
src/coreclr/jit/emitarm64.cpp
Outdated
{ | ||
code = emitInsCodeSve(ins, fmt); | ||
code |= insEncodeReg_V_4_to_0(id->idReg1()); // ddddd | ||
code_t imm8 = (code_t)(emitGetInsSC(id) & 0xFF); // iiiiiiii |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given everything else is in functions, then this should be moved into a new helper function:
code_t imm8 = (code_t)(emitGetInsSC(id) & 0xFF);
code |= (imm8 << 5);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Diff results for #97238Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (+0.27% to +0.64%)
MinOpts (+0.70% to +1.21%)
FullOpts (+0.25% to +0.44%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (+0.28% to +0.58%)
MinOpts (+0.70% to +1.21%)
FullOpts (+0.25% to +0.44%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (+0.27% to +0.57%)
MinOpts (+0.70% to +1.21%)
FullOpts (+0.25% to +0.44%)
Details here |
Part of #94549. These formats include some
add
andmov
encodings, which are among the instructions we're prioritizing here.JitDisasm output:
cstool output:
Note that there are some diffs in the above outputs due to differences in how immediate values are printed:
0xFF
vs255
)#imm, LSL #8
, whereas cstool seems to only do it for immediate values of zero.cc @dotnet/arm64-contrib.