Add Arm N2 SVE perfscore file #95700
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@kunalspathak @BruceForstall @TIHan Keeping as a draft PR for the moment in case anyone has a better idea. I wondered about replacing the switch with a lookup table, but all the exceptions make that non-obvious. Maybe best to keep the switch for the time being.
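For anyone skimming, here is a minimal self-contained sketch of the two shapes being weighed up; the enum values and sizes below are illustrative stand-ins, not the actual insOpts definitions from emitarm64.h:

```cpp
#include <cassert>

// Illustrative stand-in for the SVE-related insOpts values; the real enum is
// much larger and these names are assumptions made for the sketch.
enum insOptsSketch
{
    OPT_SCALABLE_B, // byte elements
    OPT_SCALABLE_H, // halfword elements
    OPT_SCALABLE_S, // word elements
    OPT_SCALABLE_D, // doubleword elements
};

// Switch form: verbose, but the exceptions are easy to special-case inline.
unsigned sveElemsizeSwitch(insOptsSketch opt)
{
    switch (opt)
    {
        case OPT_SCALABLE_B: return 1;
        case OPT_SCALABLE_H: return 2;
        case OPT_SCALABLE_S: return 4;
        case OPT_SCALABLE_D: return 8;
        default:             assert(!"unexpected insOpts"); return 0;
    }
}

// Lookup-table form: compact, but every exception needs an out-of-band check,
// which is why the switch may stay the clearer option here.
unsigned sveElemsizeTable(insOptsSketch opt)
{
    static const unsigned elemSizes[] = {1, 2, 4, 8}; // indexed by insOptsSketch
    assert(static_cast<unsigned>(opt) < 4);
    return elemSizes[opt];
}
```

Either shape could just as easily return bits (8/16/32/64) rather than bytes, which is relevant to the comment below.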
* Returns the 1/2/4/8 byte elemsize for an Arm64 Sve vector instruction
*/

/*static*/ unsigned emitter::insSveElemsize(insOpts opt)
I'm now thinking this function should return bits instead of bytes.
This looks completely reasonable to me.
Are they different from what is defined in emitPerfScore_sve.cpp that I have uploaded in #94549?
@@ -0,0 +1,868 @@
// Licensed to the .NET Foundation under one or more agreements.
This should be included in CMakeLists.txt.
#define PERFSCORE_THROUGHPUT_N2_LOOP_CONTROL_PROPAGATING PERFSCORE_THROUGHPUT_1C

// Loop control, propagating and flag setting
#define PERFSCORE_LATENCY_N2_LOOP_CONTROL_PROPAGATING_AND_FLAG_SETTING PERFSCORE_LATENCY_3C
We do not have platform- or hardware-specific perfscores. It is probably good to have it that way, but:
- During runtime, I am not sure if we detect hardware and will ever pick the appropriate perfscores, PERFSCORE_LATENCY_N2_* vs. PERFSCORE_LATENCY_N3_*, etc.
- This can easily get outdated (which happens with most of the perfscores) or go unused once we add scores for newer hardware.
- With new hardware perf scores added, it will be more confusing to pick which one is best for a given instruction.
- They are purely used for display purposes and we do not depend on them for any optimizations.
- IMO, we should just follow the existing approach of having these numbers to the best of our ability and then generalize them as the missing instruction numbers are made available.
Are they different from what is defined in emitPerfScore_sve.cpp that I have uploaded in #94549?
I hadn't spotted the latest update of emitPerfScore_sve.cpp.
But I now realise where the errors are coming in: for PERFSCORE_THROUGHPUT you've mixed up the X and C cases. X is when more than one instruction issues per cycle. Looking at the file, INS_sve_fsqrt is marked as PERFSCORE_THROUGHPUT_14X, which can't be right as it's always going to be a slow instruction.
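To make that concrete, this is how I read the existing convention; the numeric values below are illustrative, not the actual defines from emit.h:

```cpp
// Perfscore throughput is expressed in cycles per instruction.
// "nX" names mean n instructions can issue per cycle (a value below one cycle),
// "nC" names mean one instruction every n cycles.
// Sketch values only; the real constants live in emit.h.
constexpr float THROUGHPUT_2X = 1.0f / 2.0f; // dual issue: 0.5 cycles per instruction
constexpr float THROUGHPUT_1C = 1.0f;        // single issue: one cycle per instruction
constexpr float THROUGHPUT_2C = 2.0f;        // one instruction every two cycles

// Under that reading, a slow instruction like fsqrt wants a large "C" style
// value (many cycles per instruction), not "14X" (fourteen issued per cycle).
```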
With that fixed, I'm thinking maybe we should stick with your file instead of this PR.
What is missing from your file are the cases where the perfscore changes depending on the element size (eg IF_SVE_AK_3A). I noticed that NEON has these.
I also think for currently unsupported instructions we should set the score high, rather than 1C.
They are purely used for display purposes and we do not depend on them for any optimizations.
What are they used for? Is it just for manually comparing the relative performance of one version of a compiled routine against another? Deciding whether a new coreclr optimisation improves or regresses the code?
If so, then my concern is that the relative performance of different instructions can change quite a bit over time depending on microarchitecture design, and especially between different types of CPUs (eg a mobile chip vs a server one), as they are designed for different purposes. So, basing decisions on old architectures won't match the actual performance seen on real hardware. We could quite easily add an optimisation that has a better perfscore but worse real performance. Right now, comparing a NEON loop against an SVE loop using the existing perfscore data would be a bad idea, as I suspect NEON comes from a much older chip.
Ok, I can see why you wouldn't want multiple sets of perfscores. But on the other hand, I don't think a mixture of scores from a variety of sources is helpful. I'm not sure what a good solution would be, though.
Perfscore is currently an estimate of performance that can be used in asmdiffs to determine if a change is good from a performance perspective, not just a code size perspective. See "superpmi.py asmdiffs -metrics" and jit-analyze "--metrics".
We're not sure how good these scores are. @AndyAyersMS has been doing various analyses to determine how usable perfscore is for (auto) tuning CSE instead of running benchmarks and measuring time (which is noisy, etc.).
I think choosing a particular platform and using it for all the data makes sense. Whether we ever change the set of numbers we use based on the particular target platform is TBD. We're nowhere close to being able to make that decision. Updating the existing numbers to be based on a particular modern NEON platform makes sense, but might not be worth the effort, either.
X is when more than one instruction issues per cycle. Looking at the file, INS_sve_fsqrt is marked as PERFSCORE_THROUGHPUT_14X, which can't be right as it's always going to be a slow instruction.
The way X is currently used is for throughput < 1. For fsqrt, that sounds correct because it is 1/14 at minimum.
What are they used for? Is it just for manually comparing the relative performance of one version of a compiled routine against another? Deciding whether a new coreclr optimisation improves or regresses the code?
That's right. We do not even have them on by default in the spmi-diffs pipeline; the default we use is codeSize. But for register allocation changes, PerfScore comes in handy, because those changes can eliminate memory accesses, which increases the code size but lowers the perfscore.
If so, then my concern is that the relative performance of different instructions can change quite a bit over time depending on microarchitecture design, and especially between different types of CPUs
Yes, that is a true statement and we do not keep it up to date.
Right now, comparing a NEON loop against an SVE loop using the existing perfscore data would be a bad idea, as I suspect NEON comes from a much older chip.
I don't think we would do that. We will just compare the disassembly :)
So, basing decisions on old architectures won't match the actual performance seen on real hardware
Pretty much what @BruceForstall mentioned about our experimentation in this area.
Right now, comparing a NEON loop against an SVE loop using the existing perfscore data would be a bad idea, as I suspect NEON comes from a much older chip.
Would it make sense to just update all the data to be from a single chip then?
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.
The SVE perfscores were full of small errors. I've attempted to fix this by autogenerating a bunch of #defines in perfscorearm64sven2.h from the table in #94549 (comment).

Sadly, there's no obvious way of auto-mapping from instruction group to perfscore group; that still needs manual checking. Thankfully, many instruction groups map to exactly one perfscore group.

When adding new instruction groups to codegen, the easiest route for the perfscore is to look up the group in perfscorearm64sven2.h to find the correct defines.

I've named everything with N2 with the intention that at some point in the future we may want to add encodings for additional chipsets. It would also be useful at some point for the general and NEON instructions to match the same CPU.
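To show how a header like this would typically be consumed, here is a rough self-contained sketch of a per-format lookup, including an element-size-dependent case along the lines of IF_SVE_AK_3A. The format names, struct, score constants, and the doubleword special case are all placeholders for illustration, not the JIT's actual getInsExecutionCharacteristics code or the real N2 numbers:

```cpp
// Placeholder constants standing in for the generated N2 defines in
// perfscorearm64sven2.h; real names and values would come from the N2 table.
#define PERFSCORE_THROUGHPUT_N2_EXAMPLE_1C 1.0f
#define PERFSCORE_THROUGHPUT_N2_EXAMPLE_2C 2.0f
#define PERFSCORE_LATENCY_N2_EXAMPLE_2C    2.0f
#define PERFSCORE_LATENCY_N2_EXAMPLE_4C    4.0f

// Hypothetical subset of instruction formats and element sizes for the sketch.
enum class SveFormatSketch { ElemSizeDependent /* e.g. a format like IF_SVE_AK_3A */, Other };
enum class ElemSizeSketch { B = 1, H = 2, S = 4, D = 8 };

struct ExecCharacteristicsSketch
{
    float throughput;
    float latency;
};

// One perfscore group per instruction format, with the element size folded in
// where the chosen chip's numbers differ per element width.
ExecCharacteristicsSketch getSveScoresSketch(SveFormatSketch fmt, ElemSizeSketch size)
{
    switch (fmt)
    {
        case SveFormatSketch::ElemSizeDependent:
            // Assumed for the sketch: doubleword elements score differently.
            if (size == ElemSizeSketch::D)
            {
                return {PERFSCORE_THROUGHPUT_N2_EXAMPLE_1C, PERFSCORE_LATENCY_N2_EXAMPLE_2C};
            }
            return {PERFSCORE_THROUGHPUT_N2_EXAMPLE_2C, PERFSCORE_LATENCY_N2_EXAMPLE_4C};

        default:
            return {PERFSCORE_THROUGHPUT_N2_EXAMPLE_1C, PERFSCORE_LATENCY_N2_EXAMPLE_2C};
    }
}
```

For formats whose instructions aren't scored yet, the default branch could return a deliberately high score rather than 1C, per the earlier comment.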