Save 260k in InitValueNumStoreStatics #85945
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
It appears that the big #define for the intrinsics is causing InitValueNumStoreStatics to get big enough that C++ optimization ends up being disabled, which means a lot of constant operations aren't folded. This change outlines a bunch of those operations, which results in big code size savings, but the operations are still happening at initialization time at runtime.
Local size change of release clrjit_win_x64_x64.dll: 2,001,920 -> 1,755,648. InitValueNumStoreStatics goes from 241k to 20k.
Is it desirable? Would it be better to initialize it at compile time instead?
I can take a closer look. The table creation isn't as simple as some other cases where we've done this.
So this doesn't quite make it static, though I've gotten suspicious about initialization order and need to check that.
src/coreclr/jit/valuenum.h
    const UINT8& operator[](size_t idx) const { return m_arr[idx]; }
private:
    static constexpr unsigned GetArity(unsigned oper);
    static constexpr unsigned GetCommutative(unsigned oper);
    static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);

    UINT8 m_arr[VNF_COUNT];
    const uint8_t& operator[](size_t idx) const { return m_arr[idx]; }
private:
    static constexpr unsigned GetArity(unsigned oper);
    static constexpr unsigned GetCommutative(unsigned oper);
    static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);
    uint8_t m_arr[VNF_COUNT];
Nit: We want to stick to std C++ type names where possible.
What's missing that would allow this all to become constant? It looks like the bulk of the logic comes from setting the
Then, for the hwintrinsics which can have a non-constant arity, we can simply check for Likewise, the need for The same should be possible for
I think the short answer is that "just" getting it to work is what's needed, but the details could be problematic. The initial version of this PR was a tiny change that achieved most of the space savings. The current version doesn't build on Linux because the compiler there complains about C++14 extensions. Perhaps there is a path forward (maybe less constexpr). There are probably also ordering issues. The
I think for this either duplicating the info or changing how it's represented in
👍
I think it's connected in general to the premise of avoiding dynamic initialization of the tables. I think it can be postponed as a separate fix, however.
Right, I'm suggesting we add a field to the As it is today, this is basically pessimized to be "if any two base types use different instructions, encode the base type". We should fix that longer term, but short term simply adding a field to the
I'm surprised it takes this long. We have much more metadata in the actual
An update here: I have added some new data to the tables to support building a const table. For now, the old code is still there and is used to assert that the new method matches the old method. This could be removed in favor of the static_asserts in this PR (and something new to more directly test the vnEncodesResultTypeForHWIntrinsic replacement), though merging this with the assertion and removing it later feels safer. I haven't made any changes to the logic where we use -1 or 0 as special values, so perhaps nothing needs to be split out from this PR at this point? Perhaps I could expand the tables and add assertions to the old logic as an intermediate point, but I'm less concerned since I'm not changing the special values. The win is up to 260k now that all of the code is gone. I do still need to handle the merge conflict and haven't seen a Linux build yet.
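The "keep the old code and assert that the new table matches it" step might look roughly like the sketch below. All names here (kDemoStaticAttribs, DemoLegacyAttribs, DemoFirstMismatch, DemoValidateStaticTable) and the byte values are hypothetical stand-ins, not the identifiers or data used in this PR; the JIT gates such checks on its own DEBUG configuration, whereas this sketch uses the standard assert macro.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDemoCount = 3;

// New approach: attribute bytes written out as constant data (illustrative values).
constexpr std::uint8_t kDemoStaticAttribs[kDemoCount] = {0x0A, 0x02, 0x01};

// Old approach: the same bytes computed at run time (hypothetical stand-in
// for the legacy dynamic initializer).
inline std::uint8_t DemoLegacyAttribs(std::size_t idx)
{
    static const unsigned arity[kDemoCount]       = {2, 2, 1};
    static const bool     commutative[kDemoCount] = {true, false, false};
    return static_cast<std::uint8_t>(arity[idx] | (commutative[idx] ? 0x08u : 0u));
}

// Returns the index of the first mismatch, or kDemoCount if the tables agree.
inline std::size_t DemoFirstMismatch()
{
    for (std::size_t i = 0; i < kDemoCount; i++)
    {
        if (kDemoStaticAttribs[i] != DemoLegacyAttribs(i))
        {
            return i;
        }
    }
    return kDemoCount;
}

// Cross-check that is compiled out when NDEBUG is defined.
inline void DemoValidateStaticTable()
{
    assert(DemoFirstMismatch() == kDemoCount);
}
```

Returning the mismatching index rather than a bool makes a failing check easier to diagnose, and the legacy path can be deleted later without touching the constant table.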
The very slow build times from before were due to having lots of conditionals leading to a huge method (the dynamic initializer) with lots of control flow. That might be resolved due to having a "nicer" initializer now, but I've also made those bit sets straight-line code. It does need some comments.
Conflicts in hwintrinsiclistxarch.h and gtlist.h
This should now do everything at compile time (except for the validation in debug builds), and the latest merge conflict was fairly simple. It doesn't change the table at all (including the somewhat strange entry for the BOUNDARY value...), so it should be a safe change.
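As a rough illustration of what "everything at compile time" can mean for a table like this, the sketch below expands one list macro twice: once into an enum and once into a constant array, with static_asserts guarding the result. The list DEMO_FUNC_LIST, the DF_ enumerators, the bit layout, and the helper names are all invented for the example and are not the PR's actual definitions; it sticks to C++11-level constexpr.

```cpp
#include <cstdint>

// Hypothetical function list: name, arity, commutative.
#define DEMO_FUNC_LIST(X) \
    X(Add, 2, true)       \
    X(Sub, 2, false)      \
    X(Neg, 1, false)

// First expansion: one enumerator per list entry.
enum DemoFunc : unsigned
{
#define X(name, arity, commutative) DF_##name,
    DEMO_FUNC_LIST(X)
#undef X
    DF_COUNT
};

// Assumed bit layout for the example: low 3 bits hold the arity, bit 3 marks commutativity.
constexpr std::uint8_t kDemoArityMask      = 0x07;
constexpr std::uint8_t kDemoCommutativeBit = 0x08;

// Single-return constexpr function, so it is valid C++11.
constexpr std::uint8_t DemoEncodeAttribs(unsigned arity, bool commutative)
{
    return static_cast<std::uint8_t>((arity & kDemoArityMask) | (commutative ? kDemoCommutativeBit : 0));
}

// Second expansion: the attribute table is constant data, so no dynamic
// initializer has to run at startup.
constexpr std::uint8_t kDemoAttribs[DF_COUNT] = {
#define X(name, arity, commutative) DemoEncodeAttribs(arity, commutative),
    DEMO_FUNC_LIST(X)
#undef X
};

// Compile-time checks on the generated table.
static_assert(DF_COUNT == 3, "list and enum out of sync");
static_assert(kDemoAttribs[DF_Add] == 0x0A, "Add: arity 2, commutative");
static_assert((kDemoAttribs[DF_Neg] & kDemoArityMask) == 1, "Neg: arity 1");
```

Because the enum and the array come from the same list, they cannot drift apart in ordering, which sidesteps the initialization-order concerns that apply to tables filled in at run time.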
@tannergooding Could you please take a look at this again?
HARDWARE_INTRINSIC(Vector64, EqualsAny, 8, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, ExtractMostSignificantBits, 8, 1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, Floor, 8, 1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, get_AllBitsSet, 8, 0, true, {INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
Shouldn't this be encodesExtraTypeArg: false under the current rules?
The current rule is effectively that if all INS_* entries are the same then the type doesn't need to be encoded.
Thank you for the review and especially for looking closely enough to ask this question. I was concerned for a bit because the DEBUG checks should catch any case like this, and they aren't firing. It turns out that the "are the same" logic only applies to xarch. For arm64 the current rule is just that having 2+ entries indicates that the type needs to be encoded.
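The difference between the two rules described above can be sketched as two small predicates over a row of per-base-type instructions. This is an illustration only: DemoIns, DemoRow, and both function names are hypothetical, skipping INS_invalid slots is an assumption about what counts as an "entry", and the loops inside constexpr functions require C++14 or later.

```cpp
#include <cstddef>

// Hypothetical stand-ins for a few instruction ids.
enum DemoIns { INS_invalid, INS_mvni, INS_add, INS_fadd };

constexpr std::size_t kDemoBaseTypes = 10;

struct DemoRow
{
    DemoIns ins[kDemoBaseTypes];
};

// xarch-style rule: the type is encoded only when the valid entries
// contain at least two differing instructions.
constexpr bool DemoEncodesTypeXarchStyle(const DemoRow& row)
{
    unsigned differing = 0;
    DemoIns  last      = INS_invalid;
    for (std::size_t i = 0; i < kDemoBaseTypes; i++)
    {
        if (row.ins[i] != INS_invalid && row.ins[i] != last)
        {
            differing++;
            last = row.ins[i];
        }
    }
    return differing >= 2;
}

// arm64-style rule: two or more valid entries are enough, even if they
// all name the same instruction.
constexpr bool DemoEncodesTypeArm64Style(const DemoRow& row)
{
    unsigned valid = 0;
    for (std::size_t i = 0; i < kDemoBaseTypes; i++)
    {
        if (row.ins[i] != INS_invalid)
        {
            valid++;
        }
    }
    return valid >= 2;
}

// Rows shaped like the table entries discussed in this thread.
constexpr DemoRow kDemoAllBitsSet = {{INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni,
                                      INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni}};
constexpr DemoRow kDemoFloor = {{INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid,
                                 INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}};
// Invented row with different integer and floating-point instructions.
constexpr DemoRow kDemoMixed = {{INS_add, INS_add, INS_add, INS_add, INS_add,
                                 INS_add, INS_add, INS_add, INS_fadd, INS_fadd}};

static_assert(!DemoEncodesTypeXarchStyle(kDemoAllBitsSet), "same instruction everywhere");
static_assert(DemoEncodesTypeArm64Style(kDemoAllBitsSet), "ten valid entries");
static_assert(!DemoEncodesTypeArm64Style(kDemoFloor), "no valid entries");
```

Under these assumptions a row like get_AllBitsSet yields false with the xarch-style rule and true with the arm64-style rule, which is consistent with the flag being true in the arm64 table without the DEBUG checks firing.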
HARDWARE_INTRINSIC(Vector64, Floor, 8, 1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, get_AllBitsSet, 8, 0, true, {INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
HARDWARE_INTRINSIC(Vector64, get_One, 8, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
HARDWARE_INTRINSIC(Vector64, get_Zero, 8, 0, true, {INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
Same for get_Zero here and for V128.
Changes overall LGTM.
There's some cleanup that this will enable for the hwintrinsics in particular, which we can do anytime after this gets merged. There are, for example, cases like NI_ISA_And where different instructions might be used but where all operations are functionally equivalent, so the type arg doesn't actually need to be encoded.
For some reason, the build analysis appears to be listing the same failure as known and unknown.
It appears that the big #define for the intrinsics is causing InitValueNumStoreStatics to get big enough that C++ optimization ends up being disabled, which means a lot of constant operations aren't folded. This rewrites it as a const table. It adds some redundant information to the tables that we #include/#define in several places but currently includes many assertions that the old and new values match.
Local size change of release clrjit_win_x64_x64.dll: 2,001,920 -> 1,735,680. InitValueNumStoreStatics (261k) is replaced by 1.2k of static data.
Resolves #85953