Use LSB of vector when converting from vector to mask #116991
Closed
Not quite sure this is correct; it will likely break real-world code.
The general consideration is that what we expose to the end user is a "full vector", so if they do something like `Vector.ConditionalSelect(mask, x, y)` then they are expecting this to behave exactly as `(x & mask) | (y & ~mask)`, which is that it selects `x` when the corresponding bit in `mask` is set and `y` when it is clear.

The user isn't thinking about or considering the fact that on xarch the `blendv` instruction only checks the MSB, or any other similar nuance. They are writing a general-purpose cross-platform API where the contract is "bitwise" from the original vector.
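To make that contract concrete, a minimal sketch (the constants here are illustrative, not from the thread): `ConditionalSelect` must agree with the bitwise expansion even for a mask element that is neither `Zero` nor `AllBitsSet`.

```csharp
using System;
using System.Runtime.Intrinsics;

// A mask element that is neither Zero nor AllBitsSet still selects bitwise.
Vector128<uint> mask = Vector128.Create(0x0F00FF00u);
Vector128<uint> x    = Vector128.Create(0xAAAAAAAAu);
Vector128<uint> y    = Vector128.Create(0x55555555u);

Vector128<uint> selected = Vector128.ConditionalSelect(mask, x, y);
Vector128<uint> bitwise  = (x & mask) | (y & ~mask);

Console.WriteLine(selected == bitwise); // True: the contract is purely bitwise
```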
Correspondingly, we've documented that APIs which take a `mask` input are expecting `AllBitsSet` for that element to indicate `true` and `Zero` for that element to indicate `false`. There isn't a consideration for them that the underlying hardware happens to have a conversion to/from the internal mask/predicate register representation that can optimize it a particular way.

The reason we document this is that all the hardware that doesn't have a dedicated masking register concept has its comparisons return `AllBitsSet` or `Zero` per element, because it's the most effective thing to use when dealing purely with vector data and it makes this experience consistent regardless of where you are or what you're doing.

Now, there are some dedicated instructions like `Sse41.BlendVariable` which explicitly take a vector in hardware and which explicitly document that they only consider the MSB of each element, but these are special and require direct usage. They aren't something we always transform something like `Vector128.ConditionalSelect` into, because that would result in a different semantic for unknown vectors (anything that wasn't `AllBitsSet` vs `Zero` per element).

I believe the same consideration exists here. We cannot implicitly have the internal `ConvertVectorToMask` helper behave as if it is `(x & 1) != 0`, because that would break the rest of the logic flow and normalization considerations we've taken throughout the JIT and laid down for users.

More broadly, I'm not understanding what is meant by "the predicate registers conceptually only represents a bit (LSB) of the corresponding vector lanes".
The manual simply documents that a predicate register provides one bit for each byte of the vector. It then covers how those bits are interpreted for each element size:
Which is to say that for `Vector128<byte>` you get 16x 1-bit predicate values. For `Vector128<ushort>` you get 8x 2-bit predicate values, `V128<uint>` is 4x 4-bit, and `V128<ulong>` is 2x 8-bit.

For the multi-bit predicates, you get the consideration that only the lsb of the predicate element is actually checked. That is, for `V128<ushort>` you get the following, showing that only the lsb is relevant:

| predicate element bits | interpreted as |
| --- | --- |
| `00` | `false` |
| `01` | `true` |
| `10` | `false` |
| `11` | `true` |

Which would indicate that the conversion being done here isn't really "correct" in the first place. The existing logic was already producing a correct predicate, and more than that, a more useful predicate, because it can freely be reused for any smaller `T`. That is, because we only convert to `00` or `11` predicates, the `V128<short>` is equally reusable as a predicate for `V128<byte>` since it remains a valid bytewise predicate for the same comparison. If you normalized this to only `00` vs `01`, we'd need to then normalize that for use with `V128<byte>` (and so on for 4-bit and 8-bit predicate elements).

Given that, I'd think any "incorrect" behavior would potentially be in `ConvertMaskToVector`, where something could've produced a `Mask` for `ushort` elements that was `00`/`01`/`10`/`11` and where we'd want to produce a `Vector` that was still `Zero`/`AllBitsSet`/`Zero`/`AllBitsSet` for those respective values; but I believe the conversion codegen is already handling that nuance due to how it uses the mask as part of the selection.
So I don't think we have any consideration where a managed API is taking a `Vector<T>` where we aren't expecting it to be precisely `Zero` vs `Non-Zero` for the mask elements for conversion purposes, and where we aren't requiring that a mask-producing node normalize to `Zero` vs `AllBitsSet` per element.

We only have considerations for what internal nodes might produce when directly consumed, but those are never producing vectors where only the LSB is considered, as far as I can tell. There is only the case that a mask might only set the lsb of the predicate element, and we should already be handling that nuance.
Consider an XOR where the result is then used as a mask:
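The original snippet didn't survive extraction; the following is a hedged sketch of the same shape of problem (the constants are mine):

```csharp
using System.Runtime.Intrinsics;

// Two vectors that are both "true" under a non-zero rule...
Vector128<int> a = Vector128.Create(0b01);
Vector128<int> b = Vector128.Create(0b10);

// ...whose XOR is also non-zero, hence still "true".
Vector128<int> xored = a ^ b; // 0b11 per element

// Route 1 (XOR the vectors, then convert): non-zero(0b11) = true.
// Route 2 (convert, then predicate XOR):  true ^ true     = false.
// Sampling one fixed bit (for example the LSB) does commute with XOR,
// since bit(a ^ b) == bit(a) ^ bit(b).
```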
By the rules we've set up, this should be exactly the same as first converting the inputs to a mask and then XORing using the predicate XOR instruction. But today the two give different results.
Using the LSB for masks, the results now match.
In `simd.h` the code was using the MSB when converting from Vector to Mask. As of my #115566, it's now using the LSB. Either way, neither of these options matches the idea of using all bits set for a true mask element.

Previously we got away with this because we kept masks as `TrueVector` nodes, and so none of the constant folding optimisations ever got triggered. As of #115566 those optimisations are getting triggered by Fuzzlyn when passing all-constant vectors into `ConditionalSelect` and other such oddities.
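For illustration, a small model (not the actual simd.h code) of the candidate per-element rules, showing how they disagree on the same constant and why the constant-folded and runtime paths must pick the same rule:

```csharp
using System;

// Three candidate per-element rules for vector-to-mask conversion.
static bool MsbRule(uint element)     => (element & 0x8000_0000u) != 0;
static bool LsbRule(uint element)     => (element & 1u) != 0;
static bool NonZeroRule(uint element) => element != 0;

uint e = 0x0000_0001u; // the kind of element Fuzzlyn's constants produce

Console.WriteLine(MsbRule(e));     // False
Console.WriteLine(LsbRule(e));     // True
Console.WriteLine(NonZeroRule(e)); // True
// If constant folding uses one rule and codegen another, all-constant
// inputs fold to a different answer than the runtime path computes.
```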
This shouldn't be part of the existing rules/logic. Is there a place that is doing this in the Arm64 code already? If so, I think that's a bug and is what needs to be fixed.
This is the same general issue as `bool ^ bool`, where C# defines it as `0` (literal false) vs `1` (literal true) whereas IL defines it as `0` (literal false) vs `not-0` (conceptual true), and so simply doing `(x ^ y) == true` can be distinct from `(x ^ y) != false`, because doing a literal xor of booleans can produce something that isn't the same as the literal `true`.
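For illustration, a hedged sketch of that `bool` nuance, using `Unsafe` to mimic the unnormalized values IL-level code can hand us (pure C# only produces 0 or 1):

```csharp
using System;
using System.Runtime.CompilerServices;

byte raw = 2;
bool x = Unsafe.As<byte, bool>(ref raw); // non-zero, so "true" per IL rules
bool y = false;

bool z = x ^ y; // xor of the underlying bytes: 2 ^ 0 == 2
Console.WriteLine(Unsafe.As<bool, byte>(ref z)); // 2: not the literal true (1)
// A check emitted as "z != false" (compare with 0) succeeds, while one
// emitted as "z == true" (compare with 1) can fail: they aren't equivalent.
```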
For anything like `Sve.CompareEqual` we know that it produces a predicate result. This will actually only set the lsb of the respective "predicate element". However, we guarantee to the end user that if they end up viewing it as a vector it will only be `AllBitsSet` (predicate was true for that element) or `Zero` (predicate was false for that element), and so the `CvtMaskToVector` will do the right thing for that platform (on both xarch and arm64 there is 1 bit to check per element; the difference is how those bits are packed throughout the predicate register).

The user then doesn't have an API that lets them directly do `XorPredicate`; we only light that up via internal pattern recognition, and we know the base type of a given mask, so we know whether two masks are compatible. This is correspondingly why the morph logic that produces `XorMask` requires it to explicitly be in the shape of `Xor(CvtMaskToVector(mask), CvtMaskToVector(mask))` and requires the two masks to be for compatible SIMD sizes/base types, because otherwise it could produce a difference in behavior. -- In general it isn't safe to do `predicate1Bit ^ predicate2Bit` because they aren't logically compatible. There are some special scenarios involving constants where you can treat them as compatible due to being able to observe the bit patterns, but otherwise you have 1 bit per byte vs 1 bit per 2 bytes, so you'd get a potentially incorrect result whether it was consumed as a 1-bit or a 2-bit predicate element value.
As I understood it, the intention was that we would be able to optimise Xor to XorPredicate; otherwise we'd be missing SVE instructions. See #101294 (comment) and other places. Recently #114438 added support for this, but it was then turned off with #115566. #116854 attempts to enable it again, but it's stuck behind this PR.
Right. It should be fairly simple to update #116854 to work like that.
There are also the approved SVE APIs for AndNot and OrNot. The instructions for these are predicate-only. Using all-bits-set masks, I'm not sure whether we can safely implement them or not. #116628 adds those APIs, but it is stuck behind all the above work.
How should a constant vector be turned into a mask if the user passes it into an API that uses a mask?
`EvaluateSimdCvtVectorToMask()` in simd.h used to check the MSB, and now checks the LSB. What bit or bits should be turned into true? Only `AllBitsSet`?

Given we're getting into July now, I'm quite keen to get this all fixed up ASAP.
Yes, but only where the optimization is valid. We cannot do it when it would change semantics for the managed user.
In practice this hasn't been an issue for any existing code paths. Developers are not taking a mask for `Vector<byte>` and combining it with a mask for `Vector<int>` in the wild. -- They notably wouldn't have been able to combine them properly using raw predicates either, just due to the two masks being fundamentally incompatible, representing different state.

This seems like a fairly expensive/verbose approach. The xarch logic is far simpler, https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L10009-L10147, and it catches all the important scenarios for very little expense.
The existing logic just looks for `BitwiseOp(CvtMaskToVec(op1), CvtMaskToVec(op2))` and transforms it to `BitwiseOpMask(op1, op2)` if the two inputs are compatible. This takes a handful of lines of code, and Arm64 should reasonably be able to light up with the same logic, just by adding the relevant `maskIntrinsicId` selections for Arm64 SVE. -- It and other paths that are handling the x64 KMASK setup were intentionally set up so that Arm64 predicates could be plugged in trivially when ready.

If they are only for predicates, they likely shouldn't be exposed. There's not really a need to expose them either, and I think the names are potentially confusing due to them being `nand`/`nor` -- we already have `AndNot` APIs and it is expected they behave like `BIC` does, which is `x & ~y`, whereas the predicates are doing `~(x & y)` and `~(x | y)` instead.
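A quick sketch of that distinction (the values are mine): `AndNot` behaves like `BIC`, which is a different operation from the predicate `nand`/`nor` forms.

```csharp
using System;
using System.Runtime.Intrinsics;

Vector128<uint> x = Vector128.Create(0b1100u);
Vector128<uint> y = Vector128.Create(0b1010u);

Vector128<uint> andNot = Vector128.AndNot(x, y); // x & ~y == 0b0100 (BIC)
Vector128<uint> nand   = ~(x & y);               // ~(0b1000): a different value
Vector128<uint> nor    = ~(x | y);               // ~(0b1110): different again

Console.WriteLine(andNot == (x & ~y)); // True: AndNot is BIC, not nand/nor
```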
x64 has a similar predicate-only concept with `xnor`, which does `~(x ^ y)`, but it not having a managed API isn't problematic because the user is more likely to write the natural pattern (which can also be optimized for other platforms or ISAs, like AdvSimd). We simply recognize `NOT(XOR(CvtMaskToVec(x), CvtMaskToVec(y)))` and translate it to `XnorMask(x, y)`. -- The normalization to `NOT(XOR(...))` also makes it easier to light up constant folding, comparison inversions, and other basic optimizations; so we intentionally decompose things like `V128.AndNot(x, y)` into `AND(x, NOT(y))` instead of importing it directly as `AND_NOT(x, y)`. We then construct the actual `AND_NOT` much later, such as in lowering, when we know that no other optimization work is needed and we just want to optimize the instruction selection.
We basically have 4 options here:

- `non-zero` is `true`, `zero` is `false` (how `bool` works)
- `all-bits-set` is `true`, anything else is `false`
- `msb set` is `true`, `msb clear` is `false` (`vpmov*2m` works on xarch)
- `lsb set` is `true`, `lsb clear` is `false` (`pmov` to predicate works on arm64)
The intended behavior is the first: `non-zero` is `true`, `zero` is `false`. This is the intended behavior because it corresponds to how `bool` already works, is the cheapest to emulate for unknown inputs, lets us trivially optimize for both xarch and arm64 for known inputs, and allows cross-platform determinism (including simplistic mapping to hardware without specialized predicate/masking support).
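As a sketch of why the first rule is cheap to emulate (the helper name is mine, not a runtime API): for an unknown vector, "non-zero is true" is just an inequality against zero, which every platform can produce as a canonical `AllBitsSet`/`Zero` mask (or a predicate on SVE) in one comparison.

```csharp
using System.Runtime.Intrinsics;

// Canonicalize an arbitrary vector to an AllBitsSet/Zero mask under the
// "non-zero is true" rule: one element-wise compare against zero, inverted.
static Vector128<uint> ToCanonicalMask(Vector128<uint> v) =>
    ~Vector128.Equals(v, Vector128<uint>.Zero);

// 0x00000001 is non-zero, so it canonicalizes to AllBitsSet in that lane.
Vector128<uint> m = ToCanonicalMask(Vector128.Create(0x00000001u));
```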
In practice the need to use this conversion should be rare, because users should not be taking arbitrary (especially non-constant) vectors and trying to use them as inputs into scenarios that expect predicates/masks. Most typically, code is using the output of some `Compare` or similar instruction which directly produced a mask, potentially doing some minor bitwise operations on it to combine it with other masks.
Agreed, but sadly Fuzzlyn doesn't agree - currently this is where all the failures are coming from, with code I doubt we'd ever see in the wild.
I'll get this fixed up this week. I'll draft a new PR to fix the mask handling. Then, combined with #116852, that should fix all the current issues.
Right. It's important we still handle it correctly, and it's great Fuzzlyn is catching these. It's just not critical for the codegen to be "peak efficiency", since it's not likely to happen in the real world.
Rather, we want to ensure that morph/folding/lowering catch the typical real-world cases and get the good codegen there. From what I've seen of the existing x64 support (which has been in production a couple of years now), we're getting most of the right stuff already, and most of that is set up for Arm64 to hook in for the same benefits.
I've put the required changes to `EvaluateSimdCvtVectorToMask()` in #116852.

This PR should probably be closed now.