Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer xor to sub for counting bits #4693

Closed
wants to merge 1 commit into from
Closed

Conversation

AZero13
Copy link
Contributor

@AZero13 AZero13 commented May 24, 2024

For subtracting from 63 or 31 alone, exclusive or is faster than sub because exclusive or doesn't set any flags, unlike sub.

@AZero13 AZero13 requested a review from a team as a code owner May 24, 2024 00:32
For subtracting from 63 or 31 alone, exclusive or is faster than sub because exclusive or doesn't set any flags, unlike sub.
@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented May 24, 2024

exclusive or doesn't set any flags, unlike sub.

This is not correct. On x86 and x86-64 xor does set flags.

Still, this is indeed a bit faster, because xor is commutative, the compiler reorders args, and the constant becomes immediate, which saves an instruction.

The previous attempt was rejected due to little gain for more complexity, it was suggested that the compiler should implement this optimization instead.

If we are doing this, should also change the occurrences in vector_algorithms.cpp. And the remaining occurrences in __msvc_bit_utils.hpp. Some may be spelled as _Digits - 1 - _Result or something.

@AZero13
Copy link
Contributor Author

AZero13 commented May 24, 2024

The compiler doing this was actually rejected for multiple reasons.

@AZero13
Copy link
Contributor Author

AZero13 commented May 24, 2024

It also is because this only works if the compiler knows bitscan returns a number between 0 and 31

@Alcaro
Copy link
Contributor

Alcaro commented May 24, 2024

The relevant previous discussion is #3925 (comment)

Clang can do this optimization, so in theory, it's possible. https://godbolt.org/z/hc8hrj8eo

But in practice, the definition of _BitScanReverse is more complex than __builtin_clz - latter has undefined behavior if input is zero, former returns an unspecified value that's not guaranteed to be in the range 0 to 31 (this optimization is only valid if rhs <= lhs; 31-32 != 31^32).

It is possible to model the builtin in a way that allows this optimization, but it's more complicated than simply 'the return value is always 0 to 31, and if it isn't, you dun goofed and it's not a compiler bug'.

Long-term, teaching the compiler to optimize that is the better solution. But I don't know how long term; it may be long enough that it's wiser to install a short-term solution as well (with a TRANSITION comment).

@StephanTLavavej StephanTLavavej added performance Must go faster decision needed We need to choose something before working on this labels May 28, 2024
@AlexGuteniev
Copy link
Contributor

Previous discussion was related specifically to vector_algoritmgs.cpp #3925 (comment)

The optimization in that place has less benefits and adds more confusion, so the outcome in general may be reconsidered.

@StephanTLavavej StephanTLavavej removed the decision needed We need to choose something before working on this label May 29, 2024
@StephanTLavavej
Copy link
Member

Thanks for looking into this. We talked about this in the weekly maintainer meeting, and have decided that the potential savings of less than a cycle are outweighed by the decrease in code clarity. We believe that the compiler should ultimately be responsible for emitting ideal codegen here, and DevCom-10487304 is still active, so we're going to close this PR without merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants