Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strict_not_null performance issues with reference types #930

Closed
Amaroker opened this issue Sep 29, 2020 · 18 comments
Closed

strict_not_null performance issues with reference types #930

Amaroker opened this issue Sep 29, 2020 · 18 comments
Assignees

Comments

@Amaroker
Copy link

Amaroker commented Sep 29, 2020

This is a follow up to #396, where @hsutter expressed concern of changes needed to the call site and asked for examples.

To understand my situation better, here's some background information:

It's our company's policy to only use not_null for legacy code / existing APIs that we can't break. However, the first choice for new modern code is always strict_not_null, because we much prefer compile-time errors to run-time errors.

So, the only times we do changes to call sites are during internal refactoring when we strengthen the code with strict_not_null. With this approach I believe Herb's concern doesn't apply.

Consider the following function declaration:

void foo( const strict_not_null<X*>& p )

(1). Sometimes we already have a not_null pointer ready to use directly...

foo( px );

...and (2) sometimes we have a stack based value and we need to construct a temporary strict_not_null object.

foo( strict_not_null<X*>(&x) );

However, (2) is not as efficient as it could be. Performance could be improved by introducing a new explicit and noexcept constructor that avoids the null check. It wouldn't break any existing code.

@NicolBolas
Copy link

NicolBolas commented Sep 30, 2020

So just to be clear, the goal would be to make this work:

foo(strict_not_null<X*>(x));

But because it's explicit, this won't work:

foo(x);

Correct?

Here's a question: given CTAD, what exactly does strict_not_null(x) do, and what exactly do we want it to do?

@Amaroker
Copy link
Author

@NicolBolas
Correct. Still, dangling pointers must be avoided, so rvalue references must not be accepted.

foo( strict_not_null<X*>(X{}) ); // compiler error

@hsutter
Copy link
Collaborator

hsutter commented Sep 30, 2020

Thanks!

However, (2) is not as efficient as it could be. Performance could be improved

Do you have a test case that this null check has measurable overhead?

I'm not trying to be pedantic here... The reason I ask is because this summer I spent several weeks doing an extensive survey and measurement of 20 years' worth of processors, and could not measure any difference for a similar check for if (this != &rhs) ... for reasons which appear to apply equally to the delete nullptr mandated null check and I think equally for this null check.

I didn't put the paper in a mailing yet because I'm working on a related companion paper that I want to publish at the same time, but for this thread I put a copy in my public misc-papers repo here: https://github.com/hsutter/misc-papers/blob/main/2229%20self%20move%20noop%20data.pdf

The first Conclusion at the end of the paper was this: "Branches with the three characteristics above — with one alternative empty, virtually always taking the nonempty path, and testing a condition that uses only memory locations that will be used anyway by the nonempty path — should be considered cost-free on all known hardware."

[ETA the other conclusion] The second Conclusion, pertinent here, was this, emphasis added: "Language [or in this case library] design decisions should be decided on their other merits, but not considering concern about potential performance cost of a branch for an identity test (e.g., as currently required for copy assignment) or not-null test (e.g., as currently required for delete and free). Those tests should be considered free."

But although I was not able to measure a cost to this kind of test on any current hardware, I would be very interested if someone does have a repro, that would be very helpful. As I mention in the paper, I set out to measure the overhead, expecting it to be measurable, and ended up testing so many processors (with the help of many kind people acknowledged in the paper) because I kept failing to measure it.

@hsutter
Copy link
Collaborator

hsutter commented Sep 30, 2020

If the check doesn't have measurable overhead, then we also don't need to answer any of the other design/usability questions raised by trying to optimize for it.

@Amaroker
Copy link
Author

Amaroker commented Sep 30, 2020

@hsutter

I made a very quick test using an online compiler:.

https://rextester.com/l/cpp_online_compiler_gcc
(or https://rextester.com/l/cpp_online_compiler_clang if you prefer, but for some reason the online VC++ version didn't work well)

Open not_null_test.txt and paste the contents.

Click the "Run it" button and it will show the measured times. Run it a few times to get an idea of the average difference.

What I did was to simply create a new class called not_null_custom. It's an exact copy of gsl::not_null, except I modified the constructors.

Edit: For this test I didn't actually add the constructor that takes a reference, but it shows the speedup without the null check.

Edit2: I just noticed that I made a copy paste error. The first test uses not_null_custom in the first if-case. You may replace it with gsl::not_null, although it doesn't really matter. The second test using only not_null_custom is faster anyway.

@Amaroker
Copy link
Author

Amaroker commented Oct 1, 2020

I've now fixed the mistake:

You can run it directly with Godbolt: https://godbolt.org/z/nqre4E

@hsutter
Copy link
Collaborator

hsutter commented Oct 3, 2020

Quick ack: Thanks, @Amaroker, this is very helpful!

@hsutter hsutter self-assigned this Nov 11, 2020
@hsutter
Copy link
Collaborator

hsutter commented Nov 21, 2020

[various edits to fix typos/grammar/clarity and add a note]

Thanks for waiting, I've now had cycles to look at this.

This is an interim update so far, I'll look at it again with the GSL maintainers, probably after the U.S. holiday week.

First, I distilled it down to a minimal repro (by removing code that wasn't exercised in this test and moving the duplicated test loop into a template) and ensured it still preserved the reported performance difference. That got it down from ~560 lines to ~60 lines, and made it easier to see the delta between gsl::not_null and the suggested not_null_custom as well as to start poking at code alternatives.

The first thing I discovered is that there are actually three differences in the exercised parts of the current and revised not_null types... in the proposed not_null_custom one:

  • the constructor from T removes the nullness check (as advertised above)
  • the constructor from T also adds noexcept (but I could not find a performance difference due to the noexcept in this repro)
  • the get() function also removes the nullness check (and this appears to be the actual immediate reason for the performance difference, not the constructor, more on this below)

(Also: The benchmark comparison was being done against gsl::not_null, not gsl::strict_not_null. The original issue mentioned the latter, so maybe this was not intended?)

Then I experimented with different parts of the code, and I found three micro changes each of which at least erased the performance difference and often/always made the current GSL version faster(!). Here is the updated repro: https://godbolt.org/z/nd81q1

In there, you'll find three places where I use #if 1 to mark a code variation -- change that to #if 0 to try the alternative.

The three things I found are:

  • If you change the implementation of the Assert macro (which in GSL is the Expects and Ensures macro) to NOT use GCC __builtin_expect, the current gsl::not_null is faster than the proposed replacement. It's really surprising how often using __builtin_expect is a pessimization, even in cases like this where we really do know the test is nearly always true. In this case the reason is almost certainly because the __builtin_expect branch hint is a strict subset of information already available in the expression which includes a terminate branch from which the compiler can infer not only the branch likelihood but also perform an optimization assumption. SUGGESTION: We should consider removing __builtin_expect from GSL's contracts.

  • If you change get() to remove the assertion, the current gsl::not_null is faster than the proposed replacement more often than not. On seeing this, it does seem strange that we're doing a (redundant) assertion in get() which does not change the value. SUGGESTION: We should probably remove the Ensures(ptr_ != nullptr); contract check inside gsl::not_null<T>::get().

  • Finally, in the test harness itself, I found a weird effect that I didn't think about too hard: If you change the < to a > comparison when selecting which of the two array elements to modify, the current gsl::not_null is faster than the proposed replacement. But with the < it's slower. Weird. I didn't investigate further, but it does show how hard it is to make meaningful benchmarks.

I've verified that all three local changes have the same performance effect in the original 560-line repro

So this is what I've found so far, and the benchmark does not appear to support the proposed change in this Issue.

But it has led to potentially removing other inefficiencies in the GSL not_null and Expects/Ensures implementations, which improvements appear to also give you the improvement you're looking for.

@Amaroker
Copy link
Author

Thanks @hsutter for sharing your observations!

The performance improvements that you suggest would be satisfying.

I agree that there's no need to assert on each read (hence the removal) because the invariant has already been validated when the pointer was set in the constructor.

But I'm still curious. You previously suggested that not null test is to be considered free. But in this example, that's only true in the constructor. So, what really surprises me is that the assertion (not-null test without __builtin_expect) has different performance impact in a constructor and in a member function. Any thoughts on that?

@hsutter
Copy link
Collaborator

hsutter commented Nov 22, 2020

@Amaroker That's true, and I'm still not certain that there's a need to remove that check inside .get() -- because that one is only a performance improvement on some runs, and either of the other two changes by themselves also gives similar benefit.

I'm still investigating, but on some further experiments I ran today it does appear the extra check is free even in get() where it is redundant as long as every write to the pointer is checked. I'm now investigating whether it remains free even with a bit of additional complication in the check logic -- specifically I'm experimenting with calling a customizable violation handler to resolve a few of the other open issues.

Thanks again for the repro!

@KindDragon
Copy link

KindDragon commented Dec 12, 2022

Then I experimented with different parts of the code, and I found three micro changes each of which at least erased the performance difference and often/always made the current GSL version faster(!). Here is the updated repro: https://godbolt.org/z/nd81q1

I rewrote get() changes from your message to quick bench. Code without checks __builtin_unreachable() without std::terminate() produce fastest code:
https://quick-bench.com/q/UvsKwCYjjoSudjex3crTKeCa9zA

GCC 12.2
gcc

Clang 15.0
clang

Without std::terminate() the best option is using __builtin_unreachable() gives the fastest GCC code and with a small improvement for Clang 15 (more with Clang 14) .
With std::terminate() everything is slower, especially if we just call if (!(ptr_ != nullptr)) terminate();.

So, my suggestion to use if (ptr_ == nullptr) __builtin_unreachable(); at the start of get() method

Sorry for resurrecting issue that was last commented 2 years ago :)

EDIT:
After checking assembler code, compiler for any_ptr and non_null_unreachable generate same code.
So, we just need to remove check

@dmitrykobets-msft
Copy link
Member

Hi @KindDragon thanks for resurfacing this.
I see that in this issue there was a discussion of the null check in the not_null constructors, as well as in the get() method.
With the recent closure of isocpp/CppCoreGuidelines#2006, the nullcheck will be removed from the get() method.
Will removing this check resolve this issue, or do you still have performance concerns regarding the not_null constructors?

@KindDragon
Copy link

KindDragon commented Dec 13, 2022

Will removing this check resolve this issue, or do you still have performance concerns regarding the not_null constructors?

Yes it does

@KindDragon
Copy link

After checking the assembler code I found that the compiler is too smart and if we just take the address from a vector element, it knows that it is not null and removes the call to std::terminate.
Also, I modify f function to void f( const auto& px ) { if (px.get()) px->c++; else g++; } so it now generic and can work unique_ptr

Corrected test:
https://quick-bench.com/q/6KW0pLqTg0X-dB9XajbUxPR0HCI

GCC 12.2
gcc

Clang 15.0
clang

Will removing this check resolve this issue, or do you still have performance concerns regarding the not_null constructors?

So based on this new benchmark and assembler code from Godbolt https://godbolt.org/z/6oPjrMcdE I think we should use if (ptr_ == nullptr) __builtin_unreachable(); it generate better machine code for generic code that use it, it just completely eliminate else from f function

@hsutter
Copy link
Collaborator

hsutter commented Dec 13, 2022

Thanks everyone!

Based on feedback from the Guidelines editors in Guidelines issue 2006 and performance measurements, we'll remove the null check from .get().

However, we don't want to use the __builtin_unreachable() / __assume(false) approach. There be dragons there: See P2064, really the whole paper but even just sections 2.2.1 and 3.4.2, including where our back-end experts argue strongly against ever using __assume. The __assume can remove even more code that the user didn’t see or expect to be elided, including contract checks, and including code that precedes the assumption. So yes, it can result in a program that is even faster than just removing the not-null check, but that's because in those cases it's eliding even more stuff in surrounding code that the user might not expect, including contract checks that were supposed to be enabled. Also, it turns the worst case into undefined behavior: If the user abuses a not_null object by bypassing its member functions in the ways shown in the above-linked Guidelines issue 2006, to write a null into the not_null's value member anyway, then an optimizer that trusts the __assume(not-null) can in principle inject undefined behavior into the whole program when the assumption is not true, including before the call to .get(). For details, see paper P2064.

Microsoft's guidance is to not use __assume, see the Warning in the __assume docs page. C++23 is now standardizing this feature as [[assume]], but the MSVC team will likely implement it as nothing (i.e., ignore it) in release builds, and is considering implementing it as an assertion (not assumption) in debug builds because the consequences of the predicate ever being false can be so unexpected and severe.

@KindDragon
Copy link

Thank you for detailed answer

C++23 is now standardizing this feature as [[assume]], but the MSVC team will likely implement it as nothing (i.e., ignore it) in release build

Hmm, but if the programmer wrote [[assume]] shouldn't we trust him? This could only happen in new code and he probably wrote it to optimize the code

@hsutter
Copy link
Collaborator

hsutter commented Dec 14, 2022

Hmm, but if the programmer wrote [[assume]] shouldn't we trust him?

In C++ by default indeed we should trust them. And as a vendor we absolutely want to give them, as The Customer, the power tools they want. And we do trust the programmer even for some "assumptions" such as assuming that loop unrolling is okay or that vector code generation is okay.

However, our experience with the general __assume(expr) we're talking about here (i.e., assume that a variable has a value, or that a data relationship between variables holds) has been that programmers frequently don't understand the consequences and potential dangers. For example, most don't understand that if a reachable assumption can ever be false, it can do much more than just make a variable have incorrect values; it can make a variable appear to have multiple inconsistent simultaneous values (see the actual example on P2064 page 19, "my followup example"), and it can easily inject undefined behavior into the whole program including before the assumption is reached. For another example, even a number of standards committee members were surprised to learn that assumptions can elide (remove) assertions/contracts even when those checks were explicitly enabled. In the MSVC back-end team, nearly always when there's a customer question related to __assume, it's of the form "why did the compiler do that to me?" See again P2064 for more details, in particular the bottom half of page 20. This is why the MSDN page for __assume has long carried a big yellow warning label.

Why not just make the nonportable __assume also a no-op? Well, it is very occasionally useful, and at least __assume has the big benefit that it looks (and is) a bit scary/weird and nonportable, which should cause someone to crack the manual and think twice before using it.

But a standard [[assume]] spelling is really nice and easy to use, and that itself is a problem... it's much more likely to be misused unintentionally (even besides the usual short-term effect that every new C++ standard feature is initially overused). That said, we'll always be open to listening to customer experience: If as [[assume]] is used we see the anticipated misuse dangers not widely happening, and if customers report that their [[assume]]s are giving them benefits on other compilers that they aren't getting on ours (if the other compilers choose to do something with it), then that would be welcome new data.

@KindDragon
Copy link

KindDragon commented Dec 15, 2022

I appreciate your answer. If in the future it will be possible to implement [[assume]] by limiting its action to the current scope, it would allow to do optimization from original proposal, as it is already done in GCC 13 https://godbolt.org/z/qKMe5re6h now (it just use __builtin_unreachable but msvc can implement it in some limited way). Thanks for sharing your thoughts on this topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants