testing: add testing.B.Loop for iteration #61515
Comments
It's not clear to me which optimizations those would be. Certainly an unused return value should not be eliminated, but should function arguments be inlined? I have seen benchmarks that explicitly want to check inlining behavior, and also benchmarks that are accidentally inlinable. I expect that as inlining improves (#61502), the problem of accidentally-inlined arguments will only get worse. (I assume the user would have to create a closure, and then call that closure within the benchmark loop?)
I think we would implicitly apply Keep to every function argument and result in the closure passed directly to Loop. I'm pretty sure that's what @rsc was thinking, too.
I definitely agree that this problem is only going to get worse. I'm not sure we need to inhibit inlining, but we need to inhibit optimizations that propagate information across this inlining boundary. That's why I think it works to think of this as applying an implicit Keep to every function argument and result in the closure, and to think of Keep as having a used and non-constant result. (Obviously that's not a very precise specification, though!)
Right. I think if the user specifically wants to benchmark the effect of constant propagation into an inlined function, they would add another layer of function call. We'd only apply the implicit Keep to the direct argument to Loop. That's fairly subtle, but I think such benchmarks are also rare. My other concern with the deoptimization aspect of this is what to do if Loop is called with something other than a function literal. We could say the implicit Keep only applies if it's called directly with a function literal, but that feels... even weirder. 😅 It may be that …
When b.Loop inlines, …
This proposal has been added to the active column of the proposals project.
The only real question about this was the auto-Keep. All the other benefits listed in the top comment are clear wins. Given that Keep is now likely accept, it seems like this one can be likely accept too.
Note that the current proposal is used as `for b.Loop() { ... }`.
Based on the discussion above, this proposal seems like a likely accept.
b.Loop has real benefits separate from Keep, so it's worth doing even if we still have questions about Keep. So this seems like it can move to accept.
If #61405 is accepted it will be possible to range over integers. From the description:
So there might soon be a new, better alternative for benchmark iteration. Does that argue for delaying, to see whether another alternative is needed?
@timothy-king I don't think that matters here. Range-over-integer is fairly minor syntactic sugar. The issues that …
No change in consensus, so accepted. 🎉
I usually wrap a benchmarked op into a function that can't be inlined. It looks like passing a closure to the Loop function would eliminate that requirement, as the benchmarked body would be inside a function that can't be inlined anyway (unless the compiler would handle Loop in some special way), as opposed to being "inlined" directly into the benchmark function.
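A minimal sketch of the wrapper pattern this comment describes (the helper and benchmark names are made up for illustration); the `//go:noinline` directive keeps the op from being folded into, or eliminated from, the `b.N` loop:

```go
package example_test

import "testing"

// benchmarkedOp is a stand-in for the real operation; the directive below
// prevents the compiler from inlining it into the benchmark loop.
//
//go:noinline
func benchmarkedOp(x, y int) int {
	return x * y
}

func BenchmarkOp(b *testing.B) {
	for i := 0; i < b.N; i++ {
		benchmarkedOp(i, 42)
	}
}
```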
I mentioned in the original post that benchmarks with expensive setup can run for far longer than the specified `-benchtime`, because setup time is not counted toward the `-benchtime` threshold. For time spent in setup, this effect isn't terrible because it's only amplified by the number of times it needs to retry the benchmark. It's still nice that `b.Loop` runs setup only once.

This effect is much worse if the timer is stopped during the benchmark loop itself, such as to do some per-iteration setup or cleanup. If the benchmark spends 90% of its real time with the timer stopped, it'll run for 10 times longer than expected. It may be that …

Benchtime has a few goals. One is to minimize the measurement effect by amortizing it to almost zero. That happens pretty quickly; probably well under a millisecond. But also, if you're starting and stopping the timer inside the benchmark loop, you've already given up on that amortization. Another is to roughly capture things that happen over a longer time scale, like garbage collection, scheduling, and cache effects. But these are likely to be captured nearly as well by 1 second of real time as they are by 1 second of benchmark time.

Given all of this, I think we could make `-benchtime` a cap on real time rather than just measured benchmark time.
Change https://go.dev/cl/608798 mentions this issue:
Are you imagining some sort of limit on this? It is fine if the timer is stopped only for short periods, but if it is stopped for long periods then you might only run a single iteration. Maybe that is fine? Obviously stopping the timer for a long time every iteration isn't practical, but they might do so every M iterations. For example, I recently wrote a map iter+delete benchmark that looked something like this.
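The code block from this comment didn't survive extraction; the following is a rough, hypothetical reconstruction of the shape being described (the benchmark name, map size, and details are assumptions), with the timer stopped only once every `size` iterations to refill the map:

```go
package example_test

import "testing"

func BenchmarkMapIterDelete(b *testing.B) {
	const size = 1024
	m := make(map[int]int, size)
	for i := 0; i < size; i++ {
		m[i] = i
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Measured work: find one key via iteration and delete it.
		for k := range m {
			delete(m, k)
			break
		}
		if len(m) == 0 {
			// Refill the map outside the measured region, which only
			// happens once every "size" iterations.
			b.StopTimer()
			for j := 0; j < size; j++ {
				m[j] = j
			}
			b.StartTimer()
		}
	}
}
```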
(The STW cost of StopTimer made this impractical.)
Change https://go.dev/cl/612043 mentions this issue:
Change https://go.dev/cl/612835 mentions this issue:
Initial implementation for testing.B.Loop. Right now the calculation of b.N is still done in the old-fashioned way; as of now b.Loop is merely an alias for the old loop over b.N.

For #61515.

Change-Id: If211d0acc5f0c33df530096dceafe0b947ab0c8e
Reviewed-on: https://go-review.googlesource.com/c/go/+/608798
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Junyang Shao <[email protected]>
Run-TryBot: Junyang Shao <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
With b.Loop() in place, the time measurement of loop scaling can be made tighter. By identifying the first call to b.Loop(), we can avoid measuring the expensive ramp-up time by resetting the timer right before the loop starts. The remaining loop-scaling logic of the b.N-style loop is largely reused.

For #61515.

Change-Id: Ia7b8f0a8838f57c00ac6c5ef779d86f8d713c9b6
Reviewed-on: https://go-review.googlesource.com/c/go/+/612835
Reviewed-by: Cherry Mui <[email protected]>
Reviewed-by: Junyang Shao <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
After discussion with @cherrymui, we suggest keeping the loop body's last statement, assignment, or call expression alive. We will also potentially add a vet check to detect incorrect usage of b.Loop.
In #61515 (comment), we were thinking "we would implicitly apply Keep to every function argument and result in the closure passed directly to Loop." That was when we were thinking of Loop as an iterator, so now the equivalent would be "implicitly apply Keep to every function argument and result in the body of a for b.Loop() loop."

What's the reason for only keeping "its last statement, assignment or callexpr" alive? (Not saying this is wrong, but I do want to make sure the reasoning is documented.)
Personally I feel like Keep on every function argument and return is overly aggressive. In particular, I think I'd find it annoying to have intermediate values (e.g., the return from one function passed to another) forcibly kept alive. But as Bryan mentions in #61515 (comment), I think this is mostly a personal preference and there are footguns in either choice.

As a baseline, I think everyone is in agreement that unused function return values should not be eliminated. This feels like the primary pain point to me.

As an aside, I think limiting specifically to function return values is odd. Consider the examples below.
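The code examples this comment refers to were lost in extraction; a hedged reconstruction of the two benchmarks being contrasted (only the names BenchmarkAdd and BenchmarkCall come from the text, the bodies are assumptions):

```go
package example_test

import "testing"

func add(x, y int) int { return x + y }

func BenchmarkAdd(b *testing.B) {
	x, y := 1, 2
	for i := 0; i < b.N; i++ {
		_ = x + y // plain addition: no function return value involved
	}
}

func BenchmarkCall(b *testing.B) {
	x, y := 1, 2
	for i := 0; i < b.N; i++ {
		add(x, y) // same computation, but the unused value is a function result
	}
}
```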
I don't see why BenchmarkAdd should be allowed to eliminate the addition but BenchmarkCall is not. My preferred mental model at the moment is that unused values must not be eliminated, but if they are used they are not required to be materialized. I will admit that this is pretty complicated and having complicated rules is not good for understandability.
The reason for applying it to function arguments is to prevent unintended constant folding into a function being benchmarked. There's discussion about this on #61179 if you search for "constant" (there are a few other mentions of "constant", but almost all of them are related to this). It might be that this isn't worth the added specification complexity, and users can always manually invoke testing.Keep.

I could see saying that the result of every Statement in the ForStmt Block is treated as if it is used. In practice most of them don't have a result, so this would apply to ExpressionStmt (which includes call statements), Assignment, and ShortVarDecl. I worry that saying only the last one is kept is a footgun because of the asymmetry of statements within the block.
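For illustration (not from the original thread), a small example of the constant-folding hazard described above; the function and benchmark names are invented:

```go
package example_test

import "testing"

func area(w, h int) int { return w * h }

// Once area is inlined, area(3, 4) folds to the constant 12, the unused
// result is dead code, and the loop ends up measuring only loop overhead.
func BenchmarkArea(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = area(3, 4)
	}
}
```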
@aclements
If we do "implicitly apply Keep to every function argument and result in the body of a for b.Loop() loop", …
Keeping the loop body's last statement, assignment, or call expression alive should be sufficient for the users. :D Just a heads up for @prattmic: we do keep assignments alive too, so your two examples will be the same.
I see the point of keeping the function arguments alive now, but that does feel like a better fit for testing.Keep. As for the asymmetry of statements in the block, it is an actual problem: right now my implementation keeps alive the last statement, assignment, or call expression in the loop body.
I'm not too concerned about constant arguments. If one passes a constant argument, and the function is inlineable and can constant-fold the argument, I think it is okay to do the optimization. It is reasonable that one wants to use a benchmark to show a function is fast given a constant argument. And if constant folding is not desired, it is easy to make the constant opaque to the compiler, with testing.Keep or some other method.

If there are some temporaries during the benchmark computation, I think it is okay to allow the compiler to optimize out the temporaries and just keep the final result. Usually the final result is part of the last statement; that's why I thought about the last statement. If the last statement is not an expression statement or an assignment, perhaps we want to look at the second-to-last statement? Or, for the case that the last statement is a block statement, recursively look at the last statement in the block? That does make the specification complicated, though.
I'm not so sure about the "last statement" part. It's not too uncommon for a benchmark loop to end with some kind of "bookkeeping" that would break that check. For example, I've added manual calls to RDTSC to measure latencies.
That is true. I'm not sure either. But I also feel keeping all statements "live" may be too much. Or maybe not? Maybe we can analyze some existing benchmarks.
I am experimenting with the implementation, and feel this might be a better semantic: for the loop body of testing.B.Loop, no inlining will be performed. (Inlining could still happen inside the functions called in the loop body.) This way the implementation will live purely in the compiler frontend, with no need for …

This will also solve the pain point raised in #61179.
For the loop body guarded by testing.B.Loop, we disable function inlining and devirtualization inside. The only legal form to be matched is `for b.Loop() {...}`.

For #61515

Change-Id: I2e226f08cb4614667cbded498a7821dffe3f72d8
Reviewed-on: https://go-review.googlesource.com/c/go/+/612043
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Bypass: Junyang Shao <[email protected]>
Commit-Queue: Junyang Shao <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
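A sketch of what the matched form looks like in practice under the semantics this CL describes (the benchmark itself is invented for illustration; `b.Loop` requires Go 1.24 or newer):

```go
package example_test

import "testing"

func sum(xs []int) int {
	s := 0
	for _, x := range xs {
		s += x
	}
	return s
}

// Only the direct `for b.Loop() { ... }` form is matched; per the CL above,
// calls inside this loop body are protected from inlining and
// devirtualization, so the call to sum isn't folded away.
func BenchmarkSum(b *testing.B) {
	xs := make([]int, 1024) // setup: runs once, not measured
	for b.Loop() {
		sum(xs)
	}
}
```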
The proposal also mentions collecting distributions rather than just averages for benchmark metrics, by repeating a short measurement loop many times within b.Loop.
I.e., not calling the top-level BenchmarkXXX functions multiple times, but just handling it in b.Loop runs. @JunyangShao, would you like to look into that? (Feel free to reopen this one, or open a new issue for tracking.)
Change https://go.dev/cl/627755 mentions this issue:
testing.B.Loop now does its own loop scheduling without interacting with b.N; b.N will be updated to the actual number of iterations b.Loop ran when b.Loop returns false. This CL also adds tests for a fixed iteration count (the benchtime=100x case) and ensures that b.Loop() is inlined.

For #61515

Change-Id: Ia15f4462f4830ef4ec51327520ff59910eb4bb58
Reviewed-on: https://go-review.googlesource.com/c/go/+/627755
Reviewed-by: Michael Pratt <[email protected]>
Commit-Queue: Junyang Shao <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
I'm a bit surprised that b.Loop isn't an iterator (for use with range-over-func).
We definitely considered this; as you can see from the original proposal, making it an iterator doesn't have any obvious benefits because B.Loop doesn't yield anything, but it does have drawbacks in terms of complexity. As a simple function, it's much easier to implement such that the compiler can clearly inline B.Loop and reduce its overhead to almost nothing. As an iterator, it could probably still do that, but it would involve quite a bit more complexity.
Change https://go.dev/cl/632655 mentions this issue:
A "-N" suffix is left out when GOMAXPROCS is 1. Also match at least one space (\s+ instead of \s*), remove the trailing '.*' (it's a no-op), and make the test error message style more consistent while here.

For #61515.
Fixes #70627.

Change-Id: Id0a17478ac31e2934a663dd0d3b1b37f24974989
Cq-Include-Trybots: luci.golang.try:gotip-plan9-386
Reviewed-on: https://go-review.googlesource.com/c/go/+/632655
Reviewed-by: Junyang Shao <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Update July 26, 2023: See this comment for the latest Loop API. The motivation and arguments for this proposal still otherwise apply in full, but the API has switched to the `for b.Loop() { ... }` form proposed in the Alternatives section.

Currently, Go benchmarks are required to repeat the body of the benchmark `(*testing.B).N` times. This approach minimizes measurement overhead, but it's error-prone and has many limitations:

- As we discovered in #38677 (cmd/vet: flag benchmarks that don't use b.N), it's surprisingly common for benchmarks to simply forget to use `b.N`.
- While a vet check can pretty reliably detect forgotten uses of `b.N`, there's some evidence that many benchmarks use `b.N` incorrectly, such as using it to size the input to an algorithm rather than as an iteration count.
- Because the benchmark framework doesn't know when the `b.N` loop starts, if a benchmark has any non-trivial setup, it's important for it to use `(*testing.B).ResetTimer`. It's generally not clear what counts as non-trivial setup, and very hard to detect when `ResetTimer` is necessary.

Proposal

I propose that we add a new method, `Loop`, to `testing.B` and encourage its use over `b.N`. This API has several advantages over `b.N` loops:

- It cannot be misused for something other than an iteration count. It's still possible for a benchmark to forget entirely to use `b.Loop`, but that can be detected reliably by vet.
- The benchmarking framework can record time and other metrics around only the benchmarked operation, so benchmarks no longer need to use `ResetTimer` or be careful about their setup.
- Iteration ramp-up can be done entirely within `b.Loop`, which means that benchmark setup before `b.Loop` will happen once and only once, rather than at each ramp-up step. For benchmarks with non-trivial setup, this saves a lot of time. Notably, benchmarks with expensive setup can run for far longer than the specified `-benchtime` because of the large number of ramp-up steps (setup time is not counted toward the `-benchtime` threshold). It's also less error-prone than using a global `sync.Once` to reduce setup cost, which can have side effects on GC timing and other benchmarks if the computed results are large.
- As suggested by @rsc, `b.Loop` could be a clear signal to the compiler not to perform certain optimizations in the loop body that often quietly invalidate benchmark results.
- In the long term, we could collect distributions rather than just averages for benchmark metrics, which would enable deeper insights into benchmark results and far more powerful statistical methods, such as stationarity tests. The way this would work is that `b.Loop` would perform iteration ramp-up only to the point where it can amortize its measurement overhead (ramping up to, say, 1ms), and then repeat this short measurement loop many times until the total time reaches the specified `-benchtime`. For short benchmarks, this could easily gather 1,000 samples, rather than just a mean.

Alternatives

This proposal is complementary to `testing.Keep` (#61179). It's an alternative to `testing.B.Iterate` (originally in #48768, with discussion now merged into #61179), which essentially combines Keep and Loop. I believe Iterate could have all of the same benefits as Loop, but it's much clearer how to make Loop low-overhead. If Loop implicitly inhibits compiler optimizations in the body of its callback, then it has similar deoptimization benefits as Iterate. I would argue that Loop has a shallower learning curve than Iterate, though probably once users get used to either they would have similar usability.

If #61405 (range-over-func) is accepted, it may be that we want the signature of `Loop` to be `Loop(op func() bool) bool`, which would allow benchmarks to be written as a range-over-func loop. It's not clear to me what this form should do if the body attempts to `break` or `return`.

Another option is to mimic `testing.PB.Next`. Here, the signature of Loop would be `Loop() bool` and it would be used as `for b.Loop() { ... }`. This is slightly harder to implement, but perhaps more ergonomic to use. It's more possible to misuse than the version of Loop that takes a callback (e.g., code could do something wrong with the result, or break out of the loop early). But unlike `b.N`, which is easy to misuse, this seems much harder to use incorrectly than to use correctly.

cc @bcmills @rsc
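The code blocks in the original post were lost in extraction, including the originally proposed callback-taking method (its exact signature isn't reproduced here). For reference, a sketch of the `for b.Loop()` form the proposal switched to, which is what shipped as `(*testing.B).Loop` in Go 1.24; the benchmark body is invented:

```go
package example_test

import "testing"

// doWork stands in for the operation being benchmarked.
func doWork() int {
	s := 0
	for i := 0; i < 1000; i++ {
		s += i
	}
	return s
}

func BenchmarkDoWork(b *testing.B) {
	// Expensive setup would go here; with b.Loop it runs once and is
	// excluded from the measured region.
	for b.Loop() {
		doWork()
	}
}
```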