Optimize the bresenham algorithm to avoid an unneeded vector allocation #112731
groud wants to merge 1 commit into godotengine:master
Conversation
Here are the benchmark results, comparing the old implementation vs the new one, averaged over several runs. So well, x2.5 faster with the new implementation.
That looks like it just optimizes out the loop. I'd use some tool to ensure it actually loops and does something with the data.
It might be possible to solve the optimization issue by changing how the benchmark consumes the results.
It can also be verified by checking the output. Though there are some aspects this simple benchmark misses: the non-bulk nature of the new approach isn't really represented here (it covers the case of just getting the bulk result, but with longer code inside the iteration it might not be representative, temporal locality etc.), so it would be useful to benchmark the specific cases in the engine where it is used, to get real-world data.
I mean, for me it's very unlikely it gets optimized out. If it were, there would not be a need to spend 10ms doing nothing. That does not make sense. I can try another benchmark, but TBH, I don't really wanna spend a ton of time explaining why allocating a huge vector vs not allocating it makes a difference.
I noticed that too.
I agree, it's obvious that avoiding the allocation should help performance. However, CPU optimization is not always straightforward: changing something that should obviously be faster can turn out slower because of compiler or CPU architecture details. And even when something is genuinely optimized, there may be a bottleneck elsewhere, so the optimization has no visible effect. This complexity, and the need to weigh the potential benefit against potential regressions from the logic change, is why we expect proof of optimization for performance PRs.
Alright, I made the benchmarks anyway, comparing the old implementation vs the new one. Results: similar to before.
I ran them from the editor, with `dev_build=yes`.
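For illustration, here is a minimal sketch of this kind of micro-benchmark, assuming the existing `Geometry2D::bresenham_line()` and using `bresenham_line_iter()` as a placeholder name for the new allocation-free iterable (this is not necessarily the exact benchmark code that was run):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

#include "core/math/geometry_2d.h"

void benchmark_bresenham() {
	using Clock = std::chrono::steady_clock;
	constexpr int RUNS = 100000;
	int64_t sum = 0; // Consumed and printed below so the loops cannot be optimized away.

	const Clock::time_point t0 = Clock::now();
	for (int i = 0; i < RUNS; i++) {
		// Old path: allocates a Vector<Point2i> for every line.
		Vector<Point2i> points = Geometry2D::bresenham_line(Point2i(0, 0), Point2i(1000, 350));
		for (const Point2i &p : points) {
			sum += p.x + p.y;
		}
	}

	const Clock::time_point t1 = Clock::now();
	for (int i = 0; i < RUNS; i++) {
		// New path (placeholder name): iterates the line directly, no intermediate allocation.
		for (const Point2i &p : bresenham_line_iter(Point2i(0, 0), Point2i(1000, 350))) {
			sum += p.x + p.y;
		}
	}
	const Clock::time_point t2 = Clock::now();

	printf("old: %lld us, new: %lld us (checksum %lld)\n",
			(long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
			(long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
			(long long)sum);
}
```

The printed checksum is what keeps the compiler from discarding the loops, which is the "addition along with printing" point raised below.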
I mean, I understand that, but it's one of the few things we actually have tests for.
The tests might not cover all edge cases, so at the very least this would need functional validation that the change doesn't alter behavior in edge cases.
Thanks for running the test again! I think using addition (along with printing the result) should be enough to keep the compiler from optimizing the work away.
Regarding `dev_build=yes`: that means the optimizer is disabled. We should probably add this to the guidelines, since it's a common mistake to make with benchmarks.
I think it'd also be worth looking at whether there are other ways the algorithm could be improved, whether there are edge cases the tests might not handle correctly, and whether other optimization options might be applicable. For example, it might be possible to estimate the number of points and reserve a reasonable amount of storage ahead of time; it should be possible to estimate it mathematically from the precision and step.
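For reference, a minimal sketch of that estimate, assuming Godot's `Vector2i` type: a Bresenham line between two points always contains `MAX(|dx|, |dy|) + 1` points, so the output could be sized once up front.

```cpp
#include "core/math/vector2i.h"
#include "core/typedefs.h" // MAX()

// Number of points a Bresenham line from p_from to p_to produces (both endpoints included).
int bresenham_point_count(const Vector2i &p_from, const Vector2i &p_to) {
	const Vector2i delta = (p_to - p_from).abs();
	return MAX(delta.x, delta.y) + 1;
}
```

The existing implementation could then `resize()` the output `Vector<Point2i>` to that count and fill it through `ptrw()` instead of growing it with `push_back()`.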
I mean, at some point you'll have to trust me on that. The Bresenham algorithm is a very simple and straightforward algorithm to implement. There are not a lot of edge cases besides vertical/horizontal lines and single-point lines. I've tested the algorithm locally with many lines drawn (it works flawlessly), and the unit tests pass too. I am positive nothing will improve the algorithm more than avoiding an unnecessary allocation here (the algorithm was in fact designed to do no allocation; it's from 1962, when computers had really limited memory available). I don't know what to tell you more than that. It's a fix to an implementation I already knew was suboptimal when I implemented it; I've just had the opportunity to fix it today.
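As a concrete illustration, the edge cases mentioned here are easy to pin down with Godot's doctest-based test macros; whether these exact checks already exist in the test suite is an assumption.

```cpp
#include "core/math/geometry_2d.h"
#include "tests/test_macros.h"

TEST_CASE("[Geometry2D] Bresenham line edge cases") {
	// Single-point line: from == to should yield exactly one point.
	Vector<Point2i> single = Geometry2D::bresenham_line(Point2i(3, 3), Point2i(3, 3));
	CHECK(single.size() == 1);
	CHECK(single[0] == Point2i(3, 3));

	// Vertical line: x stays constant, every row is covered, endpoints included.
	Vector<Point2i> vertical = Geometry2D::bresenham_line(Point2i(0, 0), Point2i(0, 4));
	CHECK(vertical.size() == 5);

	// Horizontal line: y stays constant, every column is covered, endpoints included.
	Vector<Point2i> horizontal = Geometry2D::bresenham_line(Point2i(0, 0), Point2i(4, 0));
	CHECK(horizontal.size() == 5);
}
```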
Then the benchmarks should show it, including running profiling on the relevant methods affected by this. It's not zero work, but it's not a lot either to confirm this, and to confirm that the other considerations and potential performance aspects are not a problem. I get it, but I've seen plenty of very obvious optimizations evaporate when running release builds and encountering the real world.
Anyway, I reran the benchmark in a release build. In fact, it's more like a x20 improvement.
That sounds pretty realistic! There's of course still a possibility that some of the implementation was optimized away due to the constants used in your benchmark, but this is along the lines of what I would expect from this change. So I think this suffices as proof of optimization.
Note that, for now, the optimization is not exposed to users. I can do that in another PR if we feel the performance improvement is worth it.
Force-pushed from 4b244bc to ff55c33.
As discussed, I've pushed a change to improve the readability by adding a
See #105292, which was just merged and might benefit from the same change.
I'm having second thoughts about the implementation. While I think the current one works, maybe it would need to be used that way instead: What do you think?
If you want to avoid using two classes, I would prefer the following:

    Iterable<BresenhamIterator> bresenham_iter(Vector2i p_from, Vector2i p_to) {
        return Iterable<BresenhamIterator>(BresenhamIterator(p_from, p_to), BresenhamIterator());
    }

I introduced `Iterable` for exactly this kind of use case.
Oooh, I had no clue we had that. Yeah, that seems like a good plan, I'll have a look.
Actually, I say that with the expectation of continuing to use the C++ iterator syntax.
Yeah, I do agree. I think an additional variable is fine; the main problem IMO is that the two nested classes make the code quite hard to read, and it's a bit too much additional code just to avoid an allocation. So if I can at least shrink it a bit, that would be nice.
Force-pushed from ff55c33 to 480398f.
Alright, updated the code: we went from 59 added LoC to 35, and we have a single class now. I think it looks better.
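For readers following along, here is a sketch of what such a single-class iterable can look like; the class name, member layout, and exact details are illustrative and not necessarily what this PR merges.

```cpp
#include "core/math/vector2i.h"
#include "core/typedefs.h" // MAX()

// One class acts as both the range and its own iterator, so a range-based for loop
// can walk the line without ever allocating an intermediate vector.
class BresenhamLineIter {
	Vector2i current;
	Vector2i delta;
	Vector2i sign;
	int err = 0;
	int remaining = 0; // Points left to emit; 0 marks the end sentinel.

public:
	BresenhamLineIter(const Vector2i &p_from, const Vector2i &p_to) {
		current = p_from;
		delta = (p_to - p_from).abs();
		sign = (p_to - p_from).sign();
		err = MAX(delta.x, delta.y) / 2;
		remaining = MAX(delta.x, delta.y) + 1; // Both endpoints are included.
	}

	BresenhamLineIter begin() const { return *this; }
	BresenhamLineIter end() const {
		BresenhamLineIter sentinel = *this;
		sentinel.remaining = 0; // Only `remaining` is compared; see operator!=.
		return sentinel;
	}

	Vector2i operator*() const { return current; }

	bool operator!=(const BresenhamLineIter &p_other) const {
		return remaining != p_other.remaining;
	}

	BresenhamLineIter &operator++() {
		// Standard Bresenham step along the major axis, no allocation involved.
		if (delta.x >= delta.y) {
			err -= delta.y;
			if (err < 0) {
				current.y += sign.y;
				err += delta.x;
			}
			current.x += sign.x;
		} else {
			err -= delta.x;
			if (err < 0) {
				current.x += sign.x;
				err += delta.y;
			}
			current.y += sign.y;
		}
		remaining--;
		return *this;
	}
};
```

Making the class its own `begin()`/`end()` is what removes the second nested class while keeping the range-based for syntax.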
Force-pushed from 480398f to 0da7c5f.
When I implemented the bresenham algorithm 4 years ago, I didn't really put a lot of effort into optimizing it. This is now fixed for internal use, as the new implementation does not allocate a `Vector<Point2i>` anymore. I assume this should improve performance a bit in the TileMap editor when drawing lines. Instead of building a vector, this PR implements a C++ iterable that can be used that way:
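A hypothetical usage, with names (`BresenhamLineIter`, `draw_cell()`, the cell variables) borrowed from the sketch above rather than from the PR itself:

```cpp
// Iterate the cells of the line directly, with no intermediate
// Vector<Point2i> allocation per drawn line.
for (const Vector2i cell : BresenhamLineIter(drag_start_cell, hovered_cell)) {
	draw_cell(cell); // Placeholder for the TileMap editor's per-cell work.
}
```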
Note that we have tests for the algorithm, which all pass. So I think we can safely assume this new implementation works fine.