
Optimize the bresenham algorithm to avoid an unneeded vector allocation#112731

Open
groud wants to merge 1 commit into godotengine:master from groud:optimize_bresenham

Conversation

@groud
Member

@groud groud commented Nov 13, 2025

When I implemented the Bresenham algorithm four years ago, I didn't put much effort into optimizing it. This is now fixed for internal use, as the new implementation no longer allocates a Vector<Point2i>. This should improve performance a bit in the TileMap editor when drawing lines.

Instead, this PR implements a C++ iterable that can be used like this:

for (Vector2i point : Geometry2D::Bresenham(from, to)) {
    ...
}

Note that we have tests for the algorithm, and they all pass, so I think we can safely assume the new implementation works fine.
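A range-for compatible iterable along these lines can be sketched as follows. Note this is a hypothetical illustration of the begin()/end() iterator pattern, not the PR's actual Geometry2D::Bresenham code; the `Vec2i` and `BresenhamRange` names are made up for the sketch:

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct Vec2i {
	int x = 0, y = 0;
	bool operator==(const Vec2i &p_o) const { return x == p_o.x && y == p_o.y; }
};

// Hypothetical sketch: a range-for compatible Bresenham line iterable.
class BresenhamRange {
	Vec2i from, to;

public:
	BresenhamRange(Vec2i p_from, Vec2i p_to) :
			from(p_from), to(p_to) {}

	class Iterator {
		Vec2i current, target;
		int dx = 0, dy = 0, sx = 0, sy = 0, err = 0;
		bool done = false;

	public:
		Iterator() : done(true) {} // End sentinel: carries no state.
		Iterator(Vec2i p_from, Vec2i p_to) :
				current(p_from), target(p_to) {
			dx = std::abs(target.x - current.x);
			dy = -std::abs(target.y - current.y);
			sx = current.x < target.x ? 1 : -1;
			sy = current.y < target.y ? 1 : -1;
			err = dx + dy;
		}
		Vec2i operator*() const { return current; }
		Iterator &operator++() {
			if (current == target) {
				done = true; // The target point was already yielded.
				return *this;
			}
			int e2 = 2 * err;
			if (e2 >= dy) { err += dy; current.x += sx; }
			if (e2 <= dx) { err += dx; current.y += sy; }
			return *this;
		}
		bool operator!=(const Iterator &p_o) const { return done != p_o.done; }
	};

	Iterator begin() const { return Iterator(from, to); }
	Iterator end() const { return Iterator(); }
};
```

All the line-walking state lives in the iterator itself, so `for (Vec2i p : BresenhamRange(from, to))` allocates nothing on the heap.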

@groud groud added this to the 4.x milestone Nov 13, 2025
@groud groud requested review from a team as code owners November 13, 2025 13:00
@groud
Member Author

groud commented Nov 13, 2025

Here are the benchmark results:

	{
		Vector2i a;
		uint64_t old_time = OS::get_singleton()->get_ticks_usec();
		Vector<Vector2i> points = Geometry2D::bresenham_line(Vector2i(), Vector2i(1000000,5000));
		for (const Vector2i &v : points) {
			a = v; // Do something
		}
		uint64_t new_time = OS::get_singleton()->get_ticks_usec();
		print_line("Before", new_time - old_time);
	}

vs

	{
		Vector2i a;
		uint64_t old_time = OS::get_singleton()->get_ticks_usec();
		for (const Vector2i &v : Geometry2D::Bresenham(Vector2i(), Vector2i(1000000,5000))) {
			a = v; // Do something
		}
		uint64_t new_time = OS::get_singleton()->get_ticks_usec();
		print_line("After", new_time - old_time);
	}

gets me averaging several runs:

Before ~ 25ms
After ~ 10ms

So, about 2.5x faster with the new implementation.
This is with a single huge line though. For smaller lines it's harder to quantify, as both functions would be quite fast anyway.

@AThousandShips
Member

It looks like the compiler could just optimize out that loop; I'd use some tool to ensure it actually iterates and does something with the data.

@Ivorforce
Member

Ivorforce commented Nov 13, 2025

You might be able to solve the optimization issue by declaring Vector2i a; as volatile. But I usually declare some function in a header, define it in another cpp file, and call it with the final value (v). Since calls into other compilation units are black boxes, that's often a very cheap way to force the compiler to keep the result. Just make sure LTO is disabled.
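The black-box trick described here can be sketched like this (hypothetical names; benchmark libraries such as Google Benchmark ship the same idea as `benchmark::DoNotOptimize`):

```cpp
#include <cassert>
#include <cstdint>

// In real use, escape() would be *defined* in a separate .cpp file so the
// optimizer cannot see its body (and LTO must be off). It is defined inline
// here only to keep the sketch self-contained, which defeats the trick.
int64_t g_sink = 0;
void escape(int64_t p_value) {
	g_sink = p_value;
}

// Because the loop's result is passed to escape(), the compiler must
// actually compute it and cannot delete the loop as dead code.
int64_t sum_to(int p_n) {
	int64_t acc = 0;
	for (int i = 0; i < p_n; i++) {
		acc += i;
	}
	escape(acc);
	return acc;
}
```

An alternative that needs no second translation unit is an empty inline-asm barrier such as `asm volatile("" : : "g"(&value) : "memory");`, which tells the compiler the value may be observed externally.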

@AThousandShips
Member

AThousandShips commented Nov 13, 2025

It can also be verified by checking the output.

Though this simple benchmark misses some aspects: it only covers the bulk case, while the real uses run longer code inside each iteration, where temporal locality etc. might make it less representative. So it would be useful to benchmark the specific places in the engine where this is used, to get real-world data.

@groud
Member Author

groud commented Nov 13, 2025

You might be able to solve the optimization issue by declaring Vector2i a; as volatile. But I usually declare some function in a header, define it in another cpp file, and call it with the final value (v). Since calls into other compilation units are black boxes, that's often a very cheap way to force the compiler to keep the result. Just make sure LTO is disabled.

I mean, it seems very unlikely to me that it gets optimized out. If it were, it wouldn't spend 10 ms doing nothing; that would not make sense.

I can try another benchmark, but TBH, I don't really wanna spend a ton of time explaining why allocating a huge vector vs not allocating it makes a difference.

@Ivorforce
Member

I mean, it seems very unlikely to me that it gets optimized out. If it were, it wouldn't spend 10 ms doing nothing; that would not make sense.

I noticed that too.
To be honest, the fact that it didn't finish in 0 ms is a bit surprising to me. GCC and Clang are very good at recognizing when a variable or an entire iteration can be omitted from the build.
Did you run the benchmark in a dev build, or in an optimized build?

I can try another benchmark, but TBH, I don't really wanna spend a ton of time explaining why allocating a huge vector vs not allocating it makes a difference.

I agree, it's obvious that avoiding the allocation should help performance.

However, CPU optimization is not always straightforward. Something that should obviously be faster can turn out slower in practice because of compiler or CPU architecture details. And even if something was optimized, there may be a bottleneck elsewhere, so the optimization has no effect.

This complexity, and the need to weigh the potential benefit against potential regressions from the logic change, is why we expect proof of the optimization for performance PRs.

@groud
Member Author

groud commented Nov 13, 2025

Alright, made the benchmarks anyway with this:

	{
		Vector2i a;
		uint64_t old_time = OS::get_singleton()->get_ticks_usec();
		Vector<Vector2i> points = Geometry2D::bresenham_line(Vector2i(), Vector2i(1000000,5000));
		for (const Vector2i &v : points) {
			a += v; // Do something
		}
		uint64_t new_time = OS::get_singleton()->get_ticks_usec();
		print_line("Before", new_time - old_time);
		print_line(a);
	}

vs

	{
		Vector2i a;
		uint64_t old_time = OS::get_singleton()->get_ticks_usec();
		Geometry2D::Bresenham points = Geometry2D::Bresenham(Vector2i(), Vector2i(1000000,5000));
		for (const Vector2i &v : points) {
			a += v; // Do something
		}
		uint64_t new_time = OS::get_singleton()->get_ticks_usec();
		print_line("After", new_time - old_time);
		print_line(a);
	}

Results:

Before: ~26ms
After: ~11ms

So well, similar as before.

Did you run the benchmark in a dev build, or in an optimized build?

I ran them from the editor, with `dev_build=yes`.

Considering this complexity, weighing the potential benefit against potential regressions because of the logical change, is why we expect proofs of optimization for performance PRs.

I mean, I understand that, but this is one of the few things we have actual tests for, in test_geometry_2d.h. I think we're pretty safe on the regression side (unless there are situations not covered by them; though it's a two-argument function, so there aren't that many cases to cover, I guess).

@AThousandShips
Member

The tests might not cover all edge cases, so at the least this would need functional validation that the change doesn't affect behavior in edge cases.

@Ivorforce
Member

Ivorforce commented Nov 13, 2025

Thanks for running the test again! I think using addition (along with printing a) may work to 'trick' the optimizer.

I ran them from the editor, with `dev_build=yes`.

That means that the optimizer is disabled. We should probably add this to the guidelines, since it's a common mistake to make with benchmarks.
I hate to ask this, but please run the benchmark again with a non-dev build (debug or release; release is preferred).

@AThousandShips
Member

AThousandShips commented Nov 13, 2025

I think it'd also be worth looking at whether there are other ways the algorithm could be improved, whether there are edge cases the tests might not handle correctly, and whether other optimizations might be applicable. For example, it might be possible to estimate the number of points mathematically from the endpoints and step, and reserve a reasonable amount of storage ahead of time.
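On the reserve idea: for an integer Bresenham line the point count is actually exact, not an estimate, since the algorithm emits exactly one point per step along the major axis. A small sketch, with a hypothetical helper name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Exact number of points a Bresenham line between two integer endpoints
// produces: one per unit step along the major axis, plus the start point.
int bresenham_point_count(int p_x0, int p_y0, int p_x1, int p_y1) {
	return std::max(std::abs(p_x1 - p_x0), std::abs(p_y1 - p_y0)) + 1;
}
```

A vector-returning bresenham_line() could reserve() or resize() with this value up front, avoiding reallocations as the vector grows.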

@groud
Member Author

groud commented Nov 13, 2025

I think it'd also be worth looking at whether there are other ways the algorithm could be improved, whether there are edge cases the tests might not handle correctly, and whether other optimizations might be applicable. For example, it might be possible to estimate the number of points mathematically and reserve a reasonable amount of storage ahead of time.

I mean, at some point you'll have to trust me on that. The Bresenham algorithm is very simple and straightforward to implement. There aren't many edge cases besides vertical/horizontal lines and single-point lines. I've tested the algorithm locally with many lines drawn (it works flawlessly), and the unit tests pass too. I am positive no improvement will be more effective than avoiding the unnecessary allocation here (the algorithm was in fact designed to do no allocation; it's from 1962, when computers had very limited memory available).

I don't know what more to tell you than that. It's a fix to an implementation I already knew was suboptimal when I wrote it; I've just had the opportunity to fix it today.
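For reference, the core loop of the 1962 algorithm really does need only a handful of integer locals and no heap storage. A textbook sketch (not the PR's code), with the caller supplying a visitor instead of collecting points:

```cpp
#include <cassert>
#include <cstdlib>

// Textbook integer Bresenham: visits every point from (x0, y0) to (x1, y1)
// using only a few local ints -- no allocation at all.
template <typename F>
void bresenham(int x0, int y0, int x1, int y1, F visit) {
	int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
	int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
	int err = dx + dy;
	while (true) {
		visit(x0, y0);
		if (x0 == x1 && y0 == y1) {
			break; // Target point has been visited.
		}
		int e2 = 2 * err;
		if (e2 >= dy) { // Step along x.
			err += dy;
			x0 += sx;
		}
		if (e2 <= dx) { // Step along y.
			err += dx;
			y0 += sy;
		}
	}
}
```

Vertical, horizontal, and single-point lines all fall out of the same loop: the corresponding step condition simply never fires.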

@AThousandShips
Member

AThousandShips commented Nov 13, 2025

Then the benchmarks should show it, including profiling of the relevant methods affected by this. It's not zero work, but it's also not a lot, to confirm this and to confirm that the other considerations and potential performance aspects are not a problem. I get it, but I've seen plenty of very obvious optimizations evaporate when run in release builds against the real world.

@groud
Copy link
Member Author

groud commented Nov 13, 2025

Anyway, I reran the benchmark in release. In fact, it's more like a 15x improvement:

Before: ~9ms
After: ~0.6ms

@Ivorforce
Member

That sounds pretty realistic! There's of course still a possibility that some of the implementation was optimized away due to the constants used in your benchmark, but for me this is along the lines of what I would expect from this change. So I think this suffices as a proof of optimization.
Thanks again!

@groud
Member Author

groud commented Nov 13, 2025

Note that, for now, the optimization is not exposed to users, as bresenham_line still pushes everything into a vector anyway. It should be doable to expose it, though, as we could expose the Bresenham class as a custom iterator. It's not a common pattern in the API, however.

I can do it in another PR if we feel the performance improvement is worth it.

@groud groud force-pushed the optimize_bresenham branch from 4b244bc to ff55c33 on November 13, 2025 16:54
@groud
Member Author

groud commented Nov 13, 2025

As discussed, I've pushed a change to improve readability by adding a bresenham variable where it made sense.

@akien-mga
Member

See #105292 which was just merged and might benefit from the same change.

@groud
Member Author

groud commented Nov 14, 2025

I'm having second thoughts about the implementation. While I think the for ( : ) syntax is nice, I kind of hate that we have to define two classes for it. I am thinking that maybe we can get a simpler implementation with only one class.

Maybe something that would need to be used this way though:

for (Bresenham b = Bresenham(1000, 500); !b.is_end(); b.next()) {
    Vector2i point = b.value();
}

What do you think?

@Ivorforce
Member

Ivorforce commented Nov 14, 2025

I'm having second thoughts about the implementation. While I think the for ( : ) syntax is nice, I kind of hate that we have to define two classes for it. I am thinking that maybe we can get a simpler implementation with only one class.

Maybe something that would need to be used this way though:

for (Bresenham b = Bresenham(1000, 500); !b.is_end(); b.next()) {
    Vector2i point = b.value();
}

What do you think?

If you want to avoid using two classes, I would prefer the following:

Iterable<BresenhamIterator> bresenham_iter(Vector2i p_from, Vector2i p_to) {
    return Iterable<BresenhamIterator>(BresenhamIterator(p_from, p_to), BresenhamIterator());
}

I introduced Iterable basically for the purpose of "I have two iterator instances, and I want to use C++ syntax to iterate across them", which would be your use case here, I think.
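In spirit, such an Iterable is just a pair of iterators exposed through begin()/end() so that range-for works. A generic sketch (this is not Godot's actual implementation, and `CountIterator`/`count_to` are made-up example names):

```cpp
#include <cassert>

// Minimal generic wrapper: holds a begin and an end iterator so that
// range-for can drive any iterator pair. Not Godot's actual Iterable.
template <typename T>
class Iterable {
	T _begin, _end;

public:
	Iterable(T p_begin, T p_end) :
			_begin(p_begin), _end(p_end) {}
	T begin() const { return _begin; }
	T end() const { return _end; }
};

// Toy iterator counting upward, just to show the wrapper in use.
class CountIterator {
	int value = 0;

public:
	CountIterator() = default;
	explicit CountIterator(int p_value) : value(p_value) {}
	int operator*() const { return value; }
	CountIterator &operator++() {
		++value;
		return *this;
	}
	bool operator!=(const CountIterator &p_o) const { return value != p_o.value; }
};

// Yields 0, 1, ..., p_n - 1 when used in a range-for loop.
Iterable<CountIterator> count_to(int p_n) {
	return Iterable<CountIterator>(CountIterator(0), CountIterator(p_n));
}
```

A Bresenham version would return `Iterable<BresenhamIterator>(BresenhamIterator(from, to), BresenhamIterator())`, with the default-constructed iterator acting as the end sentinel.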

@groud
Member Author

groud commented Nov 14, 2025

I introduced Iterable basically for the purpose of "I have two iterator instances, and I want to use C++ syntax to iterate across them". Which would be your use-case here, I think.

Oooh, I had no clue we had that. Yeah, that seems like a good plan, I'll have a look.

@Ivorforce
Member

Ivorforce commented Nov 14, 2025

Actually, I said that with the expectation of continuing to use the C++ iterator syntax.
C++ iteration syntax is actually really weird in that it requires both a begin and an end. This doesn't make much sense for Bresenham, so maybe your proposed solution would be better (fewer wasted variables).
I've wanted to look into ways to avoid this problem for some time, but haven't gotten around to it yet.

@groud
Member Author

groud commented Nov 14, 2025

Actually, I said that with the expectation of continuing to use the C++ iterator syntax. C++ iteration syntax is actually really weird in that it requires both a begin and an end. This doesn't make much sense for Bresenham, so maybe your proposed solution would be better (fewer wasted variables).

Yeah, I do agree. I think an additional variable is fine; the main problem IMO is that the two nested classes make the code quite hard to read, and it's a bit too much additional code just to avoid an allocation. So if I can at least shrink it a bit, that would be nice.

@groud groud force-pushed the optimize_bresenham branch from ff55c33 to 480398f on November 14, 2025 11:28
@groud
Member Author

groud commented Nov 14, 2025

Alright, I updated the code: we went from 59 added LoC to 35, and we have a single class now. I think it looks better.

@groud groud force-pushed the optimize_bresenham branch from 480398f to 0da7c5f on November 14, 2025 12:44