Conversation

@illwieckz (Member) commented Sep 30, 2025

This is a proof of concept of multithreaded CPU model computation, for when r_vboModel 0 is used or when some low-end hardware limitation forces the engine to use that code path.

When testing the same scene on the same computer with the same environment:

| Before | After   |
|--------|---------|
| 91 fps | 388 fps |

At first I was looking at processing all models in parallel, running all the Tess_SurfaceIQM() calls in parallel with Tess_SurfaceIQM() itself unmodified, but then I discovered that we process models as if they were surfaces: we not only compute models one after another (something I knew), we in fact compute and render every model one after another, so it's probably hard to break that sequence.

So I attempted to multithread the Tess_SurfaceIQM() call itself. Here is a proof of concept that works.

It has the drawback of requiring OpenMP and therefore linking against libgomp, which adds a dependency.

Also, right now I use some GCC builtin to parallelize only that code, because when I use std::for_each( std::execution::par, … ) it still runs sequentially even if I link against OpenMP (maybe I'm missing something?).
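
For illustration, here is a minimal sketch of both variants over a hypothetical per-vertex loop; skinVertex() and numVertexes are placeholders standing in for the real work done in Tess_SurfaceIQM(), not actual engine code:

```cpp
#include <algorithm>
#include <execution>          // std::execution::par (C++17)
#include <numeric>
#include <parallel/algorithm> // __gnu_parallel::for_each (GCC parallel mode, needs -fopenmp)
#include <vector>

// Placeholder for the per-vertex skinning work.
static void skinVertex(int /*vertexIndex*/)
{
    // ... transform one vertex by its bone matrices ...
}

static void skinAllVertexes(int numVertexes)
{
    // Naive "iterator that returns the index": just materialize the index list.
    std::vector<int> indexes(numVertexes);
    std::iota(indexes.begin(), indexes.end(), 0);

    // Standard C++17 parallel algorithm: with libstdc++ this only really runs
    // in parallel when the program is linked against TBB, otherwise it silently
    // falls back to sequential execution.
    std::for_each(std::execution::par, indexes.begin(), indexes.end(), skinVertex);

    // GCC-specific parallel-mode algorithm, backed by OpenMP/libgomp.
    __gnu_parallel::for_each(indexes.begin(), indexes.end(), skinVertex);
}
```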

I also have a branch that builds with -D_GLIBCXX_PARALLEL, which makes GCC automatically replace all parallelizable standard features (like std::sort) with parallel variants, but this requires many workarounds because it's not 100% compatible with our code. I already wrote the workarounds, by the way, but they use some GCC intrinsics as well. I can push that branch if someone is interested.

I also have another branch that splits the model into chunks, so the for_each call only iterates over (and supposedly multithreads) chunks of the model. I was hoping that would give better performance because dealing with threads can be time-consuming, and I hoped to better control how the work is multithreaded this way, but curiously OpenMP only spawns one thread when I do this. This alternate implementation may be interesting to us anyway, because then all we would need to avoid the OpenMP dependency is to write our own dispatch. I can push that branch as well.
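
Here is a minimal sketch of what that chunked variant looks like, again with a hypothetical skinVertexRange() placeholder rather than the actual engine code:

```cpp
#include <algorithm>
#include <parallel/algorithm> // __gnu_parallel::for_each (GCC parallel mode, needs -fopenmp)
#include <vector>

// Placeholder for skinning the vertexes in [first, last).
static void skinVertexRange(int first, int last)
{
    // ... per-vertex work for this range ...
    (void)first; (void)last;
}

static void skinAllVertexesChunked(int numVertexes, int numChunks)
{
    struct Chunk { int first, last; };

    // Split the model into a small number of large chunks.
    std::vector<Chunk> chunks;
    const int chunkSize = (numVertexes + numChunks - 1) / numChunks;
    for (int first = 0; first < numVertexes; first += chunkSize)
    {
        chunks.push_back({ first, std::min(first + chunkSize, numVertexes) });
    }

    // With only a handful of elements to iterate over, the parallel-mode
    // heuristics may decide the range is too small to be worth threading,
    // which would explain why only one thread gets spawned.
    __gnu_parallel::for_each(chunks.begin(), chunks.end(),
                             [](const Chunk& c) { skinVertexRange(c.first, c.last); });
}
```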

Something very ugly in that PoC is the stupid hack I use to get an iterator that returns the index. It's only a PoC anyway.

@illwieckz marked this pull request as draft on September 30, 2025 17:53
@illwieckz force-pushed the illwieckz/poc-multithread-cpu-model-gnu branch from 7a9a561 to debbf30 on September 30, 2025 18:04
@illwieckz (Member Author) commented:

You can ignore the CI being unhappy; it's just a PoC I'm sharing so we can discuss it.

@slipher (Member) commented Sep 30, 2025

That's an impressive FPS statistic. How many threads did it use?

Someone says that you have to link in some additional library (which isn't GOMP) to activate the parallelism in standard C++17.

> I also have a branch that builds with -D_GLIBCXX_PARALLEL, which makes GCC automatically replace all parallelizable standard features (like std::sort) with parallel variants, but this requires many workarounds because it's not 100% compatible with our code.

That sounds unpleasant to deal with. I don't want my code to be suddenly multithreaded when I didn't ask!

Make sure to keep C++14 for all gamelogic build targets. It would be annoying to accidentally use C++17 features while developing against the DLL gamelogic, only to find out they are not allowed on the CI. Also if GOMP is used rather than the C++17 standard stuff, I bet that you don't need C++17 at all.

> Also, right now I use some GCC builtin to parallelize only that code,

That's not a builtin, just a namespace for the GOMP-based STL algorithm implementations.

> I also have another branch that splits the model into chunks, so the for_each call only iterates over (and supposedly multithreads) chunks of the model. I was hoping that would give better performance because dealing with threads can be time-consuming, and I hoped to better control how the work is multithreaded this way, but curiously OpenMP only spawns one thread when I do this.

You mean it uses a small number of large pieces of work, instead of a large number of small ones? In that case, I imagine that it has some fixed estimate of the size of each piece of work, and reasons that if the number is small, it is a waste of time to start up the threads. You would probably need to give it some hint for it to know that each iteration will take a long time. In any case, I guess the thread management must be pretty efficient if you got that much speedup.
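
For what it's worth, the GCC parallel mode exposes such hints through <parallel/settings.h>; here is a minimal sketch, worth double-checking against the installed libstdc++:

```cpp
#include <parallel/settings.h>

static void forceParallelAlgorithms()
{
    // The parallel-mode algorithms use heuristics (minimal element counts per
    // algorithm) below which they stay sequential; with only a handful of
    // large chunks that threshold is never reached unless parallelism is forced.
    __gnu_parallel::_Settings settings = __gnu_parallel::_Settings::get();
    settings.algorithm_strategy = __gnu_parallel::force_parallel;
    __gnu_parallel::_Settings::set(settings);
}
```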

@illwieckz (Member Author) commented Sep 30, 2025

> That's an impressive FPS statistic. How many threads did it use?

I used 16 threads on a 32-thread CPU. Of course that's a bit silly because machines using that code path by default are likely to have fewer threads, but doing so shows very well that multithreading this code actually matters, and that the performance actually scales with the number of threads.

It is also good to know that past a certain number of threads, the thread management becomes so heavy that it starts slowing things down, to the point of being even slower than without threading, which is expected with a lot of threads doing small pieces of work. It also brings the computer to its knees for doing nothing.

| threads         | 1  | 2   | 4   | 6   | 8   | 10  | 12  | 14  | 16  | 18  | 24  | 28  | 32 |
|-----------------|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----|
| framerate (fps) | 91 | 150 | 237 | 292 | 323 | 341 | 350 | 361 | 388 | 344 | 257 | 113 | 2  |

With -D_GLIBCXX_PARALLEL and std::for_each( std::execution::par, … ) at 32 threads I get ~90 fps, the same as with a single thread (but the computer is brought to its knees), while with __gnu_parallel::for_each() at 32 threads it drops to around 1 fps…

> That sounds unpleasant to deal with. I don't want my code to be suddenly multithreaded when I didn't ask!

Of course, and in fact that's the reason why I then turned to __gnu_parallel::for_each() and why I cap the number of threads: otherwise everything becomes super slow because it spawns 32 threads on my machine for everything. Even the pak loading would spawn 32 threads and gets super slow when doing so.
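
For reference, one way to apply such a cap is through the OpenMP runtime, something along these lines (the value 16 is just the cap used in this test):

```cpp
#include <algorithm>
#include <omp.h>

static void capOpenMPThreads()
{
    // Use at most 16 threads, or fewer if the machine exposes fewer of them.
    // Equivalent to setting the OMP_NUM_THREADS environment variable.
    const int cap = std::min(16, omp_get_max_threads());
    omp_set_num_threads(cap);
}
```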

> That's not a builtin, just a namespace for the GOMP-based STL algorithm implementations.

Possibly, but that's still GNU-specific, and probably painful to get working with MSVC.

> Someone says that you have to link in some additional library (which isn't GOMP) to activate the parallelism in standard C++17.

I tried with TBB but didn't get it working yet (it builds, but my code still runs sequentially). It's also possible I got the TBB part right but forgot something else. That doesn't solve the problem of adding a dependency though.

> You mean it uses a small number of large pieces of work, instead of a large number of small ones?

Yes.

> You would probably need to give it some hint for it to know that each iteration will take a long time. In any case, I guess the thread management must be pretty efficient if you got that much speedup.

I suspect it doesn't start a thread per vertex, or it would be far worse; or maybe it actually does when I don't cap the thread count, and builds batches when I do cap it.

I want to get control over the batching to test how it performs on machines with few threads, like dual-core or quad-core CPUs. When testing that branch on low-end machines with only a few cores, the framerate bump is not that large, but even something like a 5 fps bump is worthwhile when it means going from 25 fps to 30 fps, for example. That's always good to get and can turn some machine test results from slow to playable, or from playable to passed.
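
As one possible shape for that custom dispatch (only a sketch, not what the existing branch does), the chunks could be handed to plain std::thread workers, which would also remove the libgomp dependency:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Placeholder for skinning the vertexes in [first, last).
static void skinVertexRange(int first, int last)
{
    (void)first; (void)last;
}

static void skinAllVertexesWithThreads(int numVertexes, unsigned maxThreads)
{
    // Never use more threads than the hardware exposes, or than the caller allows.
    const unsigned hardware = std::max(1u, std::thread::hardware_concurrency());
    const unsigned numThreads = std::max(1u, std::min(maxThreads, hardware));
    const int chunkSize = (numVertexes + int(numThreads) - 1) / int(numThreads);

    std::vector<std::thread> workers;
    for (int first = 0; first < numVertexes; first += chunkSize)
    {
        const int last = std::min(first + chunkSize, numVertexes);
        workers.emplace_back(skinVertexRange, first, last);
    }

    for (std::thread& worker : workers)
    {
        worker.join();
    }
}
```

In practice a persistent worker pool would be preferable, since spawning and joining threads every frame is exactly the kind of overhead the chunking is meant to reduce.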

@illwieckz force-pushed the illwieckz/poc-multithread-cpu-model-gnu branch 4 times, most recently from eb4f225 to 3731e39 on October 1, 2025 00:25
@illwieckz (Member Author) commented:

The experiment was a success. I'm closing this and will submit a completed and cleaned-up branch later.

@illwieckz closed this on Oct 1, 2025
@illwieckz deleted the illwieckz/poc-multithread-cpu-model-gnu branch on October 1, 2025 19:33
@illwieckz (Member Author) commented:

Also a side note:

> Also if GOMP is used rather than the C++17 standard stuff, I bet that you don't need C++17 at all.

Yes, the C++17 thing was from a previous unpublished experiment attempting to use std::for_each( std::execution::par, … ) with libtbb.
