Conversation

@illwieckz (Member) commented Sep 30, 2025

This is a proof of concept of multithreaded CPU model computation, for when r_vboModel 0 is used or when some low-end hardware limitation forces the engine to use that code path.

When testing the same scene on the same computer with the same environment:

| Before | After   |
|--------|---------|
| 91 fps | 388 fps |

At first I was looking at processing all models in parallel, running all the Tess_SurfaceIQM() calls in parallel with Tess_SurfaceIQM() itself unmodified, but then I discovered that we process models as if they were surfaces: we not only compute models one after another (something I knew), we in fact compute and render every model one after another, so it's probably hard to break that sequence.

So I attempted to multithread the Tess_SurfaceIQM() call itself. Here is a proof of concept that works.

It has the drawback of requiring OpenMP and therefore linking against libgomp, which adds a dependency.

Also, right now I use some GCC builtin to parallelize only that code, because when I use std::for_each( std::execution::par, … ) it still runs sequentially even if I link against OpenMP (maybe I'm missing something?).
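
For illustration, here is a minimal sketch of both variants over a hypothetical per-vertex loop; skinVertex() and numVertexes are placeholders standing in for the real work done in Tess_SurfaceIQM(), not actual engine code:

```cpp
#include <algorithm>
#include <execution>          // std::execution::par (C++17)
#include <numeric>
#include <parallel/algorithm> // __gnu_parallel::for_each (GCC parallel mode, needs -fopenmp)
#include <vector>

// Placeholder for the per-vertex skinning work.
static void skinVertex(int /*vertexIndex*/)
{
    // ... transform one vertex by its bone matrices ...
}

static void skinAllVertexes(int numVertexes)
{
    // Naive "iterator that returns the index": just materialize the index list.
    std::vector<int> indexes(numVertexes);
    std::iota(indexes.begin(), indexes.end(), 0);

    // Standard C++17 parallel algorithm: with libstdc++ this only really runs
    // in parallel when the program is linked against TBB, otherwise it silently
    // falls back to sequential execution.
    std::for_each(std::execution::par, indexes.begin(), indexes.end(), skinVertex);

    // GCC-specific parallel-mode algorithm, backed by OpenMP/libgomp.
    __gnu_parallel::for_each(indexes.begin(), indexes.end(), skinVertex);
}
```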

I also have a branch that builds with -D_GLIBCXX_PARALLEL, which makes GCC automatically replace all parallelizable standard features (like std::sort) with parallel variants, but this requires many workarounds because it's not 100% compatible with our code. I already wrote the workarounds, by the way, but they use some GCC intrinsics as well. I can push that branch if someone is interested.

I also have another branch that splits the model into chunks, so the for_each call only iterates over (and supposedly multithreads) chunks of the model. I was hoping that would give better performance because dealing with threads can be time-consuming, and I hoped to better control how the work is multithreaded this way, but curiously OpenMP only spawns one thread when I do this. This alternate implementation may be interesting to us anyway, because then all we would need to avoid the OpenMP dependency is to write our own dispatch. I can push that branch as well.
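
Here is a minimal sketch of what that chunked variant looks like, again with a hypothetical skinVertexRange() placeholder rather than the actual engine code:

```cpp
#include <algorithm>
#include <parallel/algorithm> // __gnu_parallel::for_each (GCC parallel mode, needs -fopenmp)
#include <vector>

// Placeholder for skinning the vertexes in [first, last).
static void skinVertexRange(int first, int last)
{
    // ... per-vertex work for this range ...
    (void)first; (void)last;
}

static void skinAllVertexesChunked(int numVertexes, int numChunks)
{
    struct Chunk { int first, last; };

    // Split the model into a small number of large chunks.
    std::vector<Chunk> chunks;
    const int chunkSize = (numVertexes + numChunks - 1) / numChunks;
    for (int first = 0; first < numVertexes; first += chunkSize)
    {
        chunks.push_back({ first, std::min(first + chunkSize, numVertexes) });
    }

    // With only a handful of elements to iterate over, the parallel-mode
    // heuristics may decide the range is too small to be worth threading,
    // which would explain why only one thread gets spawned.
    __gnu_parallel::for_each(chunks.begin(), chunks.end(),
                             [](const Chunk& c) { skinVertexRange(c.first, c.last); });
}
```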

Something very ugly in that PoC is the stupid hack I use to get an iterator that returns the index. It's only a PoC anyway.

@illwieckz marked this pull request as draft on September 30, 2025 17:53
@illwieckz force-pushed the illwieckz/poc-multithread-cpu-model-gnu branch from 7a9a561 to debbf30 on September 30, 2025 18:04
@illwieckz (Member Author) commented:

You can ignore the CI being unhappy; it's just a PoC I'm sharing so we can discuss it.

@slipher (Member) commented Sep 30, 2025

That's an impressive FPS statistic. How many threads did it use?

Someone says that you have to link in some additional library (which isn't GOMP) to activate the parallelism in standard C++17.

> I also have a branch that builds with -D_GLIBCXX_PARALLEL, which makes GCC automatically replace all parallelizable standard features (like std::sort) with parallel variants, but this requires many workarounds because it's not 100% compatible with our code.

That sounds unpleasant to deal with. I don't want my code to be suddenly multithreaded when I didn't ask!

Make sure to keep C++14 for all gamelogic build targets. It would be annoying to accidentally use C++17 features while developing against the DLL gamelogic, only to find out they are not allowed on the CI. Also if GOMP is used rather than the C++17 standard stuff, I bet that you don't need C++17 at all.

> Also, right now I use some GCC builtin to parallelize only that code,

That's not a builtin, just a namespace for the GOMP-based STL algorithm implementations.

> I also have another branch that splits the model into chunks, so the for_each call only iterates over (and supposedly multithreads) chunks of the model. I was hoping that would give better performance because dealing with threads can be time-consuming, and I hoped to better control how the work is multithreaded this way, but curiously OpenMP only spawns one thread when I do this.

You mean it uses a small number of large pieces of work, instead of a large number of small ones? In that case, I imagine that it has some fixed estimate of the size of each piece of work, and reasons that if the number is small, it is a waste of time to start up the threads. You would probably need to give it some hint for it to know that each iteration will take a long time. In any case, I guess the thread management must be pretty efficient if you got that much speedup.
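
For what it's worth, the GCC parallel mode exposes such hints through <parallel/settings.h>; here is a minimal sketch, worth double-checking against the installed libstdc++:

```cpp
#include <parallel/settings.h>

static void forceParallelAlgorithms()
{
    // The parallel-mode algorithms use heuristics (minimal element counts per
    // algorithm) below which they stay sequential; with only a handful of
    // large chunks that threshold is never reached unless parallelism is forced.
    __gnu_parallel::_Settings settings = __gnu_parallel::_Settings::get();
    settings.algorithm_strategy = __gnu_parallel::force_parallel;
    __gnu_parallel::_Settings::set(settings);
}
```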

@illwieckz (Member Author) commented Sep 30, 2025

> That's an impressive FPS statistic. How many threads did it use?

I used 16 threads on a 32-thread CPU. Of course that's a bit silly because machines using that code path by default are likely to have fewer threads, but doing so shows very well that multithreading this code actually matters, and that the performance actually scales with the number of threads.

It is also good to know that past a certain number of threads, the thread management becomes so heavy that it starts slowing things down, to the point of being even slower than without threading, which is expected with a lot of threads doing small pieces of work. It also brings the computer to its knees for doing nothing.

| threads         | 1  | 2   | 4   | 6   | 8   | 10  | 12  | 14  | 16  | 18  | 24  | 28  | 32 |
|-----------------|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|----|
| framerate (fps) | 91 | 150 | 237 | 292 | 323 | 341 | 350 | 361 | 388 | 344 | 257 | 113 | 2  |

With -D_GLIBCXX_PARALLEL and std::for_each( std::execution::par, … ) at 32 threads I get ~90 fps, the same as with a single thread (but the computer is brought to its knees), while with __gnu_parallel::for_each() at 32 threads it drops to around 1 fps…

> That sounds unpleasant to deal with. I don't want my code to be suddenly multithreaded when I didn't ask!

Of course, and in fact that's the reason why I then turned to __gnu_parallel::for_each() and why I cap the number of threads: otherwise everything becomes super slow because it spawns 32 threads on my machine for everything. Even the pak loading would spawn 32 threads and gets super slow when doing so.
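
For reference, one way to apply such a cap is through the OpenMP runtime, something along these lines (the value 16 is just the cap used in this test):

```cpp
#include <algorithm>
#include <omp.h>

static void capOpenMPThreads()
{
    // Use at most 16 threads, or fewer if the machine exposes fewer of them.
    // Equivalent to setting the OMP_NUM_THREADS environment variable.
    const int cap = std::min(16, omp_get_max_threads());
    omp_set_num_threads(cap);
}
```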

> That's not a builtin, just a namespace for the GOMP-based STL algorithm implementations.

Possibly, but that's still GNU-specific, and probably painful to get working with MSVC.

> Someone says that you have to link in some additional library (which isn't GOMP) to activate the parallelism in standard C++17.

I tried with TBB but didn't get it working yet (it builds, but my code still runs sequentially). It's also possible I got the TBB part right but forgot something else. That doesn't solve the problem of adding a dependency though.

> You mean it uses a small number of large pieces of work, instead of a large number of small ones?

Yes.

> You would probably need to give it some hint for it to know that each iteration will take a long time. In any case, I guess the thread management must be pretty efficient if you got that much speedup.

I suspect it doesn't start a thread per vertex, or it would be far worse; or maybe it actually does when I don't cap the thread count, and builds batches when I do cap it.

I want to get control over the batching to test how it performs on machines with few threads, like dual-core or quad-core CPUs. When testing that branch on low-end machines with only a few cores, the framerate bump is not that large, but even something like a 5 fps bump is worthwhile when it means going from 25 fps to 30 fps, for example. That's always good to get and can turn some machine test results from slow to playable, or from playable to passed.
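
As one possible shape for that custom dispatch (only a sketch, not what the existing branch does), the chunks could be handed to plain std::thread workers, which would also remove the libgomp dependency:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Placeholder for skinning the vertexes in [first, last).
static void skinVertexRange(int first, int last)
{
    (void)first; (void)last;
}

static void skinAllVertexesWithThreads(int numVertexes, unsigned maxThreads)
{
    // Never use more threads than the hardware exposes, or than the caller allows.
    const unsigned hardware = std::max(1u, std::thread::hardware_concurrency());
    const unsigned numThreads = std::max(1u, std::min(maxThreads, hardware));
    const int chunkSize = (numVertexes + int(numThreads) - 1) / int(numThreads);

    std::vector<std::thread> workers;
    for (int first = 0; first < numVertexes; first += chunkSize)
    {
        const int last = std::min(first + chunkSize, numVertexes);
        workers.emplace_back(skinVertexRange, first, last);
    }

    for (std::thread& worker : workers)
    {
        worker.join();
    }
}
```

In practice a persistent worker pool would be preferable, since spawning and joining threads every frame is exactly the kind of overhead the chunking is meant to reduce.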

@illwieckz force-pushed the illwieckz/poc-multithread-cpu-model-gnu branch 4 times, most recently from eb4f225 to 3731e39 on October 1, 2025 00:25
@illwieckz (Member Author) commented:

The experiment was a success. I'm closing this and will submit a completed and cleaned-up branch later.

@illwieckz closed this on Oct 1, 2025
@illwieckz deleted the illwieckz/poc-multithread-cpu-model-gnu branch on October 1, 2025 19:33
@illwieckz (Member Author) commented:

Also a side note:

> Also if GOMP is used rather than the C++17 standard stuff, I bet that you don't need C++17 at all.

Yes, the C++17 thing was from a previous unpublished experiment attempting to use std::for_each( std::execution::par, … ) with libtbb.
