PoC: multithreaded model CPU computation #1833
Conversation
You can ignore the CI being unhappy, it's just a PoC I'm sharing so we can discuss it.
That's an impressive FPS statistic. How many threads did it use? Someone said that you have to link some additional library (which isn't GOMP) to activate the parallelism in standard C++17.
That sounds unpleasant to deal with. I don't want my code to suddenly be multithreaded when I didn't ask! Make sure to keep C++14 for all gamelogic build targets. It would be annoying to accidentally use C++17 features while developing against the DLL gamelogic, only to find out they are not allowed on the CI. Also, if GOMP is used rather than the standard C++17 facilities, I bet you don't need C++17 at all.
That's not a builtin, just a namespace for the GOMP-based STL algorithm implementations.
You mean it uses a small number of large pieces of work, instead of a large number of small ones? In that case, I imagine it has some fixed estimate of the size of each piece of work, and reasons that if the count is small, it is a waste of time to start up the threads. You would probably need to give it a hint that each iteration takes a long time. In any case, I guess the thread management must be pretty efficient if you got that much speedup.
I used 16 threads on a 32-thread CPU. Of course that's a bit silly, because machines running this code by default are likely to have fewer threads, but it shows very well that multithreading this code actually matters, and that the performance actually scales with threads. It is also good to know that beyond some number of threads, the thread management is so heavy that it starts slowing things down, to the point where it becomes even slower than without threading, which is expected with many threads doing small pieces of work. It also brings the computer to its knees for doing nothing.
With
Of course, and in fact that's the reason why I then turned to
Possible, but that's still GNU-specific, and probably painful to get working with MSVC.
I tried with PBB but didn't get it working yet (it builds, but my code still runs sequentially). It's also possible I set up PBB correctly but forgot something else. That doesn't solve the problem of adding a dependency, though.
Yes.
I suspect it doesn't start a thread per vertex, or performance would be much worse; or maybe it actually does when I don't cap the thread count, and builds batches when I do cap it. I want to get control over the batching to test how it performs on machines with few threads, like dual-core or quad-core CPUs. When testing that branch on low-end machines with only a few cores, the framerate bump is not that good, but even something like a 5 fps bump is worthwhile when it means going from 25 fps to 30 fps, for example. That's always good to get and can turn some machine test results from
The experiment was a success. I'm closing this and will submit a completed and cleaned-up branch later.
Also a side note:
Yes, the C++17 thing was from a previous unpublished experiment attempting to use
This is a proof of concept of multithreaded model CPU computation, for when using `r_vboModel 0`, or when some low-end hardware limitation forces the engine to use that code.

When testing the same scene on the same computer with the same environment:
- before: 91 fps
- after: 388 fps

At first I was looking at processing all models in parallel, running all `Tess_SurfaceIQM()` calls in parallel with `Tess_SurfaceIQM()` itself unmodified, but then I discovered that we process models as if they were surfaces: we not only compute models one after another (something I knew), we in fact compute and render every model one after another, so it's probably hard to break that sequence.

So I attempted to multithread the `Tess_SurfaceIQM()` call itself. Here is a proof of concept that works. It has the drawback of requiring OpenMP, and thus linking against `libgomp`, which adds a dependency.

Also, right now I use some GCC builtin to parallelize only that code, because when I use `std::for_each( std::execution::par, … )` it is still sequential even if I link against OpenMP (maybe I'm missing something?).

I also have a branch that builds with `-D_GLIBCXX_PARALLEL`, which makes GCC automatically replace all parallelizable standard features (like `std::sort`) with parallel variants, but this requires many workarounds because it's not 100% compatible with our code. I wrote the workarounds, by the way, but they use some GCC intrinsics as well. I can push that branch if someone is interested.

I also have another branch that splits the model into chunks, so the `for_each` call only iterates over (and supposedly multithreads) chunks of the model. I was hoping that would help get better performance, because dealing with threads can be time-consuming, and I was hoping to better control how the work is multithreaded this way, but curiously OpenMP only spawns one thread when I do this. This alternate implementation may be interesting to us anyway, because then all we would need to avoid the dependency on OpenMP would be to write our own dispatch. I can push that branch as well.

Something very ugly in this PoC is the stupid HACK I use to get an iterator that returns the index. That's a PoC anyway.