Metal enqueue() advantage #2232
Shouldn't applications that use Vulkan submit command buffers as soon as they are recorded? I have no idea what Dota records in thousands of command buffers.
I looked a bit more at the Metal System Trace numbers, added more stats to our backend locally to estimate the total pipeline length, and the numbers now match up and make more sense to me. In an average frame, Dota submits 34 immediate command buffers in two fenced batches (thus, 2 completion handlers and 2 extra temporary command buffers). The total number of active command buffers in our pool is around 880, from which we can conclude the pipeline length to be roughly 22 frames.
As an experiment, I forced deferred command buffer recording in gfx-portability and got a steady 15% performance boost (from 72 to 85 FPS). Metal System Trace shows the GPU work nicely overlapping with recording, as predicted in the subject of this issue. We might want to focus on optimizing this path a bit if it's so beneficial for some cases. Interestingly, there are short periods (~0.5 sec) where the FPS jumps up significantly (not just single frames). Perhaps what limits us here is some sort of contention on device access (e.g. for descriptor allocation and writing) that goes away at times because Dota figures out how to re-use existing descriptors.
Looking at Metal System Trace, I realized the real value of enqueuing command buffers earlier than submitting them. It's barely documented, but the resulting change in driver behavior is drastic.
Theory
When a command buffer is enqueued, then once we call endEncoding on a pass, the pass gets instantly handed down to the driver, since the driver knows it doesn't need to wait for anything and is expecting the work. Consequently, the GPU starts chewing on the work right away. Thus, commit() becomes a simple message saying "I'm done with this", leaving the submission queue free for other things to use. Basically, I call this paragraph BS:
Now, let's look at what happens if we don't enqueue anything explicitly:
Sounds pretty harmless, doesn't it? Well, what really happens is that the driver doesn't want to do anything with our encoded passes until the command buffer gets committed. The passes get stacked on the command buffer internally (much like our software commands) and then dropped like a bomb on the driver upon the commit() call.
Practice
Let's get more concrete. Suppose gfx-portability spends X amount of time recording an application's (one-time) command buffer. The driver does some work too, but it can propagate commands to the GPU gradually and doesn't take longer than the GPU itself, so we only need to account for its latency L. Finally, the GPU takes G amount of time to finish the work on those commands. Let's see how the work flows:
Total time: X + L + G
Encoding thread time: X
Submission thread time: ~0
Now, let's look at MoltenVK:
Total time: 0.5X + max(0.75X, L + G)
Encoding thread time: 0.5X
Submission thread time: 0.75X
See what happened here? There is more work in total, but it's spread over threads, and the frame actually completes faster because the GPU gets work to chew on earlier. X here logically extends to the total recording time of a frame (instead of a single command buffer), since it's the submission cut-off that matters, and you can see how this can drastically affect performance in the end.
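To make the arithmetic above concrete, here's a toy timing model of the two schedules. The 0.5X/0.75X split for the MoltenVK-style path mirrors the figures in the text; the sample values for X, L, and G below are illustrative, not measured:

```python
# Toy model of the two submission strategies.
# X = CPU time to record the frame's command buffers,
# L = driver latency before the GPU can start,
# G = GPU time to execute the recorded work.

def immediate_total(x, l, g):
    # Encode natively, commit at the end: the GPU only starts once
    # recording is finished, so the frame takes X + L + G.
    return x + l + g

def deferred_total(x, l, g):
    # MoltenVK-style: record software commands (0.5X on the encoding
    # thread), then translate and submit on another thread (0.75X).
    # Because buffers are enqueued early, the GPU overlaps with the
    # translation: 0.5X + max(0.75X, L + G).
    return 0.5 * x + max(0.75 * x, l + g)

# Illustrative numbers in milliseconds (not from any real trace).
x, l, g = 8.0, 1.0, 6.0
print(immediate_total(x, l, g))  # 15.0
print(deferred_total(x, l, g))   # 4.0 + max(6.0, 7.0) = 11.0
```

Even though the deferred path does more CPU work in total (1.25X vs. X), the overlap with the GPU makes the end-to-end time shorter whenever L + G is comparable to X.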
Solutions
In an ideal world, Vulkan would have some sort of API to tell the driver (earlier than at submission time) which order the one-time encoded command buffers are going to be submitted in. This isn't going to happen, though.
A more practical alternative would be to force deferred command buffer recording on our side and see how this affects frame scheduling. This would technically zero out one of our major advantages, and it would become a race over whose software command buffers are lighter.
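For reference, a minimal sketch of what "deferred recording" means here, as a pure-Python stand-in for software command buffers (the `DeferredCommandBuffer` and `TraceBackend` names are hypothetical, not from either codebase):

```python
class DeferredCommandBuffer:
    """Software command buffer: record cheap descriptions now,
    translate them to the native API all at once at submit time."""

    def __init__(self):
        self.commands = []

    def record(self, name, *args):
        # Recording only stores a description; no driver calls yet.
        self.commands.append((name, args))

    def replay(self, backend):
        # At submission time the whole buffer is translated back-to-back,
        # so native buffers can be enqueued and committed immediately.
        for name, args in self.commands:
            getattr(backend, name)(*args)

class TraceBackend:
    """Hypothetical stand-in for the native Metal encoder."""
    def __init__(self):
        self.log = []
    def draw(self, n):
        self.log.append(f"draw({n})")
    def end_encoding(self):
        self.log.append("end_encoding()")

cb = DeferredCommandBuffer()
cb.record("draw", 3)
cb.record("end_encoding")
backend = TraceBackend()
cb.replay(backend)
print(backend.log)  # ['draw(3)', 'end_encoding()']
```

The race mentioned above is then over how light `record` is compared to the other implementation's equivalent, since it runs on the application's encoding thread.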
Finally, on the application side, we'd benefit from more granular submissions. Dota records about a thousand command buffers, but only submits them in 2 chunks per frame, so we are being delayed by roughly a quarter of the frame time here.
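The quarter-frame figure follows from a simple averaging argument: with k evenly spaced submission chunks over a frame that takes F to record, a command recorded at a uniformly random point waits F/(2k) on average for the next submission boundary, which is F/4 for k = 2. A sketch (the frame times are illustrative):

```python
# Average wait from a uniformly random recording time in [0, F) to the
# next of k evenly spaced submission points: F / (2 * k).
def average_submission_delay(frame_time, chunks):
    return frame_time / (2 * chunks)

print(average_submission_delay(16.6, 2))  # 4.15  -- a quarter frame
print(average_submission_delay(16.6, 8))  # 1.0375 -- finer chunks help
```

So simply submitting in more, smaller batches shrinks the dead time before the GPU can start, even without any driver-side changes.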