-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[meta] Dota performance issues on Metal #2161
Comments
Other ideas:
|
2164: [mtl] Borrowed commands r=grovesNL a=kvark PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: r? @gfx-rs/metallists This PR attempts to have lightweight software commands that don't take any heap space or own ObjC objects. In most cases, where a command list is live-recorded and executed once, this should reduce the amount of work we do per command, which is especially important if those commands are thrown away (e.g. because we are not inside a render pass). My expectation would be to see an improvement in #2161 due to us doing less work. The actual results are somewhat shocking: with v-sync enabled I'm getting the same 59-60 fps as usual. With v-sync OFF, I'm getting between 25 and 50 fps now (which is lower than the previous 50-70). Not sure what's going on, the instrumental profile doesn't give a clue. Please check out the code. Co-authored-by: Dzmitry Malyshau <[email protected]>
2164: [mtl] Borrowed commands r=grovesNL a=kvark PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: r? @gfx-rs/metallists This PR attempts to have lightweight software commands that don't take any heap space or own ObjC objects. In most cases, where a command list is live-recorded and executed once, this should reduce the amount of work we do per command, which is especially important if those commands are thrown away (e.g. because we are not inside a render pass). My expectation would be to see an improvement in #2161 due to us doing less work. The actual results are somewhat shocking: with v-sync enabled I'm getting the same 59-60 fps as usual. With v-sync OFF, I'm getting between 25 and 50 fps now (which is lower than the previous 50-70). Not sure what's going on, the instrumental profile doesn't give a clue. Please check out the code. Co-authored-by: Dzmitry Malyshau <[email protected]>
2071: Remaping descriptor sets in the gl backend r=kvark a=ZeGentzy I'll rebase off master when I'm done. Uniforms in gl only have a bindings field, not a set one. This means that for shaders that use multiple sets to work, we must change where we are binding them. See page 14 for what I mean: https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/4-Using-spir-v-with-spirv-cross.pdf PR checklist: - [ ] `make` succeeds (on *nix) - [ ] `make reftests` succeeds - [ ] tested examples with the following backends: 2164: [mtl] Borrowed commands r=grovesNL a=kvark PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: r? @gfx-rs/metallists This PR attempts to have lightweight software commands that don't take any heap space or own ObjC objects. In most cases, where a command list is live-recorded and executed once, this should reduce the amount of work we do per command, which is especially important if those commands are thrown away (e.g. because we are not inside a render pass). My expectation would be to see an improvement in #2161 due to us doing less work. The actual results are somewhat shocking: with v-sync enabled I'm getting the same 59-60 fps as usual. With v-sync OFF, I'm getting between 25 and 50 fps now (which is lower than the previous 50-70). Not sure what's going on, the instrumental profile doesn't give a clue. Please check out the code. 2166: Update example instructions in README.md r=kvark a=king6cong Co-authored-by: Hal Gentz <[email protected]> Co-authored-by: Dzmitry Malyshau <[email protected]> Co-authored-by: king6cong <[email protected]>
2164: [mtl] Borrowed commands r=grovesNL a=kvark PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: r? @gfx-rs/metallists This PR attempts to have lightweight software commands that don't take any heap space or own ObjC objects. In most cases, where a command list is live-recorded and executed once, this should reduce the amount of work we do per command, which is especially important if those commands are thrown away (e.g. because we are not inside a render pass). My expectation would be to see an improvement in #2161 due to us doing less work. The actual results are somewhat shocking: with v-sync enabled I'm getting the same 59-60 fps as usual. With v-sync OFF, I'm getting between 25 and 50 fps now (which is lower than the previous 50-70). Not sure what's going on, the instrumental profile doesn't give a clue. Please check out the code. 2166: Update example instructions in README.md r=kvark a=king6cong Co-authored-by: Dzmitry Malyshau <[email protected]> Co-authored-by: king6cong <[email protected]>
2164: [mtl] Borrowed commands r=grovesNL a=kvark PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: r? @gfx-rs/metallists This PR attempts to have lightweight software commands that don't take any heap space or own ObjC objects. In most cases, where a command list is live-recorded and executed once, this should reduce the amount of work we do per command, which is especially important if those commands are thrown away (e.g. because we are not inside a render pass). My expectation would be to see an improvement in #2161 due to us doing less work. The actual results are somewhat shocking: with v-sync enabled I'm getting the same 59-60 fps as usual. With v-sync OFF, I'm getting between 25 and 50 fps now (which is lower than the previous 50-70). Not sure what's going on, the instrumental profile doesn't give a clue. Please check out the code. Co-authored-by: Dzmitry Malyshau <[email protected]>
So getting counts of calls to objc_msgSend turned out to be more difficult than I thought. I started with lldb which was very, very slow. I switched to gdb, had to fix gdb and then it was still too slow. Finally I settled on using dtrace, which ran at a reasonable speed. My results are: |
@jrmuizel wow, quite a big difference! How did you create this count with dtrace? Is there a way we could break it down by object name/selector/call stack to determine the difference? |
Here's a per selector breakdown. MoltenVK
Gfx:
And the dtrace script to produce it
|
2164: [mtl] Borrowed commands r=grovesNL a=kvark PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: r? @gfx-rs/metallists This PR attempts to have lightweight software commands that don't take any heap space or own ObjC objects. In most cases, where a command list is live-recorded and executed once, this should reduce the amount of work we do per command, which is especially important if those commands are thrown away (e.g. because we are not inside a render pass). My expectation would be to see an improvement in #2161 due to us doing less work. The actual results are somewhat shocking: with v-sync enabled I'm getting the same 59-60 fps as usual. With v-sync OFF, I'm getting between 25 and 50 fps now (which is lower than the previous 50-70). Not sure what's going on, the instrumental profile doesn't give a clue. Please check out the code. Co-authored-by: Dzmitry Malyshau <[email protected]>
2164: [mtl] Borrowed commands r=grovesNL a=kvark PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: r? @gfx-rs/metallists This PR attempts to have lightweight software commands that don't take any heap space or own ObjC objects. In most cases, where a command list is live-recorded and executed once, this should reduce the amount of work we do per command, which is especially important if those commands are thrown away (e.g. because we are not inside a render pass). My expectation would be to see an improvement in #2161 due to us doing less work. The actual results are somewhat shocking: with v-sync enabled I'm getting the same 59-60 fps as usual. With v-sync OFF, I'm getting between 25 and 50 fps now (which is lower than the previous 50-70). Not sure what's going on, the instrumental profile doesn't give a clue. Please check out the code. Co-authored-by: Dzmitry Malyshau <[email protected]>
I took a look at where the majority of the
Calling |
For comparison here's what it looks like with MoltenVK
|
I did something similar last night (just logging all https://gist.github.com/grovesNL/80453d146327c87432c30191f3b1f579 You can generate this output in Xcode with a breakpoint on Most time seems to be retain/release on textures/samplers/buffers. This seems mostly due to how we use retain/release to match Rust ownership semantics, but we could bypass most of that and use raw pointers, or some other way. |
Here are stack traces of where the majority of the texture retain/releases are happening from 23 frames:
A big chunk of them are from pre_render_commands() and add_textures(). It feels like we're cloning them way more than we need to. |
Thanks @jrmuizel! I think we'll try to avoid these clones entirely by working with the pointer of the native object directly and upgrading them (probably transmute) to |
We should also consider disabling all interactions with the capture manager in release, or hide behind a feature. |
2175: mtl: Use pointers for temporary state r=kvark a=grovesNL Partially fix to #2161 PR checklist: - [ ] `make` succeeds (on *nix) - [ ] `make reftests` succeeds - [ ] tested examples with the following backends: This replaces buffer/texture/sampler object usage with pointers instead. This allows us to bypass Objective-C's reference counting (i.e. retain/release). The idea is to create the same objects as before, but purposely leak them and acquire a pointer to them instead. When we're finished with the objects, we take back ownership and release as usual. We should be able to do this either through `drop` on the owning struct, or through relevant `destroy_*` functions. Still heavily WIP (but works with quad/Dota 2) because there are a few places that will cause leaks or use-after-free in its current state (need to fix cases where the same image is returned for an image view, verify `drop` exists on all the correct owners, etc.) Co-authored-by: Joshua Groves <[email protected]>
Dota also attempts to use |
I tried to reduce the capacity and didn't see any visible impact. Created #2180 to possibly address this properly in the future. |
With all the optimization coming (e.g. #2183) the heaviest call stack that Instruments show is about creating a render pass. We do this on Vulkan render passes of course, but also quite a few times for the |
2183: Heap-less descriptor sets in Metal r=grovesNL a=kvark Fixes the heap allocations in descriptor sets, to help #2161. Edit: the improvement is there, but it's not the one we were looking for. Adds real value semantics for the `DescriptorPool`, which now owns the allocation and has the actual descriptor sets pointing to it. PR checklist: - [x] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: metal ~~Note: it doesn't work correctly yet, but I'd be happy to get the review comments and notes. Perhaps, you can spot the mistake? ;) Sadly, my CTS is non-operational atm, so it's not easy to figure out a test case. Note2: the individual commits are not build-able. Will squash once finished.~~ Co-authored-by: Dzmitry Malyshau <[email protected]>
2185: Various Metal performance optimizations r=grovesNL a=kvark Helps #2161 . I'm now getting 80-85 fps on the test run. Aside from Metal, also changes HAL to avoid heap allocation for vertex buffer binding. PR checklist: - [x] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: metal Co-authored-by: Dzmitry Malyshau <[email protected]>
Just updated a bunch of items in the issue body.
Those counts are already provided by @jrmuizel . We call it 129 times versus 108 in Molten. Still a difference to investigate, but not one to be too worried about.
I clarified the threading model now, and there appears to be no good reason for us to be CPU bound, on a main thread where we only run 1/3 of time... Dota just needs to use more threads for the job execution.
Some of it is addressed in #2185. Iterators are fine where they are applicable. In other places we have to resort to
With the actual threading model clarified, I now see ~5% time spent by MoltenVK in |
Here is what our frame looks like when in immediate recording mode: Some observations:
|
Ran Vulkan Dota 2 on Windows using RGP (trace file) to compare: |
2264: Render pass descriptor cache for Metal r=grovesNL a=kvark ~~Includes #2260~~ Fixes two of the performance issues in #2161 (RP desc locking and copying costs). Immediate recording FPS doesn't seem to change, maybe slightly lower (touching 90 from below more than from above, as I recall it doing - need to confirm. Edit - confirmed to not be caused by the PR). Deferred recording FPS seem to go from lower 100s to higher, or even from 100 to 110, roughly speaking. There are barely any bottlenecks left for it, outside of the general architecture. Main thread now spends about 14.5% in our code, which is mostly covered by driver interaction. PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: Metal - [ ] `rustfmt` run on changed code Co-authored-by: Dzmitry Malyshau <[email protected]>
I believe this can be closed now, our results were published on the blog |
Analyzing the Instruments profiles reveals quite a few things we could do better.
Native API calls
Too many ObjectiveC messages. Roughly, our profile shows 5-7% dedicated toobjc_msgSend
versus 0.3-0.5% in MoltenVK. The difference in retain/release calls is of a similar scale.retain
/release
calls - mtl: Use pointers for temporary state #2175Attribute access goes through messages - Metal direct property access #2167Class::get
calls are super slow - CacheClass::get
SSheldon/rust-objc#65 . We spend roughly 3% of our library's time doing those calls.Metal backend issues
Portability issues
Application concerns
This is something Dota should fix or expose for us to test.Things to investigate
The text was updated successfully, but these errors were encountered: