Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote command sink in Metal #2260

Merged
merged 5 commits into from
Jul 24, 2018
Merged

Remote command sink in Metal #2260

merged 5 commits into from
Jul 24, 2018

Conversation

kvark
Copy link
Member

@kvark kvark commented Jul 21, 2018

Fixes #2259
The results so far are not super promising - highly unstable (presumably, because of the dispatch), with performance around Immediate mark. We are still missing the most important follow-up here - to avoid any heap allocations when recording commands. Currently, it just goes with Vec::new() and grows it for each pass, which shows up in the profile quite a bit.

The PR also has a bunch of stuff in general optimizations:

  • HAL change in the descriptor allocation API to avoid the heap
  • lighten up Metal descriptor binding path (a bit) by making sure there is enough state slots in advance
  • simplification and refactoring of CommandSink implementations

PR checklist:

  • make succeeds (on *nix)
  • make reftests succeeds
  • tested examples with the following backends: Metal
  • rustfmt run on changed code

@kvark kvark requested a review from grovesNL July 21, 2018 03:50
.map(|sets| {
output.extend(sets
.into_iter()
.zip(layout_bindinds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change isn't part of this PR, but layout_bindinds is a typo here

}

/// Allocate one or multiple descriptor sets from the pool.
///
/// Each descriptor set will be allocated from the pool according to the corresponding set layout.
/// Descriptors will become invalid once the pool is reset. Usage of invalidated descriptor sets results
/// in undefined behavior.
fn allocate_sets<I>(&mut self, layouts: I) -> Vec<Result<B::DescriptorSet, AllocationError>>
fn allocate_sets<I>(&mut self, layouts: I, sets: &mut Vec<B::DescriptorSet>) -> Result<(), AllocationError>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this signature could be confusing. For example, it's unclear from the signature whether the sets's existing contents are overridden or extended. Would it be possible to somehow return impl Iterator and let the caller decide what to do with the newly allocated sets? Then I guess we could save a Vec allocation in allocate_set too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a few attempts to make this API nicer before, and failed. impl Iterator still has to be one concrete type, and different backends would not agree on it. Having it as an associated type of this trait then prevents any sort of default implementation... You are welcome to try it out :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool to figure this out, because I think we could use that pattern in a few places. Anyway I don't think it's worth holding up this PR, we can change it later if we figure it out (or if impl Iterator widens type support)

@@ -208,7 +208,7 @@ struct State {
work_group_size: MTLSize,
primitive_type: MTLPrimitiveType,
resources_vs: StageResources,
resources_fs: StageResources,
resources_ps: StageResources,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fs is probably the correct Vulkan terminology if we're trying to stay consistent between HAL/Vulkan still

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but there is a line we need to draw between Vulkan the API and Metal the implementation. I felt like these resource states are closer to the latter and don't need to stick to Vulkan terms.

}

impl PipelineLayout {
/// Tet the first vertex buffer index to be used by attributes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tet -> Get


impl PipelineLayout {
/// Tet the first vertex buffer index to be used by attributes.
pub(crate) fn attribute_buffer_index(&self) -> u32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could inline this too

assert!(is_encoding);
match journal.passes.last() {
Some(&(soft::Pass::Render(_), _)) => {
journal.render_commands.extend(commands.into_iter().map(soft::RenderCommand::own));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain how journal.{render_commands|compute_commands} are used in these cases now? We have a few state switches now between render passes/active encoders and it's a bit hard to see

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR doesn't really change any journaling logic, just merges the code paths that were not DRY before.

For compute and render, we have 2 versions of the commands processing: one that ignores everything if we are not in the pass (because it's going to be stored in the internal state and still flushed when we start one, later on), and one that expects the current pass. What my cleanup does is implementing the latter on top of the former.

@kvark
Copy link
Member Author

kvark commented Jul 21, 2018

@grovesNL thanks for the review! I addressed the notes (some in code, some in comments). Please check out the answers and proceed if satisfied.

@kvark
Copy link
Member Author

kvark commented Jul 23, 2018

I've added some basic time tracking:

Immediate Deferred Remote
record begin to submit 9.0ms 8.0ms 10.6ms
record finish to submit 0.02ms 0.03ms 0.02ms
frame wait 0.015ms 0.015ms 0.015ms
submission 0.05ms 2.9ms 0.5ms

From here it's fairly obvious why the Remote sink doesn't show any benefit: it is expected to save us some submission time comparing to Deferred, but that gain is overweight by the increased time to record. Trick is - that recording time wasn't supposed to be affected at all. It's only bad because we aren't yet able to re-use the heap allocated for the commands.

Copy link
Contributor

@grovesNL grovesNL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remaining changes seem reasonable to me

@grovesNL
Copy link
Contributor

bors r+

bors bot added a commit that referenced this pull request Jul 24, 2018
2260: Remote command sink in Metal r=grovesNL a=kvark

Fixes #2259
The results so far are not super promising - highly unstable (presumably, because of the dispatch), with performance around `Immediate` mark. We are still missing the most important follow-up here - to avoid any heap allocations when recording commands. Currently, it just goes with `Vec::new()` and grows it for each pass, which shows up in the profile quite a bit.

The PR also has a bunch of stuff in general optimizations:
  - HAL change in the descriptor allocation API to avoid the heap
  - lighten up Metal descriptor binding path (a bit) by making sure there is enough state slots in advance
  - simplification and refactoring of `CommandSink` implementations

PR checklist:
- [ ] `make` succeeds (on *nix)
- [ ] `make reftests` succeeds
- [x] tested examples with the following backends: Metal
- [ ] `rustfmt` run on changed code


Co-authored-by: Dzmitry Malyshau <[email protected]>
@bors
Copy link
Contributor

bors bot commented Jul 24, 2018

@bors bors bot merged commit b5266e7 into gfx-rs:master Jul 24, 2018
@kvark kvark deleted the mtl-remote branch July 24, 2018 01:59
bors bot added a commit that referenced this pull request Jul 24, 2018
2264: Render pass descriptor cache for Metal r=grovesNL a=kvark

~~Includes #2260~~
Fixes two of the performance issues in  #2161 (RP desc locking and copying costs).

Immediate recording FPS doesn't seem to change, maybe slightly lower (touching 90 from below more than from above, as I recall it doing - need to confirm. Edit - confirmed to not be caused by the PR).
Deferred recording FPS seem to go from lower 100s to higher, or even from 100 to 110, roughly speaking. There are barely any bottlenecks left for it, outside of the general architecture. Main thread now spends about 14.5% in our code, which is mostly covered by driver interaction.

PR checklist:
- [ ] `make` succeeds (on *nix)
- [x] `make reftests` succeeds
- [x] tested examples with the following backends: Metal
- [ ] `rustfmt` run on changed code


Co-authored-by: Dzmitry Malyshau <[email protected]>
@kvark kvark mentioned this pull request Jul 24, 2018
22 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants