-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remote command sink in Metal #2260
Conversation
src/backend/vulkan/src/native.rs
Outdated
.map(|sets| { | ||
output.extend(sets | ||
.into_iter() | ||
.zip(layout_bindinds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change isn't part of this PR, but layout_bindinds
is a typo here
} | ||
|
||
/// Allocate one or multiple descriptor sets from the pool. | ||
/// | ||
/// Each descriptor set will be allocated from the pool according to the corresponding set layout. | ||
/// Descriptors will become invalid once the pool is reset. Usage of invalidated descriptor sets results | ||
/// in undefined behavior. | ||
fn allocate_sets<I>(&mut self, layouts: I) -> Vec<Result<B::DescriptorSet, AllocationError>> | ||
fn allocate_sets<I>(&mut self, layouts: I, sets: &mut Vec<B::DescriptorSet>) -> Result<(), AllocationError> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this signature could be confusing. For example, it's unclear from the signature whether the sets
's existing contents are overridden or extended. Would it be possible to somehow return impl Iterator
and let the caller decide what to do with the newly allocated sets? Then I guess we could save a Vec
allocation in allocate_set
too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a few attempts to make this API nicer before, and failed. impl Iterator
still has to be one concrete type, and different backends would not agree on it. Having it as an associated type of this trait then prevents any sort of default implementation... You are welcome to try it out :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cool to figure this out, because I think we could use that pattern in a few places. Anyway I don't think it's worth holding up this PR, we can change it later if we figure it out (or if impl Iterator
widens type support)
@@ -208,7 +208,7 @@ struct State { | |||
work_group_size: MTLSize, | |||
primitive_type: MTLPrimitiveType, | |||
resources_vs: StageResources, | |||
resources_fs: StageResources, | |||
resources_ps: StageResources, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: fs
is probably the correct Vulkan terminology if we're trying to stay consistent between HAL/Vulkan still
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but there is a line we need to draw between Vulkan the API and Metal the implementation. I felt like these resource states are closer to the latter and don't need to stick to Vulkan terms.
src/backend/metal/src/native.rs
Outdated
} | ||
|
||
impl PipelineLayout { | ||
/// Tet the first vertex buffer index to be used by attributes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tet
-> Get
src/backend/metal/src/native.rs
Outdated
|
||
impl PipelineLayout { | ||
/// Tet the first vertex buffer index to be used by attributes. | ||
pub(crate) fn attribute_buffer_index(&self) -> u32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could inline this too
assert!(is_encoding); | ||
match journal.passes.last() { | ||
Some(&(soft::Pass::Render(_), _)) => { | ||
journal.render_commands.extend(commands.into_iter().map(soft::RenderCommand::own)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain how journal.{render_commands|compute_commands}
are used in these cases now? We have a few state switches now between render passes/active encoders and it's a bit hard to see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR doesn't really change any journaling logic, just merges the code paths that were not DRY before.
For compute and render, we have 2 versions of the commands processing: one that ignores everything if we are not in the pass (because it's going to be stored in the internal state and still flushed when we start one, later on), and one that expects the current pass. What my cleanup does is implementing the latter on top of the former.
@grovesNL thanks for the review! I addressed the notes (some in code, some in comments). Please check out the answers and proceed if satisfied. |
I've added some basic time tracking:
From here it's fairly obvious why the Remote sink doesn't show any benefit: it is expected to save us some submission time comparing to Deferred, but that gain is overweight by the increased time to record. Trick is - that recording time wasn't supposed to be affected at all. It's only bad because we aren't yet able to re-use the heap allocated for the commands. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remaining changes seem reasonable to me
bors r+ |
2260: Remote command sink in Metal r=grovesNL a=kvark Fixes #2259 The results so far are not super promising - highly unstable (presumably, because of the dispatch), with performance around `Immediate` mark. We are still missing the most important follow-up here - to avoid any heap allocations when recording commands. Currently, it just goes with `Vec::new()` and grows it for each pass, which shows up in the profile quite a bit. The PR also has a bunch of stuff in general optimizations: - HAL change in the descriptor allocation API to avoid the heap - lighten up Metal descriptor binding path (a bit) by making sure there is enough state slots in advance - simplification and refactoring of `CommandSink` implementations PR checklist: - [ ] `make` succeeds (on *nix) - [ ] `make reftests` succeeds - [x] tested examples with the following backends: Metal - [ ] `rustfmt` run on changed code Co-authored-by: Dzmitry Malyshau <[email protected]>
2264: Render pass descriptor cache for Metal r=grovesNL a=kvark ~~Includes #2260~~ Fixes two of the performance issues in #2161 (RP desc locking and copying costs). Immediate recording FPS doesn't seem to change, maybe slightly lower (touching 90 from below more than from above, as I recall it doing - need to confirm. Edit - confirmed to not be caused by the PR). Deferred recording FPS seem to go from lower 100s to higher, or even from 100 to 110, roughly speaking. There are barely any bottlenecks left for it, outside of the general architecture. Main thread now spends about 14.5% in our code, which is mostly covered by driver interaction. PR checklist: - [ ] `make` succeeds (on *nix) - [x] `make reftests` succeeds - [x] tested examples with the following backends: Metal - [ ] `rustfmt` run on changed code Co-authored-by: Dzmitry Malyshau <[email protected]>
Fixes #2259
The results so far are not super promising - highly unstable (presumably, because of the dispatch), with performance around
Immediate
mark. We are still missing the most important follow-up here - to avoid any heap allocations when recording commands. Currently, it just goes withVec::new()
and grows it for each pass, which shows up in the profile quite a bit.The PR also has a bunch of stuff in general optimizations:
CommandSink
implementationsPR checklist:
make
succeeds (on *nix)make reftests
succeedsrustfmt
run on changed code