-
Beyond the technical merits you've outlined for these three paths, I'd like to discuss the use of the two potential dependencies, rafx and wgpu. You've touched on this in a few different places, but I think it's worth highlighting on its own. I can distill my thoughts down to two areas: goals and risk.

I hope this doesn't come off as a platitude, but how well do Bevy's goals align with wgpu and rafx? Assuming the two projects continue in perpetuity, are they - or at least the sub-APIs we intend to use - going to meet our needs? You've outlined the current state in the "wgpu vs rafx-api" section really well, but I'm also curious about the future goals of the projects. Is wgpu aiming to be a general-purpose renderer? Is rafx trying to target certain use-cases? Are either of them hoping to target consoles? Is there a risk Bevy will diverge from their stated goals?

While rafx looks very promising in your technical comparison, it also looks like a much higher risk for abandonment. That's not to say I expect the maintainers to just walk away from it, but wgpu has, as you said, a full-time developer and much more investment. Are we prepared to fork or take ownership of rafx if it loses steam? On top of that, wgpu already has an ecosystem growing around it: learning materials, community, example content. Those things have a lot of hard-to-quantify value, especially for an open-source project like Bevy that will really benefit from having a larger community of contributors to draw from.

I understand you mentioned most, if not all, of these points. However, my perceived risk of using rafx long term gives me pause. That said, I don't know what that value tradeoff is; maybe there are technical merits of rafx that outweigh the associated risk!
-
Thanks for sharing this @cart! As a note, I am generally positive about a lot of the things you've written. To be brief, I will avoid +1-ing things and try to only comment where I think I have something to add.
The modularity is going in a good direction, but I'm not sure full render graphs can be implemented in a way that does not require writing render graph code to glue the pieces together efficiently and sensibly. Also, modifying things usually means modifying the main pass to make use of the thing you added, so without a way of composing shaders, people will still have to fork the main pass shaders and adapt them to all their pieces.
For transparency and context on my opinions - I'm of two minds about abstracting away external dependencies. In the long term it's useful to be able to swap out backends without having to change user code. But it also adds another layer (or multiple layers) on top, where features have to be added to expose underlying functionality in the external dependency. It risks a lowest-common-denominator abstraction that limits what can be done. That said, the existence of bevy_render and bevy_wgpu doesn't require their use, so this abstraction approach doesn't have to limit developers; rather, it provides an option and a solid default.
The automatic bindings were a nice idea but, as noted previously, debugging them is very difficult. Also, @mtsr added a GlobalRenderResourcesNode in a PR to be able to bind ECS Resources, and trying to understand how to add texture support to it by looking at the implementations of the other resource nodes was really difficult. Bindings must be flexible and simple to implement in order to support any custom shader work, whether it's custom materials, main renderer techniques, post-processing, whatever. Getting data in/out and routing it to the right place needs to be simple for good developer UX. If we don't have a good 'automatic' way of binding, then there should exist a simple, explicit, manual way of doing it to enable the exceptions.
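For what it's worth, here is a minimal sketch of what an explicit, manual binding path can look like in raw wgpu (this is not Bevy's current API; the layout, texture view, sampler, and uniform buffer are assumed to already exist):

```rust
// Manually build a bind group for a texture + sampler + uniform buffer.
// `layout` must match the shader's binding declarations.
fn create_material_bind_group(
    device: &wgpu::Device,
    layout: &wgpu::BindGroupLayout,
    texture_view: &wgpu::TextureView,
    sampler: &wgpu::Sampler,
    uniform_buffer: &wgpu::Buffer,
) -> wgpu::BindGroup {
    device.create_bind_group(&wgpu::BindGroupDescriptor {
        label: Some("material_bind_group"),
        layout,
        entries: &[
            wgpu::BindGroupEntry {
                binding: 0,
                resource: wgpu::BindingResource::TextureView(texture_view),
            },
            wgpu::BindGroupEntry {
                binding: 1,
                resource: wgpu::BindingResource::Sampler(sampler),
            },
            wgpu::BindGroupEntry {
                binding: 2,
                resource: uniform_buffer.as_entire_binding(),
            },
        ],
    })
}
```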
I asked on Discord, but I think it's useful context for the discussion of the prototype later: why did you choose "draw as many sprites on screen at 60fps as possible" as the prototype to focus on? Is it because this is a good basic test of renderer overhead?
I like the sound of this goal, and I appreciate prioritising UX for whoever the users are. I think perhaps the render resources stuff was 'rushing' into a higher-level solution before the fundamental pieces were in place. I think if we can find the right fundamentals, we will be able to support good convenience APIs as well as providing still-simple-but-probably-verbose fallbacks for when those don't work. If a convenience API doesn't work for someone's use case, we don't want to leave them hanging.
I'm trying out rafx by implementing the SSAO stuff I've been doing in a fork of bevy, in the rafx demo. I wanted to get a feel for its APIs to help me learn how things are in rafx, other ways of doing these things, generally get more knowledge about renderers to be able to have more informed / less naive opinions.
What are ZSTs? This is a neat idea and looks clean from the PoC code. This solution allows us to get the data from the app world, put it into data structures that help prepare the data for rendering, do that preparation, and render.
This looks clean and should encapsulate getting the data from the app, to the render systems, bound and ready for rendering. I like it.
When reading, it was clear to see how having the renderer as a sub-app with its own schedule would enable pipelining.
I feel like this is a stupid question because I just don't know enough, but how do texture bindings fit into this?
How do you imagine we would handle updating them? Similar to having many draw calls, making lots of buffer copy calls seems slow, and at the same time bandwidth is limited. I feel like we need some way of efficiently updating parts of such uniform vectors that tries to minimise both the number of copy calls and the amount of data copied.
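Purely as an illustration (not something the PoC does), here is a sketch of one strategy, assuming wgpu and bytemuck: track dirty index ranges on the CPU, merge adjacent ranges, and issue one write_buffer per merged range:

```rust
use std::ops::Range;

// CPU-side mirror of a GPU buffer that tracks dirty ranges so only the
// changed regions are uploaded each frame.
struct TrackedVec<T: bytemuck::Pod> {
    values: Vec<T>,
    dirty: Vec<Range<usize>>, // element index ranges, merged before upload
}

impl<T: bytemuck::Pod> TrackedVec<T> {
    fn set(&mut self, index: usize, value: T) {
        self.values[index] = value;
        self.dirty.push(index..index + 1);
    }

    // Merge overlapping/adjacent ranges, then issue one copy per merged range.
    fn flush(&mut self, queue: &wgpu::Queue, buffer: &wgpu::Buffer) {
        let mut dirty = std::mem::take(&mut self.dirty);
        dirty.sort_by_key(|r| r.start);
        let mut merged: Vec<Range<usize>> = Vec::new();
        for r in dirty {
            if let Some(last) = merged.last_mut() {
                if r.start <= last.end {
                    last.end = last.end.max(r.end);
                    continue;
                }
            }
            merged.push(r);
        }
        for r in merged {
            let offset = (r.start * std::mem::size_of::<T>()) as u64;
            // wgpu requires copy offsets/sizes to be 4-byte aligned; that holds
            // for typical Pod types like [f32; 16] but is worth checking.
            queue.write_buffer(buffer, offset, bytemuck::cast_slice(&self.values[r]));
        }
    }
}
```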
I like this. Presumably these *Vecs containing only one item are also just fine?
Why did you 'vendor' Crevice? Did you have to make changes to it?
I think for bindings in general we need to consider the different binding rates and data sources. Rafx suggests per view/material/object bindings, as well as per-pass configuration. And from what I've been doing so far, it's seemed odd to me that I haven't been able to bind an ECS resource. I feel one should be able to bind components or resources, and if they contain handles to assets then that needs to be handled (ha! ;p ) too.
While working on SSAO with the current bevy_render, I had to use another of @mtsr's PRs to make Draw and RenderPipelines generic on the pass component, so that I could run passes over the mesh entities in a depth/normal pre-pass and in the main pass. How do you intend to handle that?
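For anyone who hasn't seen that PR, a minimal sketch of the "generic over the pass" idea; the marker types and fields below are invented for illustration and are not the PR's actual code:

```rust
use std::marker::PhantomData;

// Pass marker types: zero-sized tags that select which pass a component
// belongs to.
struct MainPass;
struct DepthNormalPrepass;

// The same logical component, instantiated once per pass marker, so one
// entity can participate in several passes independently.
struct Draw<P> {
    is_visible: bool,
    _pass: PhantomData<P>,
}

fn main() {
    // An entity rendered in both the prepass and the main pass carries both:
    let _prepass = Draw::<DepthNormalPrepass> { is_visible: true, _pass: PhantomData };
    let _main = Draw::<MainPass> { is_visible: true, _pass: PhantomData };
}
```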
Where is ParamState meant to be used?
Don't forget Res (ECS resources)! :)
I saw the sprite and camera extraction and preparation code and it looks very clean. I like it!
I am happy that I won't have to debug this again. :) I'm sure I will have to debug new problems though. Hopefully the new problems need debugging less and are simpler to debug.
If it's clean to do, I'm happy this is becoming explicit too.
This made me pause - is it really better to reconstruct all this data on the CPU every single frame? Is the amount of data just using quads and model matrices really that much? 80k * 4*4 * 4 bytes per float is just over 5MB which doesn't seem like much data to me. Still, this isn't relevant to the discussion as this structure allows you to do whatever you like. It's a detail. :) That it is a detail is a great strength.
This flexibility is important and great!
I like this for defining contained units of processing. It would probably be good if we try to implement them so they can also output intermediate results for reuse in other sub-graphs.
I am of course interested in trying to implement SSAO within this setup. :) It doesn't exercise different views, but it does exercise different bindings.
YES. I care about this.
Good question. It makes me think about the per-view/-material/-object/-pass data again.
It should be generally available within a view, in my opinion.
What do these templates look like in practice? They define what data is available to you and what you have to give back and then your shader can slot into the hole?
I like flat APIs and I don't like deep APIs. Deep APIs too often add a lot of structure that is difficult to rework, they make adding new features slow, and they are difficult to understand because of the many layers - and because that structure exists only in the one framework, the knowledge is non-transferable. I don't have an opinion on wgpu vs rafx-api yet though.
I like the flatness. I do like the simplicity of wgpu's APIs compared to raw Vulkan though. I don't know what rafx-api looks like yet.
I do think your implementation is simple. I need to look at the rafx sprite feature to compare but I'll have to do that another time.
I think this is good and necessary to support in some way, to be able to configure renderer settings. Many games require some kind of restart to change settings; these don't necessarily have to be things you can toggle from one frame to the next, but if it's not complicated to support, I think it's nice. If nothing else, it's nice during development to be able to toggle things on and off to compare. That also makes for nice engine demo videos. :)
I like where you're going with this.
My main concern about the tight-integration-in-bevy approach is that it is a layer of abstraction on top of other things, and that app developers will need to wait for features in graphics APIs to bubble up through the layers before they can start to use them. And if the layers are opinionated and want to be done well, that always takes time. Again, I recognise that if this is too much of a problem for people, they can bypass bevy_render and related crates and make their own renderer that uses the APIs directly, accepting that they are on their own when porting to/from that approach.
-
I think it is important to introduce views as a clear concept in the renderer. They nicely scope the execution of a graph from the perspective of a camera and should serve as a sensible point of collection for code relating to that.

On plugins as subgraphs - if the plugins are low enough level then I can imagine implementing a depth/normal prepass plugin which has depth and view-space normal textures as output, gets run for whatever views you configure it to run for, obtains its camera bindings from the view, culls/identifies visibility for the view, sorts from the perspective of the view, etc. Then another plugin for SSAO or other AO implementations takes depth and normals from the depth/normal prepass, a noise texture from the app's world (following @cart's model), and outputs an AO texture. Again, the camera bindings are provided from the view. A blur plugin could/would run an X pass and a Y pass on the input texture and provide an output texture (maybe you provide both input and output textures so you can swap them over for a second pass to save space and/or see the intermediate state or something), and that could then be used both for SSAO and for bloom, just depending on what texture you give it. So it would need to handle different texture formats and blur the components appropriately. I just realised that the blurring for SSAO needs depth as well, to avoid blurring AO across significant depth differences or around corners, so maybe in practice it's a different pass or a different shader - but if it weren't, that kind of plugin structure sounds like a nice unit.

Render features in the Bungie Destiny architecture are more end-to-end though. I don't know how they share things, but it feels like a layer on top of this that says 'I need to run depth/normal prepass, SSAO, and blur, hooked up and configured in this way', and something else says 'I need depth/normal prepass and the opaque pass, hooked up and configured in this way', and then something that hooks the SSAO into the main pass. These are fuzzy thoughts, but I'm seeing hierarchical groupings.

I feel like these are really hard problems to solve up front though, and it would likely be better to take this in a couple of stages where we try to pin down the foundations, then build some stuff, and as we build, all that information feeds into what we need for convenience layers on top. To me it feels way too complicated to try to design up front. What do you think?
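To make the shape of that prepass → SSAO → blur chain concrete, here is a toy model; every type and name below is invented for illustration and is not an existing bevy_render or rafx API:

```rust
// A node declares named texture inputs/outputs; edges connect a producer's
// output to a consumer's input. The per-view camera bindings are assumed to
// come from the view itself rather than from an edge.
struct NodeDesc {
    name: &'static str,
    inputs: &'static [&'static str],
    outputs: &'static [&'static str],
}

fn ssao_subgraph() -> (Vec<NodeDesc>, Vec<(&'static str, &'static str)>) {
    let nodes = vec![
        NodeDesc {
            name: "depth_normal_prepass",
            inputs: &[],
            outputs: &["depth", "view_normals"],
        },
        // "noise" would be provided from the app world rather than by another node.
        NodeDesc {
            name: "ssao",
            inputs: &["depth", "view_normals", "noise"],
            outputs: &["raw_ao"],
        },
        // Depth is also an input here so the blur can avoid bleeding AO across
        // large depth discontinuities.
        NodeDesc {
            name: "blur",
            inputs: &["raw_ao", "depth"],
            outputs: &["ao"],
        },
    ];
    let edges = vec![
        ("depth_normal_prepass.depth", "ssao.depth"),
        ("depth_normal_prepass.view_normals", "ssao.view_normals"),
        ("ssao.raw_ao", "blur.raw_ao"),
        ("depth_normal_prepass.depth", "blur.depth"),
    ];
    (nodes, edges)
}
```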
-
More thoughts on wgpu and rafx - I think the division of crates, if I have understood the purpose of each of them, is good:
Assuming this understanding of bevy's and rafx's render infrastructures is correct, I see two very separate concerns, which are probably common questions and decision points:
If we want to be able to change out renderer infrastructure without affecting app code (too much?), then we need types that can be used when developing apps and that are easy to map to bevy_render and/or rafx-framework. I feel like these types should/must live outside of bevy_render so that one can build things without bevy_render. Do you agree?

If we want to be able to change out the graphics API / graphics API abstraction, then we will need a well-defined interface between bevy_render (or rafx-framework, which would need the same) and said API. These seem to be bevy_wgpu and rafx-api.

I personally am not so concerned about the rafx-api versus wgpu question. Perhaps others are, and I don't mean to say this should not be discussed now if others think it is important to decide at this point. However, from what I have seen of the activity (or desired activity) in the community, being able to build renderer features is the focus and priority need. That is the question about the high-level API, so rafx-framework or bevy_render. If we had the common types that apps use, we could build both in parallel to test that we can swap out the renderer without needing to change app code, if that is an interesting and desirable goal. Is it?

The proposed design is similar to the Bungie Destiny architecture and similar to rafx. I think that's a good thing; it seems to be a good renderer architecture for performance and flexibility. As for whether to do one, the other, or both... let the discussion continue. :D
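As a rough sketch of the "common app-facing types outside bevy_render" idea (every name below is invented for illustration, not an existing crate or trait):

```rust
// Backend-agnostic handles that app code depends on. They live in their own
// crate, so apps can be written without depending on bevy_render itself.
pub struct MeshHandle(pub u64);
pub struct TextureHandle(pub u64);

// Each renderer infrastructure (bevy_render + bevy_wgpu, rafx-framework, a
// fully custom renderer, ...) provides an adapter that maps the shared types
// onto its own resources.
pub trait RendererBackend {
    fn upload_mesh(&mut self, vertices: &[[f32; 3]], indices: &[u32]) -> MeshHandle;
    fn upload_texture(&mut self, width: u32, height: u32, rgba: &[u8]) -> TextureHandle;
    fn draw(&mut self, mesh: &MeshHandle, texture: &TextureHandle);
}
```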
-
Why would this be controversial? From my perspective working with end users, this slightly reduces boilerplate and is unlikely to have any other consequences in the common case.
-
I like this quite a bit; I think this should get its own PR independent of the rendering work.
-
This conversation continued on Discord. To summarize:
-
Benchmark
First, I want to make a few comments about this benchmark. As a rough litmus test for "can my system draw >10k things", it might be ok, but it's not a realistic workload.
If you implement this benchmark naively in rafx, it will run slower. As an example, let's say we draw 30k sprites per frame: those 30k sprites result in 30k visibility structure updates per frame (because everything is moving), a visibility query across all 30k sprites that culls nothing (because everything is on screen), and an unsuccessful attempt to batch the 30k sprites (because none of the sprites are on the same Z level). Also keep in mind that having a visibility system results in extraction being random-access to all visible entities instead of a linear-access query across all entities. So that's slower too.

We tried an experiment where we stripped out visibility, sprite batching, and much of the frame packet plumbing (which would enable us to split heavy jobs across threads), and exceeded the prototype's performance by 20% (measured by the number of entities before the frame rate dropped below 60fps). It was a worthwhile learning exercise, but we believe removing these systems would be harmful for real workloads.

Keep in mind, if you have a bunch of static sprites (the common case), you can batch them together offline and treat chunks of them as single entities. This is exactly what we do in our LDTK (tile map editor) render feature. The processing happens in distill when the asset is imported. At runtime, rendering the largest LDTK example map requires no visibility updates, a query across 20 visible objects, and no vertex/index buffer allocation or sprite batching logic. This is certainly an apples/oranges comparison with the benchmark. But I think just in general, backing up and asking "why do I need to render this many sprites" produces a better solution in the end. So I would be careful with this benchmark, as it may lead to optimizing the wrong things.
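A minimal sketch of that offline chunking idea (illustrative only, not the actual LDTK render feature, which does this at asset-import time in distill):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
struct StaticSprite {
    position: [f32; 2],
    size: [f32; 2],
    uv_min: [f32; 2],
    uv_max: [f32; 2],
}

const CHUNK_SIZE: f32 = 512.0;

// Group static sprites into fixed-size world-space chunks. Each chunk would
// then be baked into a single vertex/index buffer and treated as one entity
// for visibility and drawing, instead of tens of thousands of tiny ones.
fn chunk_static_sprites(sprites: &[StaticSprite]) -> HashMap<(i32, i32), Vec<StaticSprite>> {
    let mut chunks: HashMap<(i32, i32), Vec<StaticSprite>> = HashMap::new();
    for sprite in sprites {
        let key = (
            (sprite.position[0] / CHUNK_SIZE).floor() as i32,
            (sprite.position[1] / CHUNK_SIZE).floor() as i32,
        );
        chunks.entry(key).or_default().push(*sprite);
    }
    chunks
}
```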
Responding to a few comments
You can enable multiple crate features (i.e. --features "rafx-vulkan,rafx-metal") to produce a binary with as many backends as you like. Then you can attempt to initialize them in your preferred priority order, falling back to a different choice until you run out of choices. I do think this is something that should be improved in rafx-api. "Which backend should be preferred if multiple are available" is an opinionated choice though. I think rafx-api should be extended to provide a bit more data about the GPUs/APIs available on the system, and some other high-level code should choose (possibly referencing a config file that bans certain GPUs from certain APIs due to known bugs). I don't see much technical risk here, just a matter of prioritization. We're generally tackling the highest-technical-risk issues first.
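A sketch of what that fallback order could look like. The cargo features match the ones mentioned above; the try_init_* functions are hypothetical stubs, not actual rafx-api calls:

```rust
enum Backend {
    Vulkan,
    Metal,
}

// Hypothetical per-backend probes; a real implementation would call into
// rafx-api here and return None if initialization fails.
fn try_init_vulkan() -> Option<Backend> { None }
fn try_init_metal() -> Option<Backend> { None }

fn pick_backend() -> Option<Backend> {
    let mut candidates: Vec<fn() -> Option<Backend>> = Vec::new();
    #[cfg(feature = "rafx-vulkan")]
    candidates.push(try_init_vulkan);
    #[cfg(feature = "rafx-metal")]
    candidates.push(try_init_metal);

    // The first backend that initializes successfully wins; a config file could
    // reorder candidates or ban known-bad GPU/API combinations.
    candidates.into_iter().find_map(|init| init())
}
```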
I think any abstraction that introduces new concepts (such as the ones in the Destiny talk) will need names for those concepts (for example, anything to do with visibility, or the plumbing that merges/sorts draw calls). This is exacerbated by some concepts (like descriptor sets) needing multiple levels of abstraction (i.e. an API-specific descriptor set vs. something higher level and ref-counted). Some systems also become more complicated in order to support multi-threaded usage without excessive locking. For example, in rafx-framework you might first acquire an allocator, and then use that to create N of something else. This allows us to have a single critical section that isn't beholden to a graphics API call returning quickly. This adds new types: the allocator, the "allocator allocator", and plumbing for some sort of chunking (which allows locking granularity to be somewhere between "global" and "one per instance"). I'm not sure how bevy will avoid the same problem if it has something similar to rafx-framework in it (aside from cutting features). While I'm sure there are some improvements that could be made, the solution is rarely simpler than the problem itself, and I think it's easy to underestimate the complexity of this problem (assuming you want to scale - both in terms of performance and supporting a wide variety of use-cases).
The reason rafx has more structure is because it uses the “frame packet” approach described in the destiny talk. This architecture reduces duplicated work and allows for parallel processing. (Both processing multiple features in parallel, and splitting a single feature’s workload across multiple threads)
Visibility/culling also needs to be considered here, and the frame packet approach fits this well. I think the prototype will need a significant amount of changes and API redesign to support this.
There are three macros, but they are slightly different flavors of the same thing. The main reason they exist is to allow all phases/features to be registered at runtime with an integer that's 0..N (friendly to array indexing and bitfields), which can then be accessed cheaply from anywhere. I have a strong distaste for macros too! I tried very hard to avoid them, but I didn't find a better solution that allowed BOTH non-intrusive registration of new features AND easy, cheap access to the registered index from anywhere.
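For context, a sketch of the underlying idea (a dense runtime index per registered feature, cheap to read from anywhere). This illustrates the concept only; it is not rafx's implementation and doesn't solve the ergonomics the macros solve:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Every feature gets a dense index in 0..N, assigned once at startup, which
// is friendly to array indexing and bitfields.
static FEATURE_COUNT: AtomicUsize = AtomicUsize::new(0);

fn register_feature() -> usize {
    FEATURE_COUNT.fetch_add(1, Ordering::Relaxed)
}

// Without a macro, each feature still needs a globally reachable place to
// stash its index, e.g. a once-initialized static per feature.
static SPRITE_FEATURE_INDEX: OnceLock<usize> = OnceLock::new();

fn sprite_feature_index() -> usize {
    *SPRITE_FEATURE_INDEX.get_or_init(register_feature)
}
```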
This is something I’d like to try to solve in the future. It will stay “immediate mode” because render graphs could change significantly from frame to frame depending on many factors. There’s no reason an immediate mode API could not also allow for “patching” the graph that has been built so far by other plugins.
This is by design to permit jobs to run concurrently. Anything that needs to be shared across jobs of different types can be registered as a render resource. There's no reason an ECS world couldn't be a render resource, but I haven't personally found a case where sharing per-entity data across features is useful.
Miscellaneous
Closing thoughts
-
How does ...? What's the best choice for Bevy, considering that ...?
-
Just adding my 2 cents: no matter what low-level API we're going to use, I would like to maintain the ability to retrieve the (dx/gl/ash) instance and directly call the APIs in my render graph. My work involves custom voxel raytracing pipelines that are highly specialized, use many API-specific features, and differ significantly from a traditional rasterization pipeline. For example, my current implementation uses Sparse Binding and Sparse Residency for manual memory management, and I use a compute shader to render directly onto the framebuffer.

Right now, I have completely disabled bevy_wgpu and bevy_render and render directly onto the winit window each frame. Ideally, I still want to take advantage of the bevy rendering pipeline for a small number of rasterized items and UI. The custom raytracing pipeline would render to an image which gets blended into the framebuffer during rasterization. Unfortunately, this is not easy to do, primarily because wgpu hides the wgpu-core instance, which hides the gfx instance, which hides the ash instance. If we use rafx instead, I would imagine it would be much easier to fully expose the underlying APIs.
-
bevy_render: The Current State of main
In my opinion the current bevy_render gets a lot of things right:
However it also has a number of significant shortcomings:
Bevy is now being used at a scale where these shortcomings are no longer acceptable. It's time to rework our rendering abstractions. My goals for this rework:
Potential Paths Forward
There are many paths we could take here, but I want to scope this conversation to three options:
1. Extend bevy_render and continue using wgpu
2. Extend bevy_render, but use rafx-api
3. Replace the RenderContext / RenderResourceContext abstractions (the rafx-framework path)
bevy_render Rework: Initial Proof of Concept
This first proof of concept fleshes out Path (1).
The code is available here.
All pipelined code / plugins live in the top level pipelined folder. This rework is completely decoupled from the original render code (it isn't a full rewrite, but it does change a lot).
This currently aims at making "low-ish level" code pipelined. The high level abstractions have been stripped out and new ones will need to be designed and built. But we should focus on making the low-ish level abstractions good first.
Render App Model
Render App Stages
SubApps
To enable a separate "app world", "app schedule", "render world", and "render schedule", I added SubApps. SubApps have their own World and Schedule. They are owned by the main App. Currently they are identified by an integer index for simplicity of implementation, but a final implementation would probably use ZSTs for identifiers.
Currently I don't actually parallel-pipeline subapp execution because it will involve some thought on how we interact with winit. But the pieces are all there / the dataflow is defined in the right way.
I am not convinced this is the best API yet, but it's relatively simple and gets the job done. We've been discussing APIs like this in the SubWorlds RFC.
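A minimal model of what ZST sub-app identifiers could look like; the type and method names below are invented for illustration (the PoC itself uses integer indices):

```rust
use std::any::TypeId;
use std::collections::HashMap;

struct World;    // stand-ins for the real bevy_ecs types
struct Schedule;

struct SubApp {
    world: World,
    schedule: Schedule,
}

// The label is a zero-sized type: it costs nothing at runtime and gives a
// compile-time-checked name instead of a magic integer index.
struct RenderApp;

#[derive(Default)]
struct App {
    sub_apps: HashMap<TypeId, SubApp>,
}

impl App {
    fn add_sub_app<Label: 'static>(&mut self, sub_app: SubApp) -> &mut Self {
        self.sub_apps.insert(TypeId::of::<Label>(), sub_app);
        self
    }

    fn sub_app_mut<Label: 'static>(&mut self) -> &mut SubApp {
        self.sub_apps
            .get_mut(&TypeId::of::<Label>())
            .expect("sub app not registered")
    }
}
```

Registration would then look like app.add_sub_app::<RenderApp>(sub_app), and lookup like app.sub_app_mut::<RenderApp>().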
New Abstractions
BufferVec<T: Pod>, UniformVec<T: AsStd140>, and DynamicUniformVec<T: AsStd140>
Draw and DrawFunctions
Draw traits
RenderPhase
MainPassNode
TrackedRenderPass
ParamState<(Res<A>, ResMut<B>)> and ParamState::get(world) to return system param values.
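To make the *Vec abstractions above concrete, here is an illustrative sketch of what a BufferVec<T: Pod>-style type can look like, assuming wgpu and bytemuck (a sketch only, not the PoC's actual implementation):

```rust
// A growable CPU-side Vec mirrored into a GPU buffer once per frame.
struct BufferVec<T: bytemuck::Pod> {
    values: Vec<T>,
    buffer: Option<wgpu::Buffer>,
    capacity: usize,
    usage: wgpu::BufferUsages,
}

impl<T: bytemuck::Pod> BufferVec<T> {
    fn push(&mut self, value: T) -> usize {
        self.values.push(value);
        self.values.len() - 1
    }

    // (Re)allocate the GPU buffer if it is missing or too small, then upload
    // the CPU data in a single copy.
    fn write(&mut self, device: &wgpu::Device, queue: &wgpu::Queue) {
        if self.buffer.is_none() || self.values.len() > self.capacity {
            self.capacity = self.values.len().next_power_of_two().max(1);
            self.buffer = Some(device.create_buffer(&wgpu::BufferDescriptor {
                label: Some("buffer_vec"),
                size: (self.capacity * std::mem::size_of::<T>()) as u64,
                usage: self.usage | wgpu::BufferUsages::COPY_DST,
                mapped_at_creation: false,
            }));
        }
        if let Some(buffer) = &self.buffer {
            queue.write_buffer(buffer, 0, bytemuck::cast_slice(&self.values));
        }
    }

    fn clear(&mut self) {
        self.values.clear();
    }
}
```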
Removals and Tweaks
Res<Box<dyn RenderResourceContext>> -> Res<RenderResources> (derefs to &dyn RenderResourceContext). This just makes it nicer / more ergonomic to deal with render resources in userspace.
Drawing Sprites
The SpriteNode also does the final copy from staging buffers to final buffers, but I'm planning on making it easier to do this in the Prepare step without creating new nodes to run these commands.
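For readers unfamiliar with that step, a minimal sketch of a staging-to-final buffer copy in raw wgpu (illustrative only, not the SpriteNode's actual code):

```rust
// Record and submit a single buffer-to-buffer copy.
fn copy_staging_to_final(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    staging: &wgpu::Buffer,
    destination: &wgpu::Buffer,
    size_in_bytes: u64,
) {
    let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor {
        label: Some("staging_copy"),
    });
    encoder.copy_buffer_to_buffer(staging, 0, destination, 0, size_in_bytes);
    queue.submit(std::iter::once(encoder.finish()));
}
```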
The pipelined sprite code is here: https://github.com/cart/bevy/tree/pipelined-rendering/pipelined/bevy_sprite2/src
Pipelined BevyMark
Results are currently quite good relative to other options. The PoC can currently render ~89,000 sprites at 60fps on my machine, which is better than all of the other results listed at the beginning of this post. There's also plenty of room for improvement: we can try drawing everything with a single draw call, moving the mesh data into the shader, replacing Asset hashing with generational indexing, etc.
You can test this by running cargo run --example bevymark_pipelined --release
Next Steps for PoC
Next Steps for Productionizing PoC
If we decide to take the "extend bevy_render" path, this is some work we'd need to do to make it "production ready":
spawn_and_forget hack. Commands would allocate entity ids using the "render world" entity allocator.
Discussion Kickoff
declare_render_feature!(SpriteRenderFeature, SPRITE_FEATURE_INDEX), which I would very much prefer to not use in bevy, as I try to avoid macros whenever possible.