Acyclic Command Graph for Rendering Device #84976

Merged on Jan 9, 2024 (1 commit)

DarioSamo
Contributor

@DarioSamo DarioSamo commented Nov 16, 2023

Background

@reduz proposed the idea a while ago of refactoring RenderingDevice to automatically build a graph out of the commands submitted to the class (outlined here). The first stepping stone towards implementing this was @RandomShaper's PR (#83452) that splits the commands into easily serializable parameters. Therefore, merging this PR will require merging that PR first (and if you wish to review this, only look at my individual commit after Pedro's changes).

Improvements

This PR makes the following improvements towards implementing @reduz's idea.

  • RenderingDevice's complexity has been drastically reduced as it no longer needs to solve pipeline barriers or layout transitions for pretty much all its functionality. This responsibility has been delegated to a new class called RenderingDeviceGraph.
  • The total number of vkCmdPipelineBarrier calls has dropped sharply. On average I've noticed a reduction of about 60-80% in barrier calls per frame compared to master.
  • RenderingDeviceGraph can reorder commands based on the dependencies between the resources they use, so that each submitted command is processed as early as possible. This gives the driver much better chances of parallelizing the work effectively.
  • RenderingDeviceGraph groups as many barriers as possible into 'levels' depending on the usage of the resources. The barriers of a level are submitted before its commands are processed, performing any layout transitions or synchronization that the level requires.
  • RenderingDevice's API has been simplified as some parameters are no longer required.
    • Barrier bitmasks are gone. They no longer serve any purpose.
    • Draw list and compute list overlapping no longer needs to be specified.
    • 'Split draw lists' are gone as they can be automatically recorded by the graph instead, which has already been shown to be viable (although disabled behind an experimental macro for now).
    • Draw lists no longer need to specify their initial and final actions in excruciating detail. The operations are much simpler now. Load, clear or discard for initial. Store and discard for final. The detail behind the original action no longer serves any purpose, as the graph will automatically skip any transitions that are not required if commands that use the same image layout are chained together (e.g. render passes).
    • Draw lists no longer need to specify storage textures as it's not required at all.
  • Both Forward+ and Mobile have been adapted to the new API, resulting in a net removal of code that is no longer needed because the graph now automatically solves what that code was doing.
  • A lot of existing Vulkan synchronization errors caused by the current barrier system have been solved automatically by the graph.
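
To illustrate the 'levels' idea described above, here is a minimal sketch, not the actual RenderingDeviceGraph code. It assumes each command only declares which resource IDs it reads and writes; `Command`, `ResourceState` and `assign_levels` are hypothetical names invented for this example. The real graph also has to track usages, image layouts and slices, which are left out here.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified command: only the resource IDs it reads and writes.
struct Command {
    std::vector<uint32_t> reads;
    std::vector<uint32_t> writes;
};

// Hypothetical per-resource tracking state.
struct ResourceState {
    int last_write_level = -1; // Level of the most recent writer, -1 if none.
    int last_read_level = -1;  // Highest level of any reader since that write.
};

// Assigns each command (in submission order) the earliest level at which all
// of its resource dependencies are satisfied. Commands that share a level do
// not depend on each other, so one batch of barriers before the level suffices.
std::vector<int> assign_levels(const std::vector<Command> &commands) {
    std::unordered_map<uint32_t, ResourceState> states;
    std::vector<int> levels;
    levels.reserve(commands.size());

    for (const Command &cmd : commands) {
        int level = 0;

        // Read-after-write: a reader must come after the last writer.
        for (uint32_t id : cmd.reads) {
            auto it = states.find(id);
            if (it != states.end()) {
                level = std::max(level, it->second.last_write_level + 1);
            }
        }

        // Write-after-write and write-after-read: a writer must come after
        // the last writer and after every reader of the previous contents.
        for (uint32_t id : cmd.writes) {
            auto it = states.find(id);
            if (it != states.end()) {
                level = std::max(level, it->second.last_write_level + 1);
                level = std::max(level, it->second.last_read_level + 1);
            }
        }

        // Update the tracking state with this command's accesses.
        for (uint32_t id : cmd.reads) {
            ResourceState &state = states[id];
            state.last_read_level = std::max(state.last_read_level, level);
        }
        for (uint32_t id : cmd.writes) {
            ResourceState &state = states[id];
            state.last_write_level = level;
            state.last_read_level = -1;
        }

        levels.push_back(level);
    }
    return levels;
}
```

In this sketch, two commands that write unrelated resources both land on level 0, and a command that reads both of them lands on level 1, so a single group of barriers between the two levels covers all of them.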

Implementation

Since commands need to be deferred until later for reordering, everything submitted to RD is serialized into a large uint8 vector that grows as much as necessary. The capacity of all these vectors is reused across recordings to avoid reallocation as much as possible. The recorded commands are the bare minimum the RenderingDeviceDriver needs to execute the command after reordering is performed.
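
As a rough illustration of this recording scheme (a sketch under assumptions, not the PR's actual data layout), commands can be appended to a byte vector as a small header followed by a trivially copyable payload. `CommandRecorder`, `CommandHeader` and the two payload structs below are hypothetical names made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical command types and payloads, for illustration only.
enum class CommandType : uint32_t { BufferCopy, Dispatch };

struct CommandHeader {
    CommandType type;
    uint32_t payload_size;
};

struct BufferCopyCommand {
    uint64_t src;
    uint64_t dst;
    uint32_t byte_count;
};

struct DispatchCommand {
    uint32_t x, y, z;
};

class CommandRecorder {
    std::vector<uint8_t> data;   // Serialized headers and payloads.
    std::vector<size_t> offsets; // Start offset of each recorded command.

public:
    // Appends a header plus payload to the byte vector.
    template <typename T>
    void record(CommandType type, const T &payload) {
        static_assert(std::is_trivially_copyable<T>::value, "payload must be trivially copyable");
        CommandHeader header{ type, static_cast<uint32_t>(sizeof(T)) };
        size_t offset = data.size();
        data.resize(offset + sizeof(header) + sizeof(T));
        std::memcpy(data.data() + offset, &header, sizeof(header));
        std::memcpy(data.data() + offset + sizeof(header), &payload, sizeof(T));
        offsets.push_back(offset);
    }

    size_t count() const { return offsets.size(); }

    // memcpy is used for reads too, so unaligned offsets are not a problem.
    CommandHeader header_at(size_t index) const {
        CommandHeader header;
        std::memcpy(&header, data.data() + offsets[index], sizeof(header));
        return header;
    }

    template <typename T>
    T payload_at(size_t index) const {
        T payload;
        std::memcpy(&payload, data.data() + offsets[index] + sizeof(CommandHeader), sizeof(T));
        return payload;
    }

    // clear() drops the contents but keeps the allocated capacity, so the
    // next recording can reuse the same memory without reallocating.
    void reset() {
        data.clear();
        offsets.clear();
    }

    size_t capacity_bytes() const { return data.capacity(); }
};
```

Because the commands are addressed through an offset table, a later pass can visit them in any order (for example after sorting by level) without moving the serialized bytes.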

As expected, this PR adds CPU overhead that was not there before, due to the additional time spent recording and reordering commands. The TODO list includes one very obvious pending optimization, but the rest of the extra work won't be as easy to overcome. This extra cost can also be offset tremendously by using secondary command buffers, which will be enabled in the PR as soon as the issue described in the TODO is figured out.

Compatibility breakage

  • As expected, RenderingDevice's binary compatibility is not guaranteed, because many arguments have been removed from its functions. I've provided compatibility wrappers as required, although I'm not yet sure they work as intended.
  • If the compatibility wrappers work as intended, there should be no need to change the behavior of code that depends on RD: most of the time, the additional detail that was provided to the functions is simply ignored, as the graph can solve it on its own.
  • In some cases compatibility may break because the graph performs additional validation of whether some operations are allowed. The validation that RenderingDeviceGraph performs includes:
    • Checking whether the same resource is used with different usages in the same command. This was found to be the case already with a couple of effects that used the indirect buffer as both dispatch and storage, leading to UB.
    • Checking if multiple slices of the same resource in different usages are used in the same command and have overlap. This is not allowed as the layout transitions are impossible to solve effectively and can lead to race conditions in the GPU. Luckily, such a case was not found in the existing rendering code as far as I could find, but if some code runs into this it means it has to be fixed on the user side and not on the graph.
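
The two checks above can be sketched as a single overlap test. This is a hedged illustration rather than the graph's real validation code: it assumes every use of a resource within a command can be described as a contiguous slice range plus a usage, with a whole-resource use (such as a buffer) covering the full range. `Usage`, `ResourceUse` and `validate_command_uses` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical usage kinds, for illustration only.
enum class Usage { Sampled, Storage, IndirectArgs, ColorAttachment };

// One use of a resource by a command. A buffer, or a texture used as a
// whole, is modelled here as a range covering all of its slices.
struct ResourceUse {
    uint32_t resource_id;
    uint32_t base_slice;
    uint32_t slice_count;
    Usage usage;
};

// Half-open range overlap test: [base, base + count).
bool slices_overlap(const ResourceUse &a, const ResourceUse &b) {
    return a.base_slice < b.base_slice + b.slice_count &&
            b.base_slice < a.base_slice + a.slice_count;
}

// Returns false if the same resource appears in one command with two
// different usages on overlapping slices. Different usages on disjoint
// slices, or the same usage on overlapping slices, are accepted.
bool validate_command_uses(const std::vector<ResourceUse> &uses) {
    for (size_t i = 0; i < uses.size(); i++) {
        for (size_t j = i + 1; j < uses.size(); j++) {
            const ResourceUse &a = uses[i];
            const ResourceUse &b = uses[j];
            if (a.resource_id == b.resource_id && a.usage != b.usage && slices_overlap(a, b)) {
                return false;
            }
        }
    }
    return true;
}
```

Under these assumptions, the indirect buffer case mentioned above (one buffer bound as both dispatch arguments and storage in the same command) is rejected, while sampling mips 0-1 of a texture and writing mips 2-3 of the same texture is accepted.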

Performance improvements

GPU performance improvements are expected across the board as long as the CPU overhead isn't slowing the game down (which should go down with the future immutable change). The performance improvements will also vary depending on how much the particular IHV was suffering from inefficient barrier usage. One area that will particularly benefit is projects using GPU particles, as their processing will be much more parallelized than it was before.

At least on an NVIDIA 3090 Ti, I've noticed an overall frame time improvement of around 10% in several projects, with potentially bigger wins on platforms like AMD that can parallelize effectively on a single queue, or on mobile hardware that does not handle barriers as gracefully as NVIDIA does.

Future improvements

These will be researched after this PR is merged as a second iteration.

  • Dedicated transfer queue for resources that use the setup command buffer that can run in parallel and synchronize when it's time to process the drawing commands.
  • Support for multiple graphics and compute queues that will split the work of the graph to support parallelization on hardware that can take advantage of it more effectively (like NVIDIA).

TODO

  • Debug broken uniform set in TPS demo (will be fixed in Pedro's PR soon).
  • Debug strange memory usage increase far beyond what should be expected.
  • Update documentation to match the new API.
  • Fix the C# glue error that is not getting generated properly for some reason due to the RD API change.
  • Double check if MSAA is working on Forward+ and Mobile.
  • Attempt new mutable/immutable tracker design that does not require explicit flags from the engine. This requires working out a way to refresh the trackers used by vertex, index and uniform set arrays once those resources turn into mutables. All dependencies must be made mutable.
  • Debug strange issue in NVIDIA where the editor will show up completely black when using secondary command buffers depending on the contents of the draw list. This currently blocks secondary command buffers from being enabled. I've been unable to determine the root of the issue so far. (Postponed until we get feedback from NVIDIA)

Production edit: closes godotengine/godot-roadmap#29

@Calinou Calinou added this to the 4.x milestone Nov 16, 2023
@DarioSamo DarioSamo force-pushed the rd_common_render_graph branch 8 times, most recently from 3e9b18a to 7a3b4e6 Compare November 28, 2023 13:14
@DarioSamo DarioSamo force-pushed the rd_common_render_graph branch 2 times, most recently from 3bd08dd to 7973cd0 Compare December 4, 2023 14:17
@DarioSamo DarioSamo marked this pull request as ready for review December 4, 2023 16:45
@DarioSamo DarioSamo requested review from a team as code owners December 4, 2023 16:45
@DarioSamo
Contributor Author

Opening this for review. We won't take any steps towards merging this until the items listed in the TODO are done and the PR this is based on is merged, but I expect it to take a while to effectively review all these changes.

@DarioSamo DarioSamo changed the title Acyclic Command Graph for Rendering Device [Prototype] Acyclic Command Graph for Rendering Device Dec 4, 2023
@BastiaanOlij
Contributor

Some initial testing: after fixing an issue in @RandomShaper's PR, this is working on both desktop VR (tested with a Valve Index headset) and on Quest (tested on Quest 3). I have more testing to do.

This was done with the mobile renderer. There weren't any obvious performance improvements, but I wasn't expecting any: the mobile renderer already has a minimum of passes, so there isn't much for the graph to optimise.

I did notice when testing on the Quest 3 that MSAA broke. As far as I can tell, it looks like it's resolving MSAA before it has finished rendering, so there is an issue in the barriers. I did not test this with just @RandomShaper's PR, so I'm not 100% sure whether this is introduced by the acyclic graph or whether we're missing something in the new barrier code.

@DarioSamo
Contributor Author

In a weird twist of fate, it seems enabling buffer barriers also fixes the issue while retaining the ability to both reorder the graph and not have to rely on full barriers to synchronize on the AMD Radeon RX Vega M.

Since the main suspect is the compute skinning right now, it might be a good idea to try to exaggerate the issue on the project by creating multiple skinned characters so if any race condition exists, it'll be more likely to show up.

@DarioSamo
Contributor Author

The conclusions after talking with @akien-mga seem to be so far:

  1. The issue does not happen if buffer barriers are used instead, no matter how much the system is pushed by duplicating as many animated characters as possible. It reappears instantly once buffer barriers are disabled again.
  2. The issue does not happen in the Windows 10 artifact from the PR on the same hardware with regular memory barriers instead.

We should probably discuss how to approach this, as it could be a hint of something being wrong in the driver/system combination itself. Reviewing the RenderDoc capture has not revealed anything apparent nor does the validation or synchronization layer show any errors about it. It might be possible to build a standalone sample using Vulkan that replicates the issue if we want to dedicate time to that.

Alternatively, we can enable buffer barriers by default at the cost of some performance and trusting that IHVs implement them correctly (or at least basically translate them to global barriers internally).

Member

@clayjohn clayjohn left a comment

Dario and I discussed this on chat. The issue that Remi is facing appears to go away when using buffer barriers, when using full barriers, or when not reordering commands. We suspect that it is caused by a driver bug as it only appears to reproduce on a specific combination of hardware.

For most drivers, buffer barriers are ignored/promoted to memory barriers. In theory there should be some overhead from adding them, but testing has shown that the overhead is minimal.

Accordingly, our plan is as follows:

  1. Enable buffer barriers by default
  2. If @akien confirms that this new update works fine on his hardware, merge this before Dev2
  3. Create an MRP for AMD/MESA and submit a bug report.
  4. Remove the buffer barriers once the problematic driver is fixed, or once we find a better workaround

@DarioSamo
Contributor Author

DarioSamo commented Jan 6, 2024

Well, this is a bad discovery to make at the last minute, but it turns out that at some point my Vulkan validation setup misconfigured itself and turned off synchronization checking. With it back on, I now get some of the synchronization errors that @akien-mga was reporting (not the error that was reported on the scene, however; the visuals themselves are still fine). I realized this when I went to test another project and wondered why I was not getting synchronization errors in a more obvious scenario.

I'd suggest not merging this until these synchronization errors are addressed, as there are quite a lot more than I thought, due to the validation layer turning itself off at some point during development.

EDIT: Upon further testing, I can confirm that most of the errors are gone when forcing the full barrier access bits, so the rendering graph logic itself seems fine. It just needs some further tweaking for correctness and an analysis of what's missing in these cases.

@DarioSamo
Contributor Author

I was able to solve most of the synchronization errors. One of the fixes will probably remain temporary until a more proper solution is figured out, but it's not a pressing case, as it involves an edge case with slice transitions (mostly due to how reflection probes behave).

There's another synchronization error in the TPS project, but it seems actually unrelated to the graph and it has more to do with the texture upload in particular of that project. It's worth checking if that error shows up as well in master at the moment or if #86855 might be related.

@Ansraer
Contributor

Ansraer commented Jan 7, 2024

Oh thank god. I am building a PR on top of the RenderGraph and couldn't figure out why the layers were screaming at me when I hadn't even launched my new compute shader yet.

@DarioSamo
Contributor Author

Oh thank god. I am building a PR on top of the RenderGraph and couldn't figure out why the layers were screaming at me when I hadn't even launched my new compute shader yet.

Were you using only validation or synchronization? I never saw errors with regular validation so far, but don't hesitate to report anything that might've been missed.

@akien-mga
Member

I retested the latest version of this PR (d7ea8b7). I confirm that:

  • With buffer barriers (current PR), the skinning glitch I reproduce on Mesa radv is no longer present.
  • If I disable buffer barriers, the glitch comes back.

@DarioSamo
Contributor Author

@akien-mga tested a standalone Vulkan sample that I created, but we were unable to reproduce the glitch he's getting when using only memory barriers instead of buffer barriers. It seems it'll be much harder to trace what exactly is failing here and which part of the operations is corrupting it.

Adds a new system to automatically reorder commands, perform layout transitions and insert synchronization barriers based on the commands issued to RenderingDevice.
Member

@clayjohn clayjohn left a comment

Most recent version looks good. I tested locally with the synchronization layer enabled and can confirm that the errors present in the last version are now gone.

At this point I am comfortable saying that this is ready for merging before Dev2.

@akien-mga akien-mga merged commit e9695d9 into godotengine:master Jan 9, 2024
15 checks passed
@akien-mga
Member

Thanks and congrats, this is an amazing change! 🎉 🥇
