Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.4] Split Vertex Buffer Stream in Positions and Attributes #46574

Merged
merged 1 commit into from
Jul 20, 2021

Conversation

The-O-King
Copy link
Contributor

@The-O-King The-O-King commented Mar 1, 2021

Provide an import option for enabling vertex stream splitting - where the vertex
buffer is organized into a contiguous block of position data and a contiguous
block of interleaved attribute data

This is an optimization particularly for mobile, tile-based renderers which can
use less memory bandwidth during their binning phase

Implementation of godotengine/godot-proposals#2350

@fire
Copy link
Member

fire commented Mar 1, 2021

Is there a godot master version I can review?

@The-O-King
Copy link
Contributor Author

At the moment I only have this implemented in 3.2 since 4.0 is still undergoing work on the rendering side from what I understand

I think discussion is still ongoing as to what/if 4.0 will need something like this implemented (see the proposal linked for discussion)

@akien-mga akien-mga added this to the 3.2 milestone Mar 2, 2021
Base automatically changed from 3.2 to 3.x March 16, 2021 11:11
@aaronfranke aaronfranke modified the milestones: 3.2, 3.3, 3.4 Mar 16, 2021
@aaronfranke aaronfranke changed the title [3.2] Split Vertex Buffer Stream in Positions and Attributes, Provide UI Options [3.4] Split Vertex Buffer Stream in Positions and Attributes, Provide UI Options Mar 18, 2021
@clayjohn
Copy link
Member

clayjohn commented May 6, 2021

What do you think about making this a ProjectSetting instead of an import setting. I don't see a reason to import some meshes using the split vertex stream and some without. To me it seems like Desktop should use non-split and mobile should use split. In which case there can be a check in the rasterizer. I think making this a project setting would result in a simpler PR and would be more user-friendly.

Please let me know if there is a good reason to have this in the importer.

@lawnjelly
Copy link
Member

Also have you got any timing figures for the performance with and without the change on different hardware? It would be kind of nice to have some figures to see the difference.

@The-O-King
Copy link
Contributor Author

Ah that's a good thought and not something that I had considered before, looking at it I think that a project setting would make more sense also, I'll look into what that would take! Thank you for pointing me towards that!

For performance, I know that for mobile it helps primarily with memory bandwidth metrics (and was the primary motivation for this change, 50% improvement) rather than gpu frame times, and I did do some cursory testing on desktop (nvidia 2070 super Max-Q laptop) and did not find much in the way of performance difference, but I will try to get more concrete numbers for that as well to share

@The-O-King
Copy link
Contributor Author

Implementation question - is there any way to trigger a reimport of all mesh assets when a project setting is changed? At the moment someone could change the project setting, but all previously imported assets would still have the old mesh data format and they would have to reimport the assets manually

@Calinou
Copy link
Member

Calinou commented May 10, 2021

Implementation question - is there any way to trigger a reimport of all mesh assets when a project setting is changed? At the moment someone could change the project setting, but all previously imported assets would still have the old mesh data format and they would have to reimport the assets manually

As far as I know, there is no way to do that yet (aside of removing the .import/ folder). This is also an issue when changing the VRAM compressed texture formats in the Project Settings.

@The-O-King
Copy link
Contributor Author

Implementation question - is there any way to trigger a reimport of all mesh assets when a project setting is changed? At the moment someone could change the project setting, but all previously imported assets would still have the old mesh data format and they would have to reimport the assets manually

As far as I know, there is no way to do that yet (aside of removing the .import/ folder). This is also an issue when changing the VRAM compressed texture formats in the Project Settings.

Gotcha, thanks for letting me know! Maybe we can make that an option in the future haha :)

For now I was able to update the implementation to use project settings rather than import settings for splitting the vertex buffer streams, and will have similar functionality to the way that the VRAM texture compression settings will behave then

I also removed the commit that added the mesh compression options since it is much less relevant now to this PR, and created a new PR with that feature addition here #48625

@The-O-King The-O-King changed the title [3.4] Split Vertex Buffer Stream in Positions and Attributes, Provide UI Options [3.4] Split Vertex Buffer Stream in Positions and Attributes May 10, 2021
attribs[i].type = _GL_HALF_FLOAT_OES;
stride += attribs[i].size * 2;
uses_half_float = true;
attribs[i].type = GL_HALF_FLOAT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is a copy-paste error from GLES3. This line is causing the CI to fail because it should be _GL_HALF_FLOAT_OES

@clayjohn
Copy link
Member

Implementation looks good on a preliminary review!

Have you had any luck getting performance numbers on mobile? I would be really interested to see if there is a measurable difference. The TPS demo may be a good test project if you don't already have one.

servers/visual_server.cpp Outdated Show resolved Hide resolved
servers/visual_server.cpp Outdated Show resolved Hide resolved
servers/visual_server.cpp Outdated Show resolved Hide resolved
@The-O-King
Copy link
Contributor Author

I have figured out the clang-format issue so hopefully static checker will be happy from now on :)

Oh yes I definitely have the numbers on mobile to show this haha, I'm working on getting the screenshots right now :)

Also working on getting the numbers for desktop using nvidia nsights, will also report that as well when I get them!

@The-O-King
Copy link
Contributor Author

I also collected data for desktop and at the moment it looks like performance is basically identical - This was in a static scene with 90,000,000 verts that were offscreen, so I would say having it on desktop will at least keep the same performance

In Nvidia Nsight, the bottleneck here is in VRAM (both have it pinned at 97% SOL) and the L2 cache behavior is also basically the same (46% hit rate in this scene). There was a reduction in frame time (average 43ms to 41ms) but I would chalk that up to occasional power management/throttling

Unsplit:
2021-05-12 - unsplit - desktop (2)
Split:
2021-05-12 - split - desktop (2)

I also did a test with the same scene but an additional dynamic light/shadow (but reduced down to 60,000,000 to make it a more reasonable framerate) and I got the same exact results between split and unsplit - although when I was checking things in renderdoc - I did notice that the shadow map generating draw calls had the vertex normal enabled - and I was wondering what they were used for in this scenario? As that could be a good area for improvement on both desktop and mobile

This is also more anecdotal evidence, but I would also note from performance profiling other games on desktop (such as Apex Legends, or the default vertex buffer format in Unreal Engine) they all usually have some sort of vertex buffer stream splitting in place where positions are their own stream and attributes are distributed amongst other separate buffers, grouped by what data is most commonly used together in a shader/renderpass

@The-O-King
Copy link
Contributor Author

Ah sorry I think I misunderstood before about the static vs dynamic mesh thing - yes this is applied to static meshes only

I see definitely see where you're coming from in terms of seeing real-world improvements, but I think the TPS scene in particular isn't the best way to show the improvements one would get from something like this because it's unreasonably pixel bound haha, at least on mobile :P I think in all scenarios though it will come out with same or improved performance

We do have the data for static mesh scenes (single or not shouldn't make a difference) on mobile since that was what the first scenario I presented was testing (to be fair it wasn't a "single" static mesh but 30 copies of the same high poly single mesh, which I would argue end up with similar effect)

And for tile based renderers (which are all mobile GPUs) - rendering to a single tile vs offscreen would show the same performance improvements that the first test scenario I presented showed, just with the added pixel shader work for fragments that end up on screen, regardless of the tile - the binning process occurs for ALL vertices before shading fragments (or the rest of the vertex calculations) ever begins

For the power argument, I don't think I have a good way to measure the power improvement without running an example load/app on a device and running it until it dies haha - but we know that's it's a recommendation that both ARM[1][2] and Qualcomm make in their best practices guides
nvidia generally mentions stream splitting can help in some places for doing things like shadow/occlusion passes

@lawnjelly
Copy link
Member

If performance is within a gnat's todger with the split vertex format for static buffers on desktop, then to me that does sound a fair argument for just using split as the path for all static buffers, and not bothering with the split / non-split option. Unless there is a need for backward compatibility. What do you think @clayjohn ? 😄

@clayjohn
Copy link
Member

@lawnjelly I agree that we need to do more performance analysis before we commit to anything.

In theory this PR should result in performance benefits in some cases on mobile. And in theory it should degrade performance on desktop. The reduction in performance comes from the cache pressure resulting from reading the vertex position from one end of the vertex array while reading the other vertex attributes from another position in the array. Again, the performance consideration is only theoretical and may pan out to be negligible.

@Calinou
Copy link
Member

Calinou commented May 18, 2021

@The-O-King Can you upload the vertex-heavy test project you used for testing (not the TPS demo)? This way, we can check the performance impact on desktop platforms. Thanks in advance 🙂

@The-O-King
Copy link
Contributor Author

Oh yes sure thing - here's a google drive link to a zip of the project I've been using to test things - there's a sphere_scene_desktop.tscn which is what I used to measure the performance - currently there's a moving point light with shadows as well for when I was profiling that scenario

https://drive.google.com/file/d/1X_Dc9NJfN6p8xYf4-g64SVuOEJCxJzrW/view?usp=sharing

Let me know if this drive link works, or if there's a better/preferred way to share the project :)

@Calinou
Copy link
Member

Calinou commented May 19, 2021

Let me know if this drive link works, or if there's a better/preferred way to share the project :)

The drive link works, thanks 🙂

@lawnjelly
Copy link
Member

Nice little discussion of this on twitter today too, and on when the situation changed on this (looks like around 2015/2016?):
https://twitter.com/promit_roy/status/1394847846759911427

@lawnjelly
Copy link
Member

lawnjelly commented May 20, 2021

Just been trying the PR on my desktop (Linux Mint 20, Intel HD Graphics 630) and seeing if there's any difference.

With this quick contrived test I can get 129fps without splitting and 126fps with splitting, but I doubt there is much in it in practice. My test was procedurally generating a mesh with 1000000 triangles (3000000 unique verts) with uv, uv2, color, normal and tangent, and very small screen size and identical small triangle size. I don't even know if it uses that many verts or limits to 65535 in fact .. it could do in GLES3 though.

Be interesting to try on more hardware these kinds of tests, especially mobile.
SplitVertBufferTest.zip

@The-O-King
Copy link
Contributor Author

I got a chance today to run your test project (really elegant btw I love it haha) and on my laptop which has a RTX 2070 MAX-Q I got identical performance when using split vs unsplit vertex buffers (avg 708.66fps)

I was able to also capture some traces on my Pixel 4 and while average framerate was identical (avg 13.66fps) I did see a drop in vertex memory bandwidth in a frame (this benefit coming directly from bandwidth improvements during the binning phase I mentioned earlier) where for unsplit I saw peak and average bandwidth of 13.28GB/s and 3.7GB/s respectively, while on the split buffer test I saw peak and average bandwidth of 4.9GB/s and 2.8GB/s respectively.
The improvement seen in the average stems from the improvement during the binning phase of a frame, otherwise during the rendering phase of a frame we see that the vertex bandwidth is pretty much the same (avg 2GB/s) whether split or unsplit (which is to be expected)

I've included links for you to download the Android GPU Inspector trace files I've taken in case you were interested at looking at the data yourself! You just need to have Android GPU Inspector downloaded to take a peek :)

vertex buffer split
vertex buffer unsplit

@lawnjelly
Copy link
Member

I was able to also capture some traces on my Pixel 4 and while average framerate was identical (avg 13.66fps)

That's quite interesting in itself. I deliberately used identical position triangles and scaled small so that hopefully they would end up in the same tile. The idea being that this would stress the tile based renderer the most as it didn't get the opportunity to spread the geometry over several tiles. But perhaps there is something we are missing and we could run a better test - I'm no expert on tiled renderers. I'm sure a hardware guy could tell us more.

@The-O-King
Copy link
Contributor Author

Hello! Just wanted to bump this and see what the status is right now? Sorry it's been a while I got pulled into other work that's taken up my time

So what are the current concerns about this change? Are we still unsure about it's benefits? My current understanding is that on mobile we see the memory bandwidth reductions we were expecting (and implicitly that is battery life improvement as well) and on desktop performance was largely the same (at least on the hardware we've tested so far)

@The-O-King
Copy link
Contributor Author

@clayjohn hello! Just wanted to bump this again, and see if there was anything I could do to help :) Thanks!

@akien-mga
Copy link
Member

There's a couple new project settings added by this PR which should be included in the docs, see the diff here: https://github.com/godotengine/godot/pull/46574/checks?check_run_id=2559453240

You can generate the template with godot --doctool with a build of your PR, and then ideally fill the descriptions.

@akien-mga akien-mga requested a review from clayjohn July 14, 2021 16:08
Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, subject to addition of docs for the project settings.

@The-O-King The-O-King requested a review from a team as a code owner July 16, 2021 18:13
@The-O-King
Copy link
Contributor Author

Update the documentation with doctool so hopefully things are ready after this!

Thanks for taking a look everyone!

@akien-mga
Copy link
Member

Could you rebase to squash the commits? See PR workflow for instructions.

@The-O-King
Copy link
Contributor Author

Oh thank you for letting me know about that I hadn't seen that before, was able to squash everything, let me know if everything is now in order

@@ -1187,6 +1187,9 @@
<member name="rendering/lossless_compression/webp_compression_level" type="int" setter="" getter="" default="2">
The default compression level for lossless WebP. Higher levels result in smaller files at the cost of compression speed. Decompression speed is mostly unaffected by the compression level. Supported values are 0 to 9. Note that compression levels above 6 are very slow and offer very little savings.
</member>
<member name="rendering/mesh_storage/split_stream" type="bool" setter="" getter="" default="false">
On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enable if targeting mobile devices. Requires manual reimport of meshes after toggling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enable if targeting mobile devices. Requires manual reimport of meshes after toggling
On import, mesh vertex data will be split into two streams within a single vertex buffer, one for position data and the other for interleaved attributes data. Recommended to be enabled if targeting mobile devices. Requires manual reimport of meshes after toggling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the word here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You missed it in my diff but there's also a dot that should be added at the end of the sentence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh shoot you're so right sorry about that I hope that takes care of it now!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might have missed pushing that last amend?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right I noticed in the logs that I was getting unresolved host errors in my repo for some reason

Implemented splitting of vertex positions and attributes in the vertex
buffer

Positions are sequential at the start of the buffer, followed by the
additional attributes which are interleaved

Made a project setting which enables/disabled the buffer formatting
throughout the project

Implemented in both GLES2 and GLES3

This improves performance particularly on tile-based GPUs as well as
cache performance for something like shadow mapping which only needs
position data

Updated Docs and Project Setting
@akien-mga akien-mga merged commit f131a77 into godotengine:3.x Jul 20, 2021
@akien-mga
Copy link
Member

Thanks! And congrats for your first merged Godot contribution 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants