Enable on-demand deserialization of AST decls (#8095)
Conversation
Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482. That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place. PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build. The key performance stats from a build before this change are:

```
[*] loadBuiltinModule           1   254.29ms
[*] checkAllTranslationUnits    1     6.14ms
```

After this change, we see:

```
[*] loadBuiltinModule           1    91.75ms
[*] checkAllTranslationUnits    1    11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected). Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized. Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded. After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded. Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further. The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR shader-slang#7935
------------------------------------

PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance. On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*] loadBuiltinModule           1   176.62ms
[*] checkAllTranslationUnits    1     6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined. In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem. We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
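As a rough illustration of why that extra cost should amortize, here is a minimal sketch of the usual on-demand pattern (hypothetical names; this is not Slang's actual implementation): each serialized declaration is wrapped in a lazy handle that deserializes on first access and caches the result, so any given declaration pays the deserialization cost at most once per session:

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for a deserialized AST declaration.
struct Decl
{
    std::string name;
};

// A lazy handle over a serialized declaration. The serialized form is
// modeled here as a factory callback; a real implementation would hold
// an offset into the serialized module data instead.
class OnDemandDecl
{
public:
    explicit OnDemandDecl(std::function<std::unique_ptr<Decl>()> deserialize)
        : m_deserialize(std::move(deserialize))
    {
    }

    // The first call pays the deserialization cost; later calls
    // return the cached in-memory declaration.
    Decl* get()
    {
        if (!m_decl)
        {
            m_decl = m_deserialize();
            ++m_loadCount;
        }
        return m_decl.get();
    }

    int loadCount() const { return m_loadCount; }

private:
    std::function<std::unique_ptr<Decl>()> m_deserialize;
    std::unique_ptr<Decl> m_decl;
    int m_loadCount = 0;
};
```

Under this model, compiling a second module against the same global session finds core-module declarations already materialized, which is why the `checkAllTranslationUnits()` overhead should shrink in longer-running sessions.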
@ArielG-NV Thank you for the very detailed inspection of the performance; your analysis gives us a good candidate for where to look next when trying to identify potential performance gains. The current serialization approach for capabilities is extremely naive, and I can see at least a few possible approaches that might help improve its performance.
Yeah, most of these capability sets are identical. If we can reference capability sets the same way as … Another way to do it is to dedup capability sets at the module level during serialization, and store a per-module list of deduped capability sets. Then each capability-set field on the individual AST nodes is just an int.

