
Enable on-demand deserialization of AST decls#8095

Merged
tangent-vector merged 1 commit into shader-slang:master from tangent-vector:enable-on-demand-ast-deserialization
Aug 6, 2025

Conversation

@tangent-vector
Contributor


Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482.
That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place.
PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.
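To illustrate the shape of the switch being flipped (a sketch only; the real mechanism is a C++ `#define`, and all names below are hypothetical, not the actual compiler API), compare eager loading against on-demand loading gated by a single flag:

```python
# Hypothetical sketch of the eager-vs-on-demand switch; not Slang's real code.

ENABLE_ON_DEMAND_DESERIALIZATION = True  # the "switch" this change flips


def deserialize(blob):
    # Stand-in for the real AST-decoding step.
    return {"ast": blob}


class SerializedModule:
    def __init__(self, decl_blobs):
        self.decl_blobs = decl_blobs  # raw serialized decls, keyed by name
        self.decls = {}               # name -> already-deserialized decl

    def load(self):
        if not ENABLE_ON_DEMAND_DESERIALIZATION:
            # Old behavior: pay the full deserialization cost up front.
            for name, blob in self.decl_blobs.items():
                self.decls[name] = deserialize(blob)

    def get_decl(self, name):
        if name not in self.decls:
            # New behavior: deserialize a decl the first time it is needed.
            self.decls[name] = deserialize(self.decl_blobs[name])
        return self.decls[name]
```

With the flag on, `load()` does no decoding work, and the cost moves to the first `get_decl()` for each declaration.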

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build.
The key performance stats from a build before this change are:

```
[*]              loadBuiltinModule      1         254.29ms
[*]       checkAllTranslationUnits      1           6.14ms
```

After this change, we see:

```
[*]              loadBuiltinModule      1          91.75ms
[*]       checkAllTranslationUnits      1          11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected).
Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.
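As a quick sanity check, the quoted deltas follow directly from the two timing tables above:

```python
# Arithmetic behind the deltas quoted from the timing tables.
before_load, after_load = 254.29, 91.75
before_check, after_check = 6.14, 11.40

load_saving = before_load - after_load      # time saved in loadBuiltinModule()
check_cost = after_check - before_check     # time added to checkAllTranslationUnits()
reduction = load_saving / before_load       # fractional reduction

print(round(load_saving, 2))       # 162.54
print(round(check_cost, 2))        # 5.26
print(round(reduction * 100, 1))   # 63.9
```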

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized.
Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).
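The amortization argument can be sketched concretely (hypothetical names; a memoizing loader standing in for the real per-session cache): the decoding work runs at most once per declaration, no matter how many modules are compiled in the session.

```python
# Sketch of why the extra checking cost amortizes across a global session:
# each decl is deserialized at most once; later lookups hit the cache.

deserialize_count = 0


def make_demand_loader(blobs):
    cache = {}

    def get(name):
        global deserialize_count
        if name not in cache:
            deserialize_count += 1        # the real decoding work happens here
            cache[name] = ("decl", blobs[name])
        return cache[name]                # subsequent hits are free

    return get


get = make_demand_loader({"sin": 1, "cos": 2})
for _ in range(3):                        # compiling three modules in one session
    get("sin")
    get("cos")
print(deserialize_count)                  # 2: once per decl, not once per module
```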

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded.
After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded.
Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further.
The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR shader-slang#7935
----------------------------------

PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance.
On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*]              loadBuiltinModule      1         176.62ms
[*]       checkAllTranslationUnits      1           6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined.
In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem.
We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
@tangent-vector tangent-vector requested a review from a team as a code owner August 6, 2025 20:54
@tangent-vector tangent-vector added the pr: non-breaking PRs without breaking changes label Aug 6, 2025
Contributor

@ArielG-NV ArielG-NV left a comment


Changes look good.

@ArielG-NV
Contributor

ArielG-NV commented Aug 6, 2025

Additional note about the changes made:

Provided below are flame-graphs of the deserialization performance. Overall performance improved across various tests, although it looks like about 60% of total time is still spent deserializing capabilities (across various test programs). It may be a good idea to change how we serialize and deserialize this information (or how we store it in the AST), given that it now accounts for a large majority of the entire compile.

Debug build - running the hello-world Slang example:

(flame-graph image)

Release build - running the hello-world Slang example:

(flame-graph image)

@tangent-vector
Contributor Author

@ArielG-NV Thank you for the very detailed inspection of the performance; your analysis gives us a good candidate for where to look next when trying to identify potential performance gains.

The current serialization approach for capabilities is extremely naive, and I can see at least a few possible approaches that might help improve its performance:

  • The nature of the capability system means that many of the capability "atoms" might be implied by other, more general capabilities (e.g., D3D sm_4_0 is implied by sm_5_0). The serialization logic currently does not try to streamline the set of atoms it writes out, so a deeper interaction with the capability system might allow us to serialize only a small fraction of what we currently write.

  • In order to deal with capabilities that differ across platforms, stages, etc., each declaration stores multiple sets of capabilities, but in many cases it seems like these are identical or overlap significantly. It may be possible to deduplicate the relevant sets for a given declaration, and only (de)serialize the unique ones.

  • Similarly, it seems that many declarations will share identical capabilities to one another. It is possible that global deduplication of capabilities as part of serialization (or in the capability system itself) could greatly reduce the number of distinct sets that need to be serialized (and thus deserialized).
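The first bullet can be sketched as a pruning pass over each set before it is written out. This is purely illustrative (the implication table and function names here are invented for the example, not the capability system's real data):

```python
# Hypothetical sketch: drop atoms that are implied by a more general atom
# already present in the set, so only the "maximal" atoms get serialized.

IMPLIES = {
    "sm_5_0": {"sm_4_1", "sm_4_0"},   # invented implication table
    "sm_4_1": {"sm_4_0"},
}


def prune_implied(atoms):
    implied = set()
    for a in atoms:
        implied |= IMPLIES.get(a, set())
    return sorted(a for a in atoms if a not in implied)


print(prune_implied({"sm_5_0", "sm_4_1", "sm_4_0", "glsl"}))
# ['glsl', 'sm_5_0']
```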

@tangent-vector tangent-vector added this pull request to the merge queue Aug 6, 2025
@csyonghe
Collaborator

csyonghe commented Aug 6, 2025

Yeah, most of these capability sets are identical. If we can reference capability sets the same way as Vals, a lot of this complexity will be gone. There are challenges in treating capability sets just as Vals, but at least when it comes to storing them in the AST, we should be able to have something like a CapabilityVal.

Another way to do it is to dedup capability sets at the module level during serialization, and store a per-module list of deduped capability sets. Then each capability-set field on the individual AST nodes is just an int.
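That per-module dedup can be sketched as interning each set behind a table of unique sets (hypothetical names; the real serializer would write the table and the per-decl indices to the module blob):

```python
# Hypothetical sketch of module-level capability-set dedup: identical sets
# are written once, and each decl stores only an index into the table.

def serialize_capabilities(decl_caps):
    """decl_caps: one capability set per declaration, in decl order."""
    unique = []        # table of distinct sets, serialized once per module
    index_of = {}      # frozen set -> position in `unique`
    decl_indices = []  # the int each decl's capability field becomes
    for caps in decl_caps:
        key = frozenset(caps)
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(sorted(caps))
        decl_indices.append(index_of[key])
    return unique, decl_indices


unique, idx = serialize_capabilities(
    [{"sm_5_0"}, {"glsl", "spirv"}, {"sm_5_0"}, {"sm_5_0"}]
)
print(len(unique), idx)   # 2 [0, 1, 0, 0]
```

Four declarations collapse to two serialized sets; the savings grow with how often sets repeat across decls.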

Merged via the queue into shader-slang:master with commit b9f999b Aug 6, 2025
20 checks passed