Skip to content

A new approach to AST node deduplication#8072

Merged
tangent-vector merged 3 commits intoshader-slang:masterfrom
tangent-vector:hierarchical-ast-builder
Aug 6, 2025
Merged

A new approach to AST node deduplication#8072
tangent-vector merged 3 commits intoshader-slang:masterfrom
tangent-vector:hierarchical-ast-builder

Conversation

@tangent-vector
Copy link
Copy Markdown
Contributor

Background

The Slang compiler currently relies on the idea that AST nodes derived from Val are always deduplicated based on their opcode and operands. Deduplication requires caching, and we thus have to determine the right scope at which to allocate and cache different Vals.

We need to ensure that Vals that refer to declarations/nodes in a specific compilation session/linkage are not cached in a scope that will outlive that session/linkage (or else the cache will contain garbage pointers). Conversely, we also need to ensure that any Vals that are referred to by something like a builtin module's AST will remain alive at least as long as that module/AST, and that every compilation session/linkage that refers to that module will find that Val cached if they try to create an equivalent one.

The existing approach to deduplication has some subtleties:

  • A builtin module like the core module needs to be fully loaded (using the ASTBuilder for the global session) before any other compilation session might construct Vals referring to nodes in the AST of that module.

  • The various declarations/types/values cached on the SharedASTBuilder need to be created before any user-defined compilation occurs, to ensure that all of the relevant Vals are created on the global session's AST builder (see Session::finalizeSharedASTBuilder).

  • Related to all of the above: the deduplication cache on a non-top-level ASTBuilder is initialized by copying the cache from the top-level (global session) ASTBuilder on creation. Any subsequent allocations on the global-session ASTBuilder will not be visible to the linkage-level ASTBuilder. This led to weird things like the options parsing logic for -load-core-module reaching in and performing a manual copy of the cache from the global session to a linkage to try to restore the invariants.

  • Notably, the deduplication logic doesn't actually care if two different linkages create distinct Vals that represent the same thing, so long as nothing in the AST of a builtin module will refer to that thing. E.g., if the builtin modules never mention Ptr<String>, then two different linkages that both refer to that type will likely construct distinct Vals to represent it. Thus the deduplication is not as complete as some might assume.

The subtle ordering considerations when creating Vals related to builtin modules has proved to be a sticking point for implementing on-demand deserialization of the builtin modules (in order to improve startup times).

Overview

This change implements a new approach that hopefully makes it easier to ensure correctness, even in the case where AST nodes for builtin modules and Vals that refer to those nodes sometimes get created after other compilation has been performed. The key points of the new approach are:

  • ASTBuilders are now explicitly (rather than just implicitly) linked into a hierarchy. Currently the compiler codebase will only create a two-level hierarchy: the ASTBuilder for a global session is the parent to all of the ASTBuilders created for Linkages. The expectation is that the code can and will generalize to more levels of nesting.

  • When a request is made to create a Val (or find it in a cache), there is a single ASTBuilder that is determined to be responsible for owning that Val for its lifetime. The logic involved is comparable to the way that the IRBuilder decides where to insert a "hoistable" (deduplicated) instruction, with the simplification that we are dealing with a tree instead of a CFG.

Details

  • Session::finalizeASTBuilder() is gone, because it is no longer needed

  • The logic in the ASTBuilder constructor and in slang-options.cpp that was manually copying the m_cachedNodes from the global-session ASTBuilder over to a linkage's ASTBuilder is removed, since it is no longer needed.

  • Made every AST node (NodeBase) carry a pointer to the ASTBuilder that created it. This is wasteful, but makes it easier to be sure the implementation will work.

  • Introduced a class RootASTBuilder, derived from ASTBuilder to represent the root of a given hierarchy of AST builders.

  • Every non-root ASTBuilder is now constructed with a pointer to its parent builder.

  • Changed it so that instead of allocating a SharedASTBuilder and then passing it in to create one or more ASTBuilders, the shared AST builder state is more of an implementation detail of the ASTBuilder type, and is automatically allocated behind the scenes as part of creating an ASTBuilder.

  • The inline (defined in header) ASTBuilder::_getOrCreateImpl() now just does a first-pass check for an existing cached Val and, if it doesn't find one, delegates to ASTBuilder::_getOrCreateImplSlowPath(), which encapsulates the logic for the cache-miss case (even if that logic is just two lines).

  • The ASTBuilder::_findAppropriateASTBuilderForVal() method inspects a ValNodeDesc to determine the "deepest" of the AST builders among its operands (falling back to the root AST builder if there are no relevant operands).

  • The ASTBuilder::_getOrCreateValDirectly() is intended for use when the correct AST builder to use for caching/allocation has already been identified.

  • Moved the caching of generic arguments for "default substitutions" out of ASTBuilder and onto GenericDecl itself. Note that the naming of the old field (m_cachedGenericDefaultArgs) was unclear about the fact that this is related to "default substitutions" (where each generic parameter is fed an argument that refers to the parameter itself), and has nothing to do with any default argument values that might be set on the generic parameters.

  • Changed the global session (Session) so that instead of storing pointers to both the root/builtin ASTBuilder and the corresponding SharedASTBuilder, it just retains a pointer to the root ASTBuilder. This is consistent with the move toward making the SharedASTBuilder just an implementation detail.

Possible Future Work

  • In theory this change should unblock on-demand AST deserialization, so it would be good to get back to those changes once this new approach lands.

  • Many AST node subclasses end up storing their own ASTBuilder pointers, which are likely now redundant with the one in NodeBase. These should be eliminated where possible.

  • A lot of code for AST-related manipulations has been changed over time to require an ASTBuilder to be passed in, but at this point such a builder should always be derive-able so long as the operation has at least one non-null AST node available to it.

  • Most of the accessors that are currently on SharedASTBuilder can/should be migrated to just be on ASTBuilder, where they can use the data from the SharedASTBuilder as part of their implementation. Ideally most code should only nee to interface with ASTBuilders directly, and will never need to talk to the SharedASTBuilder.

  • This change cleaned up some of the ownership hierarchy, and made it so that the global session only retains a pointer to the root AST builder (and not the shared state as well). A logical follow-on for that would be to make the Linkage more properly own (and thus allocate) its own AST builder (and other related objects), and have the global session only store a pointer to the root linkage (and not point to the various sub-objects directly).

  • The Linkage and SourceManager types have their own form of parent/child hierarchy (restricted to two levels), and could be generalized in a way that is similar to what this change does for ASTBuilder. Support for a hierarchy of Linkages could be a powerful tool for end users, if we expose it in the right way.

Background
----------

The Slang compiler currently relies on the idea that AST nodes derived from `Val` are always deduplicated based on their opcode and operands.
Deduplication requires caching, and we thus have to determine the right scope at which to allocate and cache different `Val`s.

We need to ensure that `Val`s that refer to declarations/nodes in a specific compilation session/linkage are not cached in a scope that will outlive that session/linkage (or else the cache will contain garbage pointers).
Conversely, we also need to ensure that any `Val`s that are referred to by something like a builtin module's AST will remain alive at least as long as that module/AST, and that every compilation session/linkage that refers to that module will find that `Val` cached if they try to create an equivalent one.

The existing approach to deduplication has some subtleties:

* A builtin module like the core module needs to be fully loaded (using the `ASTBuilder` for the global session) before any other compilation session might construct `Val`s referring to nodes in the AST of that module.

* The various declarations/types/values cached on the `SharedASTBuilder` need to be created before any user-defined compilation occurs, to ensure that all of the relevant `Val`s are created on the global session's AST builder (see `Session::finalizeSharedASTBuilder`).

* Related to all of the above: the deduplication cache on a non-top-level `ASTBuilder` is initialized by copying the cache from the top-level (global session) `ASTBuilder` on creation. Any subsequent allocations on the global-session `ASTBuilder` will not be visible to the linkage-level `ASTBuilder`. This led to weird things like the *options parsing* logic for `-load-core-module` reaching in and performing a manual copy of the cache from the global session to a linkage to try to restore the invariants.

* Notably, the deduplication logic doesn't actually care if two different linkages create distinct `Val`s that represent the same thing, so long as nothing in the AST of a builtin module will refer to that thing. E.g., if the builtin modules never mention `Ptr<String>`, then two different linkages that both refer to that type will likely construct distinct `Val`s to represent it. Thus the deduplication is not as complete as some might assume.

The subtle ordering considerations when creating `Val`s related to builtin modules has proved to be a sticking point for implementing on-demand deserialization of the builtin modules (in order to improve startup times).

Overview
--------

This change implements a new approach that hopefully makes it easier to ensure correctness, even in the case where AST nodes for builtin modules and `Val`s that refer to those nodes sometimes get created after other compilation has been performed.
The key points of the new approach are:

* `ASTBuilder`s are now explicitly (rather than just implicitly) linked into a hierarchy. Currently the compiler codebase will only create a two-level hierarchy: the `ASTBuilder` for a global session is the parent to all of the `ASTBuilder`s created for `Linkage`s. The expectation is that the code can and will generalize to more levels of nesting.

* When a request is made to create a `Val` (or find it in a cache), there is a single `ASTBuilder` that is determined to be responsible for owning that `Val` for its lifetime. The logic involved is comparable to the way that the `IRBuilder` decides where to insert a "hoistable" (deduplicated) instruction, with the simplification that we are dealing with a tree instead of a CFG.

Details
-------

* `Session::finalizeASTBuilder()` is gone, because it is no longer needed

* The logic in the `ASTBuilder` constructor and in `slang-options.cpp` that was manually copying the `m_cachedNodes` from the global-session `ASTBuilder` over to a linkage's `ASTBuilder` is removed, since it is no longer needed.

* Made every AST node (`NodeBase`) carry a pointer to the `ASTBuilder` that created it. This is wasteful, but makes it easier to be sure the implementation will work.

* Introduced a class `RootASTBuilder`, derived from `ASTBuilder` to represent the root of a given hierarchy of AST builders.

* Every non-root `ASTBuilder` is now constructed with a pointer to its parent builder.

* Changed it so that instead of allocating a `SharedASTBuilder` and then passing it in to create one or more `ASTBuilder`s, the shared AST builder state is more of an implementation detail of the `ASTBuilder` type, and is automatically allocated behind the scenes as part of creating an `ASTBuilder`.

* The inline (defined in header) `ASTBuilder::_getOrCreateImpl()` now just does a first-pass check for an existing cached `Val` and, if it doesn't find one, delegates to `ASTBuilder::_getOrCreateImplSlowPath()`, which encapsulates the logic for the cache-miss case (even if that logic is just two lines).

* The `ASTBuilder::_findAppropriateASTBuilderForVal()` method inspects a `ValNodeDesc` to determine the "deepest" of the AST builders among its operands (falling back to the root AST builder if there are no relevant operands).

* The `ASTBuilder::_getOrCreateValDirectly()` is intended for use when the correct AST builder to use for caching/allocation has already been identified.

* Moved the caching of generic arguments for "default substitutions" out of `ASTBuilder` and onto `GenericDecl` itself. Note that the naming of the old field (`m_cachedGenericDefaultArgs`) was unclear about the fact that this is related to "default substitutions" (where each generic parameter is fed an argument that refers to the parameter itself), and has *nothing* to do with any default argument values that might be set on the generic parameters.

* Changed the global session (`Session`) so that instead of storing pointers to both the root/builtin `ASTBuilder` *and* the corresponding `SharedASTBuilder`, it just retains a pointer to the root `ASTBuilder`. This is consistent with the move toward making the `SharedASTBuilder` just an implementation detail.

Possible Future Work
--------------------

* In theory this change should unblock on-demand AST deserialization, so it would be good to get back to those changes once this new approach lands.

* Many AST node subclasses end up storing their own `ASTBuilder` pointers, which are likely now redundant with the one in `NodeBase`. These should be eliminated where possible.

* A lot of code for AST-related manipulations has been changed over time to require an `ASTBuilder` to be passed in, but at this point such a builder should always be derive-able so long as the operation has at least one non-null AST node available to it.

* Most of the accessors that are currently on `SharedASTBuilder` can/should be migrated to just be on `ASTBuilder`, where they can use the data from the `SharedASTBuilder` as part of their implementation. Ideally most code should only nee to interface with `ASTBuilder`s directly, and will never need to talk to the `SharedASTBuilder`.

* This change cleaned up some of the ownership hierarchy, and made it so that the global session only retains a pointer to the root AST builder (and not the shared state as well). A logical follow-on for that would be to make the `Linkage` more properly own (and thus allocate) its own AST builder (and other related objects), and have the global session only store a pointer to the root linkage (and not point to the various sub-objects directly).

* The `Linkage` and `SourceManager` types have their own form of parent/child hierarchy (restricted to two levels), and could be generalized in a way that is similar to what this change does for `ASTBuilder`. Support for a hierarchy of `Linkage`s could be a powerful tool for end users, if we expose it in the right way.
@tangent-vector tangent-vector requested a review from a team as a code owner August 5, 2025 23:59
@tangent-vector tangent-vector added the pr: non-breaking PRs without breaking changes label Aug 5, 2025
@tangent-vector
Copy link
Copy Markdown
Contributor Author

/format

@slangbot
Copy link
Copy Markdown
Contributor

slangbot commented Aug 6, 2025

🌈 Formatted, please merge the changes from this PR

Copy link
Copy Markdown
Collaborator

@csyonghe csyonghe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. This should speed up creation of new Linkage and hopefully allow us to enable on-demand deserialization as the next step.

@csyonghe csyonghe added this pull request to the merge queue Aug 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 6, 2025
@tangent-vector tangent-vector added this pull request to the merge queue Aug 6, 2025
Merged via the queue into shader-slang:master with commit 3525765 Aug 6, 2025
18 checks passed
tangent-vector added a commit to tangent-vector/slang that referenced this pull request Aug 6, 2025
Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482.
That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place.
PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build.
The key performance stats from a build before this change are:

```
[*]              loadBuiltinModule      1         254.29ms
[*]       checkAllTranslationUnits      1           6.14ms
```

After this change, we see:

```
[*]              loadBuiltinModule      1          91.75ms
[*]       checkAllTranslationUnits      1          11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected).
Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized.
Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded.
After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded.
Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further.
The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR shader-slang#7935
------------------------

PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance.
On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*]              loadBuiltinModule      1         176.62ms
[*]       checkAllTranslationUnits      1           6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined.
In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem.
We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
github-merge-queue bot pushed a commit that referenced this pull request Aug 6, 2025
Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR #7482.
That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place.
PR #8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build.
The key performance stats from a build before this change are:

```
[*]              loadBuiltinModule      1         254.29ms
[*]       checkAllTranslationUnits      1           6.14ms
```

After this change, we see:

```
[*]              loadBuiltinModule      1          91.75ms
[*]       checkAllTranslationUnits      1          11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected).
Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized.
Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded.
After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded.
Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further.
The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR #7935
------------------------

PR #7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance.
On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*]              loadBuiltinModule      1         176.62ms
[*]       checkAllTranslationUnits      1           6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR #7935 are combined.
In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem.
We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr: non-breaking PRs without breaking changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants