A new approach to AST node deduplication#8072
Merged
tangent-vector merged 3 commits intoshader-slang:masterfrom Aug 6, 2025
Merged
A new approach to AST node deduplication#8072tangent-vector merged 3 commits intoshader-slang:masterfrom
tangent-vector merged 3 commits intoshader-slang:masterfrom
Conversation
Background ---------- The Slang compiler currently relies on the idea that AST nodes derived from `Val` are always deduplicated based on their opcode and operands. Deduplication requires caching, and we thus have to determine the right scope at which to allocate and cache different `Val`s. We need to ensure that `Val`s that refer to declarations/nodes in a specific compilation session/linkage are not cached in a scope that will outlive that session/linkage (or else the cache will contain garbage pointers). Conversely, we also need to ensure that any `Val`s that are referred to by something like a builtin module's AST will remain alive at least as long as that module/AST, and that every compilation session/linkage that refers to that module will find that `Val` cached if they try to create an equivalent one. The existing approach to deduplication has some subtleties: * A builtin module like the core module needs to be fully loaded (using the `ASTBuilder` for the global session) before any other compilation session might construct `Val`s referring to nodes in the AST of that module. * The various declarations/types/values cached on the `SharedASTBuilder` need to be created before any user-defined compilation occurs, to ensure that all of the relevant `Val`s are created on the global session's AST builder (see `Session::finalizeSharedASTBuilder`). * Related to all of the above: the deduplication cache on a non-top-level `ASTBuilder` is initialized by copying the cache from the top-level (global session) `ASTBuilder` on creation. Any subsequent allocations on the global-session `ASTBuilder` will not be visible to the linkage-level `ASTBuilder`. This led to weird things like the *options parsing* logic for `-load-core-module` reaching in and performing a manual copy of the cache from the global session to a linkage to try to restore the invariants. * Notably, the deduplication logic doesn't actually care if two different linkages create distinct `Val`s that represent the same thing, so long as nothing in the AST of a builtin module will refer to that thing. E.g., if the builtin modules never mention `Ptr<String>`, then two different linkages that both refer to that type will likely construct distinct `Val`s to represent it. Thus the deduplication is not as complete as some might assume. The subtle ordering considerations when creating `Val`s related to builtin modules has proved to be a sticking point for implementing on-demand deserialization of the builtin modules (in order to improve startup times). Overview -------- This change implements a new approach that hopefully makes it easier to ensure correctness, even in the case where AST nodes for builtin modules and `Val`s that refer to those nodes sometimes get created after other compilation has been performed. The key points of the new approach are: * `ASTBuilder`s are now explicitly (rather than just implicitly) linked into a hierarchy. Currently the compiler codebase will only create a two-level hierarchy: the `ASTBuilder` for a global session is the parent to all of the `ASTBuilder`s created for `Linkage`s. The expectation is that the code can and will generalize to more levels of nesting. * When a request is made to create a `Val` (or find it in a cache), there is a single `ASTBuilder` that is determined to be responsible for owning that `Val` for its lifetime. The logic involved is comparable to the way that the `IRBuilder` decides where to insert a "hoistable" (deduplicated) instruction, with the simplification that we are dealing with a tree instead of a CFG. Details ------- * `Session::finalizeASTBuilder()` is gone, because it is no longer needed * The logic in the `ASTBuilder` constructor and in `slang-options.cpp` that was manually copying the `m_cachedNodes` from the global-session `ASTBuilder` over to a linkage's `ASTBuilder` is removed, since it is no longer needed. * Made every AST node (`NodeBase`) carry a pointer to the `ASTBuilder` that created it. This is wasteful, but makes it easier to be sure the implementation will work. * Introduced a class `RootASTBuilder`, derived from `ASTBuilder` to represent the root of a given hierarchy of AST builders. * Every non-root `ASTBuilder` is now constructed with a pointer to its parent builder. * Changed it so that instead of allocating a `SharedASTBuilder` and then passing it in to create one or more `ASTBuilder`s, the shared AST builder state is more of an implementation detail of the `ASTBuilder` type, and is automatically allocated behind the scenes as part of creating an `ASTBuilder`. * The inline (defined in header) `ASTBuilder::_getOrCreateImpl()` now just does a first-pass check for an existing cached `Val` and, if it doesn't find one, delegates to `ASTBuilder::_getOrCreateImplSlowPath()`, which encapsulates the logic for the cache-miss case (even if that logic is just two lines). * The `ASTBuilder::_findAppropriateASTBuilderForVal()` method inspects a `ValNodeDesc` to determine the "deepest" of the AST builders among its operands (falling back to the root AST builder if there are no relevant operands). * The `ASTBuilder::_getOrCreateValDirectly()` is intended for use when the correct AST builder to use for caching/allocation has already been identified. * Moved the caching of generic arguments for "default substitutions" out of `ASTBuilder` and onto `GenericDecl` itself. Note that the naming of the old field (`m_cachedGenericDefaultArgs`) was unclear about the fact that this is related to "default substitutions" (where each generic parameter is fed an argument that refers to the parameter itself), and has *nothing* to do with any default argument values that might be set on the generic parameters. * Changed the global session (`Session`) so that instead of storing pointers to both the root/builtin `ASTBuilder` *and* the corresponding `SharedASTBuilder`, it just retains a pointer to the root `ASTBuilder`. This is consistent with the move toward making the `SharedASTBuilder` just an implementation detail. Possible Future Work -------------------- * In theory this change should unblock on-demand AST deserialization, so it would be good to get back to those changes once this new approach lands. * Many AST node subclasses end up storing their own `ASTBuilder` pointers, which are likely now redundant with the one in `NodeBase`. These should be eliminated where possible. * A lot of code for AST-related manipulations has been changed over time to require an `ASTBuilder` to be passed in, but at this point such a builder should always be derive-able so long as the operation has at least one non-null AST node available to it. * Most of the accessors that are currently on `SharedASTBuilder` can/should be migrated to just be on `ASTBuilder`, where they can use the data from the `SharedASTBuilder` as part of their implementation. Ideally most code should only nee to interface with `ASTBuilder`s directly, and will never need to talk to the `SharedASTBuilder`. * This change cleaned up some of the ownership hierarchy, and made it so that the global session only retains a pointer to the root AST builder (and not the shared state as well). A logical follow-on for that would be to make the `Linkage` more properly own (and thus allocate) its own AST builder (and other related objects), and have the global session only store a pointer to the root linkage (and not point to the various sub-objects directly). * The `Linkage` and `SourceManager` types have their own form of parent/child hierarchy (restricted to two levels), and could be generalized in a way that is similar to what this change does for `ASTBuilder`. Support for a hierarchy of `Linkage`s could be a powerful tool for end users, if we expose it in the right way.
Contributor
Author
|
/format |
Contributor
|
🌈 Formatted, please merge the changes from this PR |
…lder Format code for PR shader-slang#8072
csyonghe
approved these changes
Aug 6, 2025
Collaborator
csyonghe
left a comment
There was a problem hiding this comment.
Looks good to me. This should speed up creation of new Linkage and hopefully allow us to enable on-demand deserialization as the next step.
tangent-vector
added a commit
to tangent-vector/slang
that referenced
this pull request
Aug 6, 2025
Overview -------- This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482. That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place. PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature. Impact ------ Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build. The key performance stats from a build before this change are: ``` [*] loadBuiltinModule 1 254.29ms [*] checkAllTranslationUnits 1 6.14ms ``` After this change, we see: ``` [*] loadBuiltinModule 1 91.75ms [*] checkAllTranslationUnits 1 11.40ms ``` This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected). Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain. For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized. Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used). Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded. After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded. Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further. The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further. Relationship to PR shader-slang#7935 ------------------------ PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance. On a branch with those optimizations in place (but not this change), the corresponding timings are: ``` [*] loadBuiltinModule 1 176.62ms [*] checkAllTranslationUnits 1 6.04ms ``` It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined. In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem. We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
github-merge-queue bot
pushed a commit
that referenced
this pull request
Aug 6, 2025
Overview -------- This change basically just flips a `#define` switch to enable the changes that were already checked in with PR #7482. That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place. PR #8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature. Impact ------ Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build. The key performance stats from a build before this change are: ``` [*] loadBuiltinModule 1 254.29ms [*] checkAllTranslationUnits 1 6.14ms ``` After this change, we see: ``` [*] loadBuiltinModule 1 91.75ms [*] checkAllTranslationUnits 1 11.40ms ``` This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected). Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain. For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized. Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used). Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded. After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded. Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further. The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further. Relationship to PR #7935 ------------------------ PR #7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance. On a branch with those optimizations in place (but not this change), the corresponding timings are: ``` [*] loadBuiltinModule 1 176.62ms [*] checkAllTranslationUnits 1 6.04ms ``` It remains to be seen how performance fares when this change and the optimizations in PR #7935 are combined. In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem. We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
The Slang compiler currently relies on the idea that AST nodes derived from
Valare always deduplicated based on their opcode and operands. Deduplication requires caching, and we thus have to determine the right scope at which to allocate and cache differentVals.We need to ensure that
Vals that refer to declarations/nodes in a specific compilation session/linkage are not cached in a scope that will outlive that session/linkage (or else the cache will contain garbage pointers). Conversely, we also need to ensure that anyVals that are referred to by something like a builtin module's AST will remain alive at least as long as that module/AST, and that every compilation session/linkage that refers to that module will find thatValcached if they try to create an equivalent one.The existing approach to deduplication has some subtleties:
A builtin module like the core module needs to be fully loaded (using the
ASTBuilderfor the global session) before any other compilation session might constructVals referring to nodes in the AST of that module.The various declarations/types/values cached on the
SharedASTBuilderneed to be created before any user-defined compilation occurs, to ensure that all of the relevantVals are created on the global session's AST builder (seeSession::finalizeSharedASTBuilder).Related to all of the above: the deduplication cache on a non-top-level
ASTBuilderis initialized by copying the cache from the top-level (global session)ASTBuilderon creation. Any subsequent allocations on the global-sessionASTBuilderwill not be visible to the linkage-levelASTBuilder. This led to weird things like the options parsing logic for-load-core-modulereaching in and performing a manual copy of the cache from the global session to a linkage to try to restore the invariants.Notably, the deduplication logic doesn't actually care if two different linkages create distinct
Vals that represent the same thing, so long as nothing in the AST of a builtin module will refer to that thing. E.g., if the builtin modules never mentionPtr<String>, then two different linkages that both refer to that type will likely construct distinctVals to represent it. Thus the deduplication is not as complete as some might assume.The subtle ordering considerations when creating
Vals related to builtin modules has proved to be a sticking point for implementing on-demand deserialization of the builtin modules (in order to improve startup times).Overview
This change implements a new approach that hopefully makes it easier to ensure correctness, even in the case where AST nodes for builtin modules and
Vals that refer to those nodes sometimes get created after other compilation has been performed. The key points of the new approach are:ASTBuilders are now explicitly (rather than just implicitly) linked into a hierarchy. Currently the compiler codebase will only create a two-level hierarchy: theASTBuilderfor a global session is the parent to all of theASTBuilders created forLinkages. The expectation is that the code can and will generalize to more levels of nesting.When a request is made to create a
Val(or find it in a cache), there is a singleASTBuilderthat is determined to be responsible for owning thatValfor its lifetime. The logic involved is comparable to the way that theIRBuilderdecides where to insert a "hoistable" (deduplicated) instruction, with the simplification that we are dealing with a tree instead of a CFG.Details
Session::finalizeASTBuilder()is gone, because it is no longer neededThe logic in the
ASTBuilderconstructor and inslang-options.cppthat was manually copying them_cachedNodesfrom the global-sessionASTBuilderover to a linkage'sASTBuilderis removed, since it is no longer needed.Made every AST node (
NodeBase) carry a pointer to theASTBuilderthat created it. This is wasteful, but makes it easier to be sure the implementation will work.Introduced a class
RootASTBuilder, derived fromASTBuilderto represent the root of a given hierarchy of AST builders.Every non-root
ASTBuilderis now constructed with a pointer to its parent builder.Changed it so that instead of allocating a
SharedASTBuilderand then passing it in to create one or moreASTBuilders, the shared AST builder state is more of an implementation detail of theASTBuildertype, and is automatically allocated behind the scenes as part of creating anASTBuilder.The inline (defined in header)
ASTBuilder::_getOrCreateImpl()now just does a first-pass check for an existing cachedValand, if it doesn't find one, delegates toASTBuilder::_getOrCreateImplSlowPath(), which encapsulates the logic for the cache-miss case (even if that logic is just two lines).The
ASTBuilder::_findAppropriateASTBuilderForVal()method inspects aValNodeDescto determine the "deepest" of the AST builders among its operands (falling back to the root AST builder if there are no relevant operands).The
ASTBuilder::_getOrCreateValDirectly()is intended for use when the correct AST builder to use for caching/allocation has already been identified.Moved the caching of generic arguments for "default substitutions" out of
ASTBuilderand ontoGenericDeclitself. Note that the naming of the old field (m_cachedGenericDefaultArgs) was unclear about the fact that this is related to "default substitutions" (where each generic parameter is fed an argument that refers to the parameter itself), and has nothing to do with any default argument values that might be set on the generic parameters.Changed the global session (
Session) so that instead of storing pointers to both the root/builtinASTBuilderand the correspondingSharedASTBuilder, it just retains a pointer to the rootASTBuilder. This is consistent with the move toward making theSharedASTBuilderjust an implementation detail.Possible Future Work
In theory this change should unblock on-demand AST deserialization, so it would be good to get back to those changes once this new approach lands.
Many AST node subclasses end up storing their own
ASTBuilderpointers, which are likely now redundant with the one inNodeBase. These should be eliminated where possible.A lot of code for AST-related manipulations has been changed over time to require an
ASTBuilderto be passed in, but at this point such a builder should always be derive-able so long as the operation has at least one non-null AST node available to it.Most of the accessors that are currently on
SharedASTBuildercan/should be migrated to just be onASTBuilder, where they can use the data from theSharedASTBuilderas part of their implementation. Ideally most code should only nee to interface withASTBuilders directly, and will never need to talk to theSharedASTBuilder.This change cleaned up some of the ownership hierarchy, and made it so that the global session only retains a pointer to the root AST builder (and not the shared state as well). A logical follow-on for that would be to make the
Linkagemore properly own (and thus allocate) its own AST builder (and other related objects), and have the global session only store a pointer to the root linkage (and not point to the various sub-objects directly).The
LinkageandSourceManagertypes have their own form of parent/child hierarchy (restricted to two levels), and could be generalized in a way that is similar to what this change does forASTBuilder. Support for a hierarchy ofLinkages could be a powerful tool for end users, if we expose it in the right way.