Enable on-demand deserialization of AST decls (#8095)
Conversation
Overview
--------

This change basically just flips a `#define` switch to enable the changes that were already checked in with PR shader-slang#7482. That earlier change added the infrastructure required to do on-demand deserialization, but it couldn't be enabled at the time due to problematic interactions with the approach to AST node deduplication that was in place. PR shader-slang#8072 introduced a new approach to AST node deduplication that eliminates the problematic interaction, and thus unblocks this feature.

Impact
------

Let's look at some anecdotal performance numbers, collected on my dev box using a `hello-world.exe` from a Release x64 Windows build. The key performance stats from a build before this change are:

```
[*] loadBuiltinModule           1   254.29ms
[*] checkAllTranslationUnits    1     6.14ms
```

After this change, we see:

```
[*] loadBuiltinModule           1    91.75ms
[*] checkAllTranslationUnits    1    11.40ms
```

This change reduces the time spent in `loadBuiltinModule()` by just over 162ms, and increases the time spent in `checkAllTranslationUnits()` by about 5.25ms (the time spent in other compilation steps seems to be unaffected). Because `loadBuiltinModule()` is the most expensive step for trivial one-and-done compiles like this, reducing its execution time by over 60% is a big gain.

For this example, the time spent in `checkAllTranslationUnits()` has almost doubled, due to operations that force AST declarations from the core module to be deserialized. Note, however, that in cases where multiple modules are compiled using the same global session, that extra work should eventually amortize out, because each declaration from the core module can only be demand-loaded once (after which the in-memory version will be used).

Because of some unrelated design choices in the compiler, loading of the core module causes approximately 17% of its top-level declarations to be demand-loaded. After compiling the code for the `hello-world` example, approximately 20% of the top-level declarations have been demand-loaded. Further work could be done to reduce the number of core-module declarations that must always be deserialized, potentially reducing the time spent in `loadBuiltinModule()` further. The data above also implies that `loadBuiltinModule()` may include large fixed overheads, which should also be scrutinized further.

Relationship to PR shader-slang#7935
------------------------------------

PR shader-slang#7935, which at this time hasn't yet been merged, implements several optimizations to overall deserialization performance. On a branch with those optimizations in place (but not this change), the corresponding timings are:

```
[*] loadBuiltinModule           1   176.62ms
[*] checkAllTranslationUnits    1     6.04ms
```

It remains to be seen how performance fares when this change and the optimizations in PR shader-slang#7935 are combined. In principle, the two approaches are orthogonal, each attacking a different aspect of the performance problem. We thus expect the combination of the two to be better than either alone but, of course, testing will be required.
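As a rough illustration of why that extra cost should amortize, here is a minimal sketch of the usual on-demand pattern (hypothetical names; this is not Slang's actual implementation): each serialized declaration is wrapped in a lazy handle that deserializes on first access and caches the result, so any given declaration pays the deserialization cost at most once per session:

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for a deserialized AST declaration.
struct Decl
{
    std::string name;
};

// A lazy handle over a serialized declaration. The serialized form is
// modeled here as a factory callback; a real implementation would hold
// an offset into the serialized module data instead.
class OnDemandDecl
{
public:
    explicit OnDemandDecl(std::function<std::unique_ptr<Decl>()> deserialize)
        : m_deserialize(std::move(deserialize))
    {
    }

    // The first call pays the deserialization cost; later calls
    // return the cached in-memory declaration.
    Decl* get()
    {
        if (!m_decl)
        {
            m_decl = m_deserialize();
            ++m_loadCount;
        }
        return m_decl.get();
    }

    int loadCount() const { return m_loadCount; }

private:
    std::function<std::unique_ptr<Decl>()> m_deserialize;
    std::unique_ptr<Decl> m_decl;
    int m_loadCount = 0;
};
```

Under this model, compiling a second module against the same global session finds core-module declarations already materialized, which is why the `checkAllTranslationUnits()` overhead should shrink in longer-running sessions.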
@ArielG-NV Thank you for the very detailed inspection of the performance; your analysis gives us a good candidate for where to look next when trying to identify potential performance gains. The current serialization approach for capabilities is extremely naive, and I can see at least a few possible approaches that might help improve its performance.
Yeah, most of these capability sets are identical. If we can reference capability sets the same way as … Another way to do it is to dedup capability sets at the module level during serialization, and store a per-module list of deduped capability sets. Then each capability-set field on the individual AST nodes is just an int.

