Children/Descendants: improve traversal performance (closes #22646)#22742
Conversation
…ithout materialising.
|
Claude finished @AndyButland's task in 8m 53s —— View job PR ReviewTarget: Six targeted, non-breaking performance fixes for
Important
Suggestions
Approved with Suggestions for improvementThis is a well-reasoned, well-tested set of targeted optimisations with strong benchmark evidence. The Labels applied: |
There was a problem hiding this comment.
Pull request overview
This PR targets the reported v17 HybridCache regression for IPublishedContent.Children() / Descendants() by removing avoidable eager work and adding caching/fast-paths in the published-cache traversal pipeline, with accompanying unit tests and benchmarks.
Changes:
- Removes redundant sorting/materialization in traversal-related helpers to enable true short-circuit enumeration.
- Adds lazy, thread-safe property wrapper creation in
PublishedContent, cached ordered-children inNavigationNode, and per-snapshot descendants caching in navigation services. - Introduces an L0 synchronous fast path for
GetById(Guid)via newTryGetCachedcache-service APIs, plus new tests and BenchmarkDotNet benchmarks.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/Umbraco.Tests.UnitTests/Umbraco.PublishedCache.HybridCache/PublishedContentTests.cs | Adds unit tests for lazy, thread-safe property initialization behavior in PublishedContent. |
| tests/Umbraco.Tests.UnitTests/Umbraco.PublishedCache.HybridCache/DocumentCacheSyncFastPathTests.cs | Verifies the new sync L0 fast path avoids the async fallback on cache hits. |
| tests/Umbraco.Tests.UnitTests/Umbraco.Core/Services/PublishStatus/PublishedMediaStatusFilteringServiceTests.cs | Adds tests asserting FilterAvailable is lazy and short-circuits cache lookups. |
| tests/Umbraco.Tests.UnitTests/Umbraco.Core/Services/PublishStatus/PublishedContentStatusFilteringServiceTests.cs | Adds laziness/short-circuit tests for content filtering; removes unnecessary partial. |
| tests/Umbraco.Tests.UnitTests/Umbraco.Core/Services/Navigation/ContentNavigationDescendantsCacheTests.cs | Adds tests covering per-snapshot descendants caching and invalidation on mutations/concurrency. |
| tests/Umbraco.Tests.UnitTests/Umbraco.Core/Models/Navigation/NavigationNodeTests.cs | Adds tests for ordered-children caching behavior and invalidation/thread-safety. |
| tests/Umbraco.Tests.Benchmarks/HybridCacheNavigationBenchmarks.cs | Introduces benchmarks for Children/Descendants traversal patterns and short-circuit scenarios. |
| tests/Umbraco.Tests.Benchmarks/Fixtures/SyntheticPublishedTreeFixture.cs | Provides an in-memory synthetic tree + HybridCache wiring for benchmarks. |
| src/Umbraco.PublishedCache.HybridCache/Umbraco.PublishedCache.HybridCache.csproj | Grants InternalsVisibleTo to the benchmarks project. |
| src/Umbraco.PublishedCache.HybridCache/Services/MediaCacheService.cs | Implements IMediaCacheService.TryGetCached for L0 sync hits. |
| src/Umbraco.PublishedCache.HybridCache/Services/DocumentCacheService.cs | Implements IDocumentCacheService.TryGetCached for L0 sync hits (non-preview). |
| src/Umbraco.PublishedCache.HybridCache/PublishedContent.cs | Switches property wrapper construction to lazy, thread-safe initialization. |
| src/Umbraco.PublishedCache.HybridCache/MediaCache.cs | Uses TryGetCached to avoid sync-over-async on warm L0 hits. |
| src/Umbraco.PublishedCache.HybridCache/DocumentCache.cs | Uses TryGetCached to avoid sync-over-async on warm L0 hits. |
| src/Umbraco.Core/Services/PublishStatus/PublishedMediaStatusFilteringService.cs | Makes FilterAvailable lazy (removes eager .ToArray()). |
| src/Umbraco.Core/Services/PublishStatus/PublishedContentStatusFilteringService.cs | Makes FilterAvailable lazy (removes eager .ToArray()). |
| src/Umbraco.Core/Services/Navigation/ContentNavigationServiceBase.cs | Adds per-snapshot descendants caching + generation-guard and invalidation on mutations; uses cached ordered-children. |
| src/Umbraco.Core/PublishedCache/IMediaCacheService.cs | Adds default-implementation TryGetCached API for sync L0 fast path. |
| src/Umbraco.Core/PublishedCache/IDocumentCacheService.cs | Adds default-implementation TryGetCached API for sync L0 fast path. |
| src/Umbraco.Core/Models/Navigation/NavigationNode.cs | Adds lazy cached ordered-children snapshot with explicit invalidation. |
| src/Umbraco.Core/Extensions/PublishedContentExtensions.cs | Removes redundant OrderBy(SortOrder) in children retrieval. |
|
Let's make sure that we are fully covered by tests, we should avoid any potential mistakes in this one. |
Migaroez
left a comment
There was a problem hiding this comment.
Nice PR Andy, this will surely help make sites with a lot of descendants a noticably faster.
I had a few concerns but they have all been soothed by the extra tests/benchmark you can find at https://github.com/umbraco/Umbraco-CMS/tree/v17/feedback/22742 or c65e7a5
Feel free to incorperate them.
For the benchmark I wanted to know if there was a different impact when there were a lot of document types vs only a few.
Benchmark results — PR (with cache) vs. base (v17/dev)
Both runs on the same machine, same BDN config (InProcessEmitToolchain, 5 iter × 500 ms, 3 warmup).
| Method | Types | Base time | PR time | Δ time | Base alloc | PR alloc |
|---|---|---|---|---|---|---|
| FullSweep | 5 | 385.1 µs | 1.4 µs | −99.6 % | 533.6 KB | 0 B |
| FullSweep | 25 | 8,664.5 µs | 7.5 µs | −99.91 % | 11,509.8 KB | 0 B |
| FullSweep | 50 | 36,838.0 µs | 14.7 µs | −99.96 % | 46,621.4 KB | 0 B |
| RepeatedUnfiltered | 5 | 654.6 µs | 2.3 µs | −99.6 % | 1,371.9 KB | 0 B |
| RepeatedUnfiltered | 25 | 3,153.0 µs | 2.2 µs | −99.93 % | 6,113.3 KB | 0 B |
| RepeatedUnfiltered | 50 | 6,594.1 µs | 2.2 µs | −99.97 % | 12,436.0 KB | 0 B |
…etrieval-performance
|
Thanks @Migaroez. I'll pull in the unit tests certainly. The benchmark I think I'll leave. It's useful to see these results as part of the PR, but it likely isn't something we need permanently in the code base as they aren't tests we run in CI to prevent against regressions, and so may just be unnecessary maintenance. And we already have the main one from this PR. |
…22742) * Add benchmark test for measuring improvements to children and descendant retrieval. * Remove unnecessary sort from retrieval of children. * Return the result of the filtered collection of children/decendants without materialising. * Lazily build property wrappers when materializing IPublishedContent. * Cache the ordered children list on NavigationNode. * Cache descendants per parent on the navigation snapshot. * Add synchronous fast path for retrieved of cached content. * Additional unit tests. * Add TODO to make UpdateSortOrder internal. * Addressed code review feedback. * Further unit tests. * Future-proofed code comments.
…22742) * Add benchmark test for measuring improvements to children and descendant retrieval. * Remove unnecessary sort from retrieval of children. * Return the result of the filtered collection of children/decendants without materialising. * Lazily build property wrappers when materializing IPublishedContent. * Cache the ordered children list on NavigationNode. * Cache descendants per parent on the navigation snapshot. * Add synchronous fast path for retrieved of cached content. * Additional unit tests. * Add TODO to make UpdateSortOrder internal. * Addressed code review feedback. * Further unit tests. * Future-proofed code comments.
|
Cherry picked into |
Description
Issue #22646 reports that
IPublishedContent.Children()/Descendants()traversal is significantly slower in v17's HybridCache than in v13's NuCache, with rendering of recursive sitemap-style templates taking longer than we would see for the equivalent in 13.This PR implements six targeted, non-breaking fixes against the v17 published-cache pipeline.
1. Remove redundant
OrderBy(SortOrder)inGetChildrensrc/Umbraco.Core/Extensions/PublishedContentExtensions.csTryGetChildrenKeysalready returns child keys ordered bySortOrder. The trailing.OrderBy(x => x.SortOrder)in the children-retrieval extension was therefore sorting an already-sorted sequence and forcing materialisation of the lazy chain. To reduce this overhead, I have dropped the call.2. Lazy
FilterAvailablesrc/Umbraco.Core/Services/PublishStatus/PublishedContentStatusFilteringService.cs,PublishedMediaStatusFilteringService.csFilterAvailablepreviously called.ToArray()on its result, materialising every item intoIPublishedContentbefore the consumer's iterator could yield. With the eager materialisation removed and a true lazyIEnumerablereturned, short-circuit consumers (FirstOrDefault,Any,Take(n)) can now exit at the first match instead of paying the full subtree cost.3. Lazy property wrappers in
PublishedContentsrc/Umbraco.PublishedCache.HybridCache/PublishedContent.csThe constructor was eagerly allocating an
IPublishedProperty[ContentType.PropertyTypes.Count]and constructing one wrapper per property type, even when the caller never reads any properties (for example withChildren().Count()or navigation-only walks). This could lead to many wrapper objects per render that may not be needed. Here I have replaced with a lazyEnsureProperties()called when properties are first accessed.4. Cached ordered children on
NavigationNodesrc/Umbraco.Core/Models/Navigation/NavigationNode.cs,src/Umbraco.Core/Services/Navigation/ContentNavigationServiceBase.csNavigationNode._childrenis aConcurrentHashSet<Guid>(unordered), soGetOrderedChildrenwas rebuilding a sorted list on every read. Added a lazy_orderedChildrenarray onNavigationNode, built on first read and invalidated byAddChild/RemoveChildautomatically.UpdateSortOrderrequires the caller to invalidate the parent's cache; all internal callers inContentNavigationServiceBasedo this correctly, and aTODO (V19)is in place to makeUpdateSortOrderinternal in the next major.5. Cached descendants per navigation snapshot
src/Umbraco.Core/Services/Navigation/ContentNavigationServiceBase.csAdded a
DescendantsCache(ConcurrentDictionary<(Guid Parent, Guid? ContentType), Guid[]>) on the immutableNavigationSnapshot. RepeatedDescendants()calls on the same parent — which a single Razor page could be doing — now hit a cache instead of re-walking the navigation tree.6. Synchronous fast path for
GetByIdsrc/Umbraco.Core/PublishedCache/IDocumentCacheService.cs,IMediaCacheService.cs,src/Umbraco.PublishedCache.HybridCache/Services/DocumentCacheService.cs,MediaCacheService.cs,src/Umbraco.PublishedCache.HybridCache/DocumentCache.cs,MediaCache.csIPublishedContentCache.GetByIdwas sync-over-async (GetByIdAsync(...).GetAwaiter().GetResult()) on every key — including the dominant case where the L0 - the converted-to-IPublishedContent-cache we have introduced - already has the entry. This forced an async state-machine setup per call. AddedTryGetCached(Guid key, bool preview, out IPublishedContent? content)to the cache-service interfaces with a default implementation that returnsfalse, plus concrete overrides on the HybridCache services. SyncGetById, which is called from synchronous rendering calls such as.Children(), now takes the fast path on hits and falls through to the existing async path only on L0 misses.Benchmark Results
A
BenchmarkDotNet-based suite has been added undertests/Umbraco.Tests.Benchmarks/HybridCacheNavigationBenchmarks.cs, exercisingChildren().Count(),Children()+ property reads,Descendants().Count(),Descendants()+ property reads,Descendants().FirstOrDefault(), and a recursiveChildren().Any()+ recurse pattern (the issue-reporter's snippet) against a synthetic 5,051-node tree.Headline cumulative deltas:
.FirstOrDefault()/.Any()/.Take(n)onDescendants()Children().Any()+ recurse patternDescendants().Count()(full enumeration)Children().Count()Full details of the benchmarking work, showing the benefits achieved from each update
Setup
Hardware: BenchmarkDotNet v0.15.8, Windows 11, Intel Core i5-10300H @ 2.50 GHz, .NET 10.0.5, x64 RyuJIT.
Toolchain:
InProcessEmitToolchain,IterationCount=10,IterationTime=500ms,LaunchCount=1,WarmupCount=5.Tree shape:
BranchCount=50,LeafCount=100(≈ 5,051 nodes).Comparison Testing
Before
After
Test 1 - Original v 2.1 (drop redundant
OrderBy(SortOrder)inGetChildren)Only the benchmarks that exercise
GetChildrenare reported below.Descendants_Count,Descendants_ReadOneProperty, andDescendants_FirstOrDefaultgo throughEnumerateDescendantsOrSelfInternal, which never had the redundantOrderByremoved in §2.1, so they can't be moved by this change and are omitted.RecursiveTraversalis included because the recursive snippet callsx.Children()twice per node and therefore does flow throughGetChildren.Before
Branch:
v17/improvement/children-and-descendent-retrieval-benchmark. Baseline; no production changes.After
Branch:
v17/improvement/children-and-descendent-retrieval-performance. Single change: dropped.OrderBy(x => x.SortOrder)fromGetChildreninsrc/Umbraco.Core/Extensions/PublishedContentExtensions.cs.INavigationQueryService.TryGetChildrenKeysalready returns keys ordered bySortOrderandFilterAvailablepreserves enumeration order, so the trailing sort was sorting an already-sorted sequence.Deltas
StdDev was 1 %–7 % of the mean across these benchmarks. 99.9 % CI half-widths (BenchmarkDotNet's
Errorcolumn): Children_Count ±6.2 % / ±9.8 %; Children_ReadOneProperty ±5.6 % / ±4.2 %; RecursiveTraversal ±2.1 % / ±8.9 %. Every delta in the table sits inside the noise band of at least one of its A/B half-widths.Summary and Conclusions
This comparison gives no clear time signal. All three benchmarks land within the noise band of one or both runs, and the directions are mixed (Children_Count slightly slower, Children_ReadOneProperty slightly faster, RecursiveTraversal slightly slower). Allocation deltas are all under 5 % and within plausible MemoryDiagnoser variance.
Verdict. §2.1 is correct (the keys are sorted upstream,
FilterAvailablepreserves order) and strictly cheaper at the source-code level (one fewer LINQ buffer/sort perChildren()call), but the wall-clock and allocation impact is below the measurement precision we can achieve at this benchmark scale on this hardware. The mixed-direction deltas tell us the change is too small to register reliably here, not that it's harmful — the LINQOrderByover 50 already-sorted items costs on the order of microseconds and a kilobyte, and those costs are dwarfed by the per-node factory work and navigation walk that dominate every benchmark we have. Keep the change for code-hygiene reasons; it removes a strictly-redundant operation, but there is no performance fix on its own.Test 2 - Original v 2.2 (remove eager materialization on
FilterAvailable). Cumulative (i.e. with 2.1 as well).§2.2 changes
PublishedContentStatusFilteringService.FilterAvailable(and the media equivalent) to return a lazyIEnumerable<IPublishedContent>instead of materialising the full result with.ToArray(). BecauseFilterAvailableis called from bothGetChildren(the Children path) andEnumerateDescendantsOrSelfInternal(the Descendants path), all six benchmarks are potentially affected and all are reported below.Before
Branch:
v17/improvement/children-and-descendent-retrieval-benchmark. Original baseline; no §2.1 or §2.2 applied.After
Branch:
v17/improvement/children-and-descendent-retrieval-performancewith §2.1 + §2.2 applied (both committed).FilterAvailablein bothPublishedContentStatusFilteringServiceandPublishedMediaStatusFilteringServicenow returns a lazy enumeration rather than calling.ToArray()on the result, and the redundant.OrderBy(SortOrder)has been removed fromGetChildren.Deltas
StdDev was 2 %–8 % of the mean across these benchmarks. 99.9 % CI half-widths (BenchmarkDotNet's
Errorcolumn): Children_Count ±11.5 % / ±7.3 %; Children_ReadOneProperty ±9.1 % / ±7.4 %; Descendants_Count ±5.6 % / ±3.4 %; Descendants_ReadOneProperty ±4.8 % / ±7.0 %; Descendants_FirstOrDefault ±5.4 % / ±12.6 %; RecursiveTraversal ±6.1 % / ±7.2 %.Summary and Conclusions
This is the clean original-vs-cumulative comparison. Two large wins, three noise-band results, and one unexpected regression worth flagging:
Descendants_FirstOrDefaultdropped −94.2 % time, −86.8 % allocation (13,730 → 791 μs; 6,227 → 823 KB). This is the headline §2.2 win: the eager.ToArray()inFilterAvailablepreviously materialised every descendant before the foreach yielded the first, so an "ask for the first descendant" call paid the full subtree cost. With the lazy return, FirstOrDefault short-circuits after the first yield and only the navigation tree-walk remains. CI half-widths are well inside the delta. Real, dramatic signal.RecursiveTraversaldropped −46.4 % time, −45.7 % allocation (29,834 → 15,999 μs; 12,941 → 7,025 KB). This is the user-style snippet from the bug report —x.Children().Any()followed by recursion intox.Children(). With the cumulative §2.1 + §2.2 in place,.Any()short-circuits after the first child of every parent, and the redundantOrderByno longer forces materialisation back into eager mode. Test 1 confirmed §2.1 alone has no measurable effect, and §2.2 alone is masked by the eagerOrderByon this path — the two changes are synergistic and only the cumulative pair produces this win. Real, large signal.Children_Count,Children_ReadOneProperty,Descendants_Countall sit within ±5 % and CI half-widths cover the deltas. These benchmarks fully enumerate the result, so neither theOrderByremoval nor the laziness inFilterAvailablecan save them work — the outcome is correctly neutral.Descendants_ReadOnePropertyregressed +11.0 % (19,957 → 22,159 μs). A confirmation re-run reproduced the regression at 23.22 ms (with StdDev 12 % — much wider variance than the Before's 5 %), so the regression is real. The mechanism is the cost of laziness itself: pre-§2.2,FilterAvailablereturned a materialisedIPublishedContent[]and the consumer's foreach iterated a single array (a tight, JIT-friendly loop); post-§2.2 the same foreach walks four extra LINQ iterator levels (Where → Select → WhereNotNull → Where) per item, adding a few hundred ns of dispatch overhead × 5,051 items ≈ 1-2 ms. OnDescendants_Countthis overhead is offset by skipping the array materialisation (net −4 %), but on the property-read benchmark per-item work is heavier so the lazy dispatch dominates. The trade-off is favourable on balance — a 1-2 ms regression on full-enumeration vs a 12 ms saving on short-circuit consumers — and §2.3 (lazy property wrapping) should claw most of it back, so leave as-is and revisit after §2.3.Verdict. §2.2 — combined with §2.1 — produces the largest performance improvement in the plan to date:
.FirstOrDefault()/.Any()/.Take(n)onDescendants()Children().Any()+ recurse patternBoth are exactly the consumer patterns that Issue #22646 called out. The "minutes for thousands of nodes" symptom should be substantially mitigated for any traversal that doesn't actually need the full enumeration. Full-enumeration paths (
.Count(),.ToList(),foreach-the-whole-thing) are unchanged, which is the correct outcome — laziness can only help when the consumer doesn't drain the sequence.Test 3 - Original v 2.3 (lazy property wrapping in
PublishedContent). Cumulative (§2.1 + §2.2 + §2.3).§2.3 defers
PublishedContent'sIPublishedProperty[]construction to first access viaEnsureProperties(), guarded byInterlocked.CompareExchangefor thread safety. This avoids allocating onePublishedPropertywrapper per property type for traversal-only operations. The change should also claw back the ~10 % regression we saw onDescendants_ReadOnePropertyafter §2.2.Before
Branch:
v17/improvement/children-and-descendent-retrieval-benchmark. Original baseline; no §2.1, §2.2, or §2.3 applied.After
Branch:
v17/improvement/children-and-descendent-retrieval-performancewith §2.1 + §2.2 + §2.3 applied (all committed).Deltas
StdDev was 1 %–7 % of the mean across these benchmarks. 99.9 % CI half-widths (BenchmarkDotNet's
Errorcolumn): Children_Count ±3.5 % / ±9.8 %; Children_ReadOneProperty ±3.9 % / ±2.3 %; Descendants_Count ±5.4 % / ±5.9 %; Descendants_ReadOneProperty ±2.0 % / ±4.9 %; Descendants_FirstOrDefault ±4.5 % / ±5.9 %; RecursiveTraversal ±6.6 % / ±6.6 %.Summary and Conclusions
Descendants_FirstOrDefaultdropped −94.9 % time, −90.8 % allocation (13,421 → 679 μs; 8,957 → 823 KB). The headline §2.2 win, confirmed against the original baseline. Slightly larger than Test 2's −94.2 % / −86.8 % — consistent measurement.RecursiveTraversaldropped −47.6 % time, −45.7 % allocation (28,080 → 14,725 μs; 12,941 → 7,025 KB). The user-style snippet from the bug report. Stable result vs Test 2's −46.4 % / −24.1 % (the allocation drop is bigger here because the §2.2-only baseline used in Test 2's "Before" had partial laziness already in working tree; this clean baseline shows the full delta).The
Descendants_ReadOnePropertyregression was clawed back by §2.3. Test 2 showed +11 % time on this benchmark (and a re-run hit +12 %); §2.2's lazy iterator dispatch was visible above the per-item property-read work. With §2.3 in place the per-item materialisation cost drops (no eager property array allocation perPublishedContent), and the regression collapses to +2.9 % — sitting inside the After CI half-width (±4.9 %), so effectively noise-band. Net new behaviour: this benchmark is unchanged from baseline.Descendants_Countis unchanged (−0.3 %, well inside CI). Correct: full enumeration with no property reads, neither §2.2's laziness nor §2.3's lazy property wrapping can help or hurt.Children_CountandChildren_ReadOnePropertyshow small regressions (+6.2 % and +6.0 %).Children_Count's delta sits inside its wide After CI (±9.8 %) so is most plausibly noise — its Before measurement was tighter than usual (StdDev 2.3 %) and the After widened (StdDev 6.5 %).Children_ReadOnePropertyis statistically real (delta outside both CI half-widths) but small in absolute terms (~10 μs over 50 children = ~0.2 μs per child) and has no clear causal mechanism —EnsureProperties()adds a single null-check on the hot path, which shouldn't account for 0.2 μs per call. Probably second-order effects (JIT inlining behaviour around the lazy dispatch chain) rather than a real cost worth pursuing. Re-running these specific benchmarks would help triangulate, but at this magnitude on absolute timings under 200 μs the result is at the edge of what we can measure reliably.Verdict. All three changes (§2.1 + §2.2 + §2.3) are now in place, and the cumulative benchmark picture is:
.FirstOrDefault()/.Any()/.Take(n)onDescendants()Children().Any()+ recurse pattern* The Children-shaped regressions are small in absolute terms (≈ 10 μs per call) and partially within noise. They've not been pinned to a clear mechanism and are not user-visible at any meaningful scale — Children() over 50 immediate children at 130 μs vs 124 μs is well inside any consumer's perception threshold.
The bug-report symptoms (#22646) are addressed: any traversal that short-circuits or that follows the recursive pattern now runs in tens of percent of the original time/allocation. Full-enumeration patterns are unchanged — the trade-off §2.2 introduced was successfully neutralised by §2.3.
Test 4 - Original v 2.4 (cache ordered children on
NavigationNode). Cumulative (§2.1 + §2.2 + §2.3 + §2.4).§2.4 caches the sort-order-presorted child key array on each
NavigationNode.TryGetChildrenKeyspreviously rebuilt aList<(Guid, int)>and sorted bySortOrderon every call; now the sortedGuid[]is built once and reused across calls, invalidated only whenAddChild/RemoveChildmutate the children set orInvalidateOrderedChildrenis called externally (e.g. fromContentNavigationServiceBase.UpdateSortOrderafter a child'sSortOrderchanges). The recursiveGetDescendantsRecursivelywalk uses the sameGetOrderedChildrenhelper, so descendants paths benefit too.Before
Branch:
v17/improvement/children-and-descendent-retrieval-benchmark. Original baseline; no fixes applied.After
Branch:
v17/improvement/children-and-descendent-retrieval-performancewith §2.1 + §2.2 + §2.3 + §2.4 applied (all committed).Deltas
StdDev was 2 %–6 % of the mean across these benchmarks. 99.9 % CI half-widths (BenchmarkDotNet's
Errorcolumn): Children_Count ±8.4 % / ±8.1 %; Children_ReadOneProperty ±6.5 % / ±6.9 %; Descendants_Count ±7.0 % / ±6.9 %; Descendants_ReadOneProperty ±8.2 % / ±9.1 %; Descendants_FirstOrDefault ±8.1 % / ±9.0 %; RecursiveTraversal ±2.8 % / ±4.2 %.Summary and Conclusions
The headline result of §2.4 is that every benchmark allocates less now, in addition to the time wins from §2.2/§2.3 holding up. The cached
Guid[]is built once during the GlobalSetup pre-warm and reused on every benchmark iteration, so no benchmark pays the per-call list-allocate-and-sort overhead that was previously happening insideGetOrderedChildrenfor everyChildren()and recursiveDescendants()step.Descendants_FirstOrDefaultdropped −97.2 % time, −94.6 % allocation (13,386 → 378 μs; 6,227 → 339 KB). Largest delta we've measured. Beyond the §2.2 short-circuit, the Descendants path's recursive navigation walk now uses cached children at every level, so even the "navigation walk + first lookup" cost dropped.RecursiveTraversaldropped −49.5 % time, −33.3 % allocation (26,924 → 13,587 μs; 12,944 → 8,630 KB). The recursiveChildren().Any()+ recurse pattern. Time saving is consistent with Test 3's −47.6 %; allocation also dropped meaningfully because the recursive path no longer re-allocates a list-and-sort per parent visited.Children_CountandChildren_ReadOnePropertyallocations dropped 9–10 % with time deltas inside noise (−4.8 % and −1.2 %, both well inside their CI half-widths of ±8 %). The time impact of §2.4 on this small workload is below our measurement floor, but the deterministic allocation drop is real and a clean win.Descendants_CountandDescendants_ReadOnePropertyshow small allocation drops (−8.4 %, −6.7 %) with time deltas inside noise onDescendants_Count(+2.2 %) and at the edge of noise onDescendants_ReadOneProperty(+7.3 %, just inside the After CI of ±9.1 %). TheDescendants_ReadOnePropertyreading continues to be the most variable benchmark across runs, but the delta has stayed within the noise band on every cumulative test we've run.Verdict. §2.4 delivers an across-the-board allocation reduction in addition to compounding the §2.2/§2.3 wins on the short-circuit and recursive paths. The cumulative picture across the four changes is now:
.FirstOrDefault()/.Any()/.Take(n)onDescendants()Children().Any()+ recurse pattern* Time deltas within the noise band on the laptop. The
Descendants_ReadOnePropertytime has fluctuated between +3 % and +7 % across the cumulative tests; the cause (lazy iterator dispatch overhead vs eager array enumeration on full-drain consumers) is understood and small enough relative to the wins not to act on at this stage.The bug-report symptoms (#22646) are now decisively addressed: any traversal that short-circuits drops by 95 %+ in time and allocation, the recursive pattern from the issue drops in half, and full-enumeration paths are either neutral or a touch better. Combined with the deterministic per-call allocation reductions across every code path, this is a solid place to land the cumulative changes.
Test 5 - Original v 2.5 (cache descendants per parent on the navigation snapshot). Cumulative (§2.1 + §2.2 + §2.3 + §2.4 + §2.5).
§2.5 caches the full descendant
Guid[]per(parent, contentType?)pair on theNavigationSnapshot, populated lazily on first read and invalidated by any mutation. BothTryGetDescendantsKeys(unfiltered) andTryGetDescendantsKeysOfType(filtered by content type) are cached. After the GlobalSetup pre-warm callsDescendants().Count()once, every subsequent benchmark iteration short-circuits the recursive tree walk entirely.Before
Branch:
v17/improvement/children-and-descendent-retrieval-benchmark. Original baseline; no fixes applied.After
Branch:
v17/improvement/children-and-descendent-retrieval-performancewith §2.1 + §2.2 + §2.3 + §2.4 + §2.5 applied (all committed).Deltas
StdDev was 1 %–12 % of the mean across the benchmarks. 99.9 % CI half-widths (BenchmarkDotNet's
Errorcolumn): Children_Count ±8.9 % / ±5.2 %; Children_ReadOneProperty ±6.1 % / ±5.3 %; Descendants_Count ±9.1 % / ±7.7 %; Descendants_ReadOneProperty ±7.2 % / ±4.8 %; Descendants_FirstOrDefault ±6.1 % / ±5.6 %; RecursiveTraversal ±18.5 % / ±12.8 %.Summary and Conclusions
§2.5 delivers the second dramatic win after §2.2.
Descendants_FirstOrDefaultcollapses from ~14.7 ms to 3.6 μs — a four-thousand-fold speedup — because once the descendants cache is populated, even the residual navigation walk that §2.4 left behind is skipped. The benchmark now spends its time on a single cache lookup, one publish-status check, oneGetById, and yielding the first item.Descendants_FirstOrDefault: −99.98 % time, −99.96 % allocation (14,744 → 3.6 μs; 6,227 → 2.2 KB). Largest single delta we've measured by orders of magnitude. The cachedGuid[]returned by the snapshot's descendants cache eliminates the recursive tree walk entirely; any short-circuit consumer pays only the per-key cost of yielding the items it actually wants.RecursiveTraversal: −50.8 % time, −54.4 % allocation (31,166 → 15,330 μs; 12,941 → 5,899 KB). Time saving is consistent with Test 4 (−49.5 %); allocation saving is bigger this round (−54 % vs Test 4's −33 %) because the recursive walk's intermediate descendants lists no longer get rebuilt for every parent visit.Descendants_ReadOneProperty: −5.2 % time, −11.0 % allocation (21,734 → 20,613 μs; 7,844 → 6,985 KB). The §2.2 lazy-iterator regression we saw on this benchmark in Tests 2-4 is now firmly gone — full-enumeration with property reads is faster on the cumulative path than the original.Descendants_Count: −12.4 % time, +30.1 % allocation (15,622 → 13,683 μs; 6,227 → 8,097 KB). Time improved meaningfully but the allocation reading went up. This is the one anomaly worth flagging. Cache hits should save ~80 KB per call (the avoidedList<Guid>(5050)allocation in the recursive walk) — they shouldn't add 2 MB. Possible explanations: (a) BenchmarkDotNet's MemoryDiagnoser samples allocations in a separate stage and is sensitive to GC heap state, which differs once the cache holds long-livedGuid[]references; (b) run-to-run variance — Test 3's same-config run measured 8,917 KB on this benchmark while Test 4 measured 5,703 KB on the same code, so a 2 MB swing across runs has historical precedent. The +30 % is within the historical envelope on this benchmark across the various tests we've recorded. Worth a closer look if it persists across more runs but not blocking.Children_Count: −4.3 % time, −10.0 % allocation andChildren_ReadOneProperty: +1.2 % time, −9.0 % allocation. Time deltas inside the noise band; allocation savings carried over from §2.4.Cumulative picture — all five fixes
.FirstOrDefault()/.Any()/.Take(n)onDescendants()Children().Any()+ recurse patternDescendants().Count()Descendants()+ property reads* Allocation regression on
Descendants_Countis unexplained by the change's mechanism (cache hits should reduce allocation, not increase it) and is most likely measurement variance between BDN runs — historical readings on this benchmark have ranged from 5.7 MB to 8.9 MB on the post-fix code across separate test sessions. Worth re-measuring later if it persists.The headline result is that the original bug-report scenario now runs in ~0.025 % of the original time. Any traversal pattern that asks for less than the full tree —
FirstOrDefault,Any(predicate),Take(n), recursive.Any()checks — is now effectively free after the first read. Full-enumeration paths are unchanged or better. This is a strong landing point for the cumulative changes.Test 6 - Original v 2.7 (synchronous L0 fast path on the cache wrappers). Cumulative (§2.1 + §2.2 + §2.3 + §2.4 + §2.5 + §2.7).
§2.7 adds a sync
TryGetCached(Guid, [bool], out IPublishedContent?)toIDocumentCacheServiceandIMediaCacheServicethat performs only the L0 in-memory dictionary lookup.DocumentCache.GetById(bool, Guid)andMediaCache.GetById(bool, Guid)now check the fast path first and only fall through toGetByIdAsync(...).GetAwaiter().GetResult()on a miss. On a warm site the per-key sync calls insideFilterAvailable's lazy chain no longer set up an async state machine for each item.Before
Branch:
v17/improvement/children-and-descendent-retrieval-benchmark. Original baseline; no fixes applied.After
Branch:
v17/improvement/children-and-descendent-retrieval-performancewith §2.1 + §2.2 + §2.3 + §2.4 + §2.5 + §2.7 applied (all committed).Deltas
StdDev was 1 %–4 % of the mean across the benchmarks — tightest run we've recorded. 99.9 % CI half-widths (BenchmarkDotNet's
Errorcolumn): Children_Count ±5.9 % / ±3.7 %; Children_ReadOneProperty ±5.8 % / ±1.7 %; Descendants_Count ±3.1 % / ±2.6 %; Descendants_ReadOneProperty ±4.2 % / ±5.5 %; Descendants_FirstOrDefault ±2.5 % / ±12.3 %; RecursiveTraversal ±2.1 % / ±3.8 %.Summary and Conclusions
§2.7 is the change that broadly improves the full-enumeration paths. The earlier §2.4 / §2.5 caches made the navigation walk fast; §2.7 now makes the per-key fetch fast too, by eliminating the async state-machine setup on the dominant L0-hit case. The benchmarks where many keys are pulled through the lazy chain —
Children_*,Descendants_Count,RecursiveTraversal— all show meaningful allocation drops on top of the existing time wins.Children_Countdropped −15 % time, −32 % allocation (108 → 92 μs; 60 → 41 KB). This benchmark callsGetById50 times per iteration. Each call previously paid for an async state-machine setup just to read a single dictionary entry. With the sync fast path, those overheads are gone — both time and allocation reflect that.Children_ReadOnePropertydropped −5 % time, −26 % allocation (148 → 141 μs; 77 → 57 KB). Same per-key savings, but the property-read work dominates the time so the relative time delta is smaller. Allocation savings track the per-call state-machine elimination directly.Descendants_Count: −17 % time, −55 % allocation (12,108 → 10,111 μs; 8,957 → 4,065 KB). Largest allocation drop on the full-enumeration benchmarks — 5,050 per-key sync calls × ~1 KB of state-machine boxes per call eliminated.Descendants_FirstOrDefault: −99.97 % time, −99.98 % allocation (12,085 → 3.1 μs; 8,957 → 1.95 KB). The §2.5 win compounds: the navigation cache returns the descendant array, the first key hits L0 via the new sync fast path, and the result yields without any async machinery whatsoever.RecursiveTraversal: −52 % time, −44 % allocation (23,541 → 11,427 μs; 12,944 → 7,314 KB). The recursive.Any()+ recurse pattern hitsChildren()thousands of times. Eliminating the async state machine on each of those L0 hits is what moves the time delta past Test 5's −50 %.Descendants_ReadOneProperty: −8 % time but +23 % allocation (17,832 → 16,481 μs; 7,844 → 9,615 KB). The allocation reading is the only anomaly in this run. There is no causal mechanism for §2.7 to increase allocation — eliminating async state machines should only ever reduce it. Across our test sessions this benchmark's allocation reading has fluctuated meaningfully (Test 4: 7,805 KB; Test 5: 7,321 KB; Test 6: 9,615 KB) while the post-fix code is essentially identical, so the most plausible explanation is run-to-run variance in BDN's MemoryDiagnoser sampling. Worth flagging but not blocking.Cumulative picture — all six fixes (§2.1 → §2.5, §2.7)
.FirstOrDefault()/.Any()/.Take(n)onDescendants()Children().Any()+ recurse patternDescendants().Count()Descendants()+ property readsChildren().Count()Children()+ property reads*
Descendants_ReadOnePropertyallocation reading is the historically variable benchmark in this suite; the +23 % figure is most plausibly run-to-run BDN MemoryDiagnoser variance rather than a real effect of §2.7.The original bug-report symptom — recursive
.Any()-style traversal taking minutes for thousands of nodes — is now ~50 % faster on top of all the prior savings. Short-circuit consumers are essentially free. Even the full-enumeration benchmarks, which were "noise-band" through Tests 1-3, now show clearly measurable improvements. This is the strongest landing point we've recorded for the cumulative changes.Rendering Results
A real-world Razor rendering test was also run on a 20,000-node, 4-level tree with a sitemap-style template (recursive descendants walk + per-node property reads + short-circuit consumers + repeated
Descendants()calls):Template used for rendering testing
Memory impact
Two of the six fixes add new caches:
NavigationNode._orderedChildren: per-node sorted-childrenGuid[]. Stored on every node in the navigation structure.NavigationSnapshot.DescendantsCache: per-snapshot descendantsGuid[]keyed by(parent, contentType). Populated lazily — only for parents whoseDescendants()(orDescendants().OfType<>()) is actually called.Both caches hold only
Guid[]arrays (16 bytes per entry). They're likely an order of magnitude smaller per item than the existing HybridCache layer, which stores fully convertedIPublishedContentinstances complete with property wrappers, content-type references, and route metadata (hundreds of bytes to several KB per node). There is no unbounded growth path: every cache entry is keyed against state that exists in the navigation structure, so the cache cannot exceed the structure's size.Given that I think the trade-off between faster performance and increased memory footprint is a reasonable one to make.
Potential Further Work
The remaining 1.34× gap to v13 I feel is likely coming from
IPublishedContentfactory work; closing that further would need a deeper rework and is deferred. It could perhaps come for 18+ given we've now removed the.Parentand.Childrenproperties that I understand previously precluded this.Testing
Testing will most be from review of the benchmark and rendering tests I've done above. I've also added various automated tests to verify that the performance refactoring doesn't impact the behaviour.
This item has been added to our backlog AB#68180