Published Cache: Fix multi-site domains falling back to the first root node after restart#23084
Conversation
|
Claude finished @AndyButland's task in 5m 19s —— View job PR ReviewTarget: Fixes a startup race in
Suggestions
Approved with SuggestionsThe core fix is correct and mirrors the pattern used in the sibling #22393. The double-checked lock is properly guarded by |
There was a problem hiding this comment.
Pull request overview
Fixes a startup race in DomainCacheService where concurrent first access could observe _initialized == true before the domain cache had finished loading, causing multi-site domain resolution to fall back to the first root node after an app restart.
Changes:
- Makes domain-cache lazy initialization thread-safe via double-checked locking and only sets
_initializedafterLoadDomains()completes. - Rebuilds domains into a fresh
ConcurrentDictionaryand swaps it in atomically withInterlocked.Exchangeto avoid transient partial/empty reads during rebuild. - Adds unit tests covering wildcard filtering, concurrent first-access behavior, and single-item refresh add/update semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/Umbraco.PublishedCache.HybridCache/Services/DomainCacheService.cs |
Hardens domain-cache initialization and rebuild logic against concurrent access after restart. |
tests/Umbraco.Tests.UnitTests/Umbraco.PublishedCache.HybridCache/DomainCacheServiceTests.cs |
Adds unit coverage for the race condition and refresh/filtering behaviors. |
|
@AndyButland this looks really good. Makes absolute sense. I wonder if it'd be meaningful to introduce a notification handler for Likely, other caches would also benefit from eager load, so it could be a generic-ish notification handler for that very purpose. Something like: using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Notifications;
using Umbraco.Cms.Core.PublishedCache;
namespace Umbraco.Cms.Web.UI.Custom;
internal sealed class EagerLoadCriticalRuntimeCachesNotificationHandler : INotificationHandler<UmbracoApplicationStartedNotification>
{
private readonly IDomainCacheService _domainCacheService;
public EagerLoadCriticalRuntimeCachesNotificationHandler(IDomainCacheService domainCacheService)
=> _domainCacheService = domainCacheService;
public void Handle(UmbracoApplicationStartedNotification notification) => _domainCacheService.GetAll(true);
}This won't remove the race condition, since the app is ready to serve requests at this point, but it would probably lower the chance of it happening. On the flipside, it potentially masks a bug in the system. Thoughts? |
|
Seems a good idea to make sure that the cache is loaded before the first request. But are you suggesting this as well as the what I've proposed, or instead of? Seems even with this there's still a chance a request could jump in first, so would be best to have the race condition hardening as well. |
|
Oh, definitively as an addendum to this PR 😄 |
|
In which case perhaps one to consider in a separate PR to keep this focussed on fixing the race condition? As you say, could be there are others to consider, and we could look at making it part of the existing |
|
All good for me. I will make a note for a future task 👍 |
…t node after restart (#23084) * Prevent empty domain cache during concurrent initialization. * Addressed code review comments and added further comment to the code. * Use Lock object.
…t node after restart (#23084) * Prevent empty domain cache during concurrent initialization. * Addressed code review comments and added further comment to the code. * Use Lock object.
|
Cherry-picked to |
IDomainService.GetAll was removed in #22629; DomainCacheServiceTests was added later in #23084 against a stale base and still mocked the removed method, breaking the Release build on release/18.0. Production DomainCacheService already calls GetAllAsync, so update the four mock setups to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Description
There's no public issue for this yet — it surfaced from a partner support case:
Potential cause
I say potential, as the issue isn't easily replicable, so whilst it looks like a legitimate issue to fix, it's not possible to 100% verify that it applies in this case.
DomainCacheServiceis a lazily-initialised singleton: the domain cache is loaded from the database on the first request that needs it. The initialisation guard set its_initializedflag totruebeforeLoadDomains()had finished populating the cache:Under concurrent first access — typical right after a restart, when several requests arrive together — a second request could observe
_initialized == true, skip loading, and read an empty domain cache. With no domains,PublishedRouter.FindDomainfinds no match andContentFinderByUrlNewfalls back to routing under the first root node. The result: on a multi-site setup, every site serves the first site's content until the cache is rebuilt or the app is restarted again.We applied similar defensive fixes to published-content-cache race conditions in #22393 (repoted in #22254 and #22384).
Fix
_initializedonly afterLoadDomains()completes, guarded by a double-checked lock, so concurrent callers block until the cache is populated rather than reading it empty.volatile_domainsfield. Marking the fieldvolatilegives the lock-free readers on the routing hot path an acquire-read, so they consistently observe the fully populated dictionary (and the_initializedflag) — strictly required on weak memory models such as ARM, where Published Content Cache: Defensive hardening against race conditions (closes #22254, #22384) #22393'sInterlocked.Exchange-on-a-non-volatile-field approach leaves the reads unfenced. This also removes a transient partial-population window duringRefreshAlland drops domains removed since the last load.Tidy-up
<inheritdoc />for interface members).AddOrUpdatecalls (one of which constructed the value twice) with theConcurrentDictionaryindexer.Testing
Automated tests
I've added unit tests that reliably show the issue before the fix by failing and pass once the fix is in place.
Cannot_Observe_Empty_Domain_Cache_During_Concurrent_First_Accessverifies the fix. It mocksIDomainServiceso the first caller blocks inside the gated domain load (ManualResetEventSlim); a second caller then races in. The test asserts a blocking invariant: with a correctly initialised cache the second caller must block on initialisation and cannot complete until the load is released, so it never observes an empty cache. With the bug it skips initialisation and completes immediately with an empty result, which the test detects and fails on. Because it asserts blocking rather than waiting out a fixed delay, it's deterministic regardless of CI scheduling. (Verified by reintroducing the bug: fails 3/3; with the fix: passes.)Can_Get_Configured_Domains_Excluding_Wildcardscovers the load path and wildcard filtering.Can_Refresh_To_Add_And_Update_A_Domaincovers the single-itemRefreshpath: an unknown domain is added, and a subsequent refresh of the same id updates it in place without duplicating — guarding theAddOrUpdate→ indexer change.Can_Replace_Entire_Cache_On_RefreshAllcovers theRefreshAllpath: loads two domains, removes one at the source, firesRefreshAll, and asserts the cache is fully rebuilt with the removed domain dropped.ManualResetEventSlimwas new to me - here's some explanation.Manual
This is a startup race, so it's awkward to trigger by hand, but, on a multi-site install with two root nodes, each with its own assigned domain/hostname, restart the app and issue concurrent requests to both hostnames. Before the fix, both can intermittently serve the first root node's content; after the fix they consistently resolve to their own root.