Published Cache: Fix multi-site domains falling back to the first root node after restart #23084

Merged: kjac merged 3 commits into v17/dev from v17/bugfix/domain-cache-init-race on Jun 7, 2026

Conversation

AndyButland (Contributor) commented Jun 6, 2026

Description

There's no public issue for this yet — it surfaced from a partner support case:

I have a customer with a multisite setup, where hostnames will unpredictably all point to the same site. They are currently on CMS v17.3.5 and are experiencing an issue where sometimes after a reboot, both custom domains on the site will serve content from the first root node.

Potential cause

I say potential, as the issue isn't easily replicable, so whilst it looks like a legitimate issue to fix, it's not possible to 100% verify that it applies in this case.

DomainCacheService is a lazily-initialised singleton: the domain cache is loaded from the database on the first request that needs it. The initialisation guard set its _initialized flag to true before LoadDomains() had finished populating the cache:

if (_initialized) return;
_initialized = true;   // set before the load
LoadDomains();

Under concurrent first access — typical right after a restart, when several requests arrive together — a second request could observe _initialized == true, skip loading, and read an empty domain cache. With no domains, PublishedRouter.FindDomain finds no match and ContentFinderByUrlNew falls back to routing under the first root node. The result: on a multi-site setup, every site serves the first site's content until the cache is rebuilt or the app is restarted again.

We applied similar defensive fixes to published-content-cache race conditions in #22393 (reported in #22254 and #22384).

Fix

  • Flip _initialized only after LoadDomains() completes, guarded by a double-checked lock, so concurrent callers block until the cache is populated rather than reading it empty.
  • Build the domain set into a fresh dictionary and publish it as a single write to a volatile _domains field. Marking the field volatile gives the lock-free readers on the routing hot path an acquire-read, so they consistently observe the fully populated dictionary (and the _initialized flag). This is strictly required on weak memory models such as ARM, where the Interlocked.Exchange-on-a-non-volatile-field approach from #22393 (Published Content Cache: Defensive hardening against race conditions) leaves the reads unfenced. Publishing a fresh dictionary also removes a transient partial-population window during RefreshAll and drops domains removed since the last load.
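
For readers who want the shape of the two points above in one place, here is a minimal sketch. It is not the actual DomainCacheService code: the class name LazyDomainCacheSketch, the loader delegate, and the use of string values in place of Domain are all assumptions made for illustration. It only shows the pattern described: the flag flips after the load, inside a double-checked lock, and the rebuilt dictionary is published with a single write to a volatile field.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical stand-in for the real service; names and types are illustrative only.
public sealed class LazyDomainCacheSketch
{
    // Assumed loader: stands in for whatever reads the domains from the database.
    private readonly Func<IReadOnlyDictionary<int, string>> _loadFromSource;
    private readonly object _initializationLock = new();

    // volatile: lock-free readers always read the most recently published reference.
    private volatile ConcurrentDictionary<int, string> _domains = new();
    private volatile bool _initialized;

    public LazyDomainCacheSketch(Func<IReadOnlyDictionary<int, string>> loadFromSource)
        => _loadFromSource = loadFromSource;

    public ICollection<string> GetAll()
    {
        InitializeIfMissing();
        return _domains.Values; // snapshot of the currently published dictionary
    }

    // Rebuilds the whole cache; domains missing from the source are dropped.
    public void RefreshAll() => LoadDomains();

    private void InitializeIfMissing()
    {
        if (_initialized)
        {
            return; // fast path once the cache has been populated
        }

        lock (_initializationLock)
        {
            if (_initialized)
            {
                return; // another caller finished the load while we waited
            }

            LoadDomains();
            _initialized = true; // flipped only after the load has completed
        }
    }

    private void LoadDomains()
    {
        // Build into a fresh dictionary, then publish it with one reference write,
        // so readers see either the old dictionary or the complete new one.
        var fresh = new ConcurrentDictionary<int, string>(_loadFromSource());
        _domains = fresh;
    }
}
```

In this sketch a caller that arrives during the first load waits on the lock instead of reading the empty dictionary, and a RefreshAll never exposes a half-filled dictionary because readers only ever hold a reference to a completed one.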

Tidy-up

  • Added XML docs to the class, constructor, and methods (<inheritdoc /> for interface members).
  • Removed two low-value comments.
  • Replaced two redundant AddOrUpdate calls (one of which constructed the value twice) with the ConcurrentDictionary indexer.

Testing

Automated tests

I've added unit tests that reliably fail before the fix and pass once it is in place.

  • Cannot_Observe_Empty_Domain_Cache_During_Concurrent_First_Access verifies the fix. It mocks IDomainService so the first caller blocks inside the gated domain load (ManualResetEventSlim); a second caller then races in. The test asserts a blocking invariant: with a correctly initialised cache the second caller must block on initialisation and cannot complete until the load is released, so it never observes an empty cache. With the bug it skips initialisation and completes immediately with an empty result, which the test detects and fails on. Because it asserts blocking rather than waiting out a fixed delay, it's deterministic regardless of CI scheduling. (Verified by reintroducing the bug: fails 3/3; with the fix: passes.)
  • Can_Get_Configured_Domains_Excluding_Wildcards covers the load path and wildcard filtering.
  • Can_Refresh_To_Add_And_Update_A_Domain covers the single-item Refresh path: an unknown domain is added, and a subsequent refresh of the same id updates it in place without duplicating — guarding the AddOrUpdate → indexer change.
  • Can_Replace_Entire_Cache_On_RefreshAll covers the RefreshAll path: loads two domains, removes one at the source, fires RefreshAll, and asserts the cache is fully rebuilt with the removed domain dropped.

ManualResetEventSlim was new to me - here's some explanation.

ManualResetEventSlim is a lightweight thread-synchronisation primitive from System.Threading — essentially a gate. A thread calls Wait() to block until another thread calls Set() to open the gate.

In the test it's used purely to make the race deterministic: it lets us freeze one caller inside the domain load while a second caller races in, instead of relying on luck to hit a microsecond-wide window. The Set()/Wait() pair also establishes a memory barrier, which is what makes the second thread reliably observe the prematurely-set _initialized flag on the unfixed code.
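
As a rough illustration of the gating idea, the sketch below freezes a first caller inside the load and then checks that a second caller cannot finish until the gate is opened. Everything here is hypothetical (GatedCache, GateDemo, the string array payload, the 200 ms probe); the real test mocks IDomainService and may be structured differently.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical lazily initialised cache using the fixed (double-checked lock) pattern.
public sealed class GatedCache
{
    private readonly Func<string[]> _load;
    private readonly object _lock = new();
    private volatile string[] _items = Array.Empty<string>();
    private volatile bool _initialized;

    public GatedCache(Func<string[]> load) => _load = load;

    public string[] GetAll()
    {
        if (!_initialized)
        {
            lock (_lock)
            {
                if (!_initialized)
                {
                    _items = _load();
                    _initialized = true; // only after the load has completed
                }
            }
        }

        return _items;
    }
}

public static class GateDemo
{
    public static (bool SecondBlocked, int SecondCount) Run()
    {
        using var loadEntered = new ManualResetEventSlim(false);
        using var releaseLoad = new ManualResetEventSlim(false);

        var cache = new GatedCache(() =>
        {
            loadEntered.Set();   // signal: the first caller is inside the load
            releaseLoad.Wait();  // freeze here until the test opens the gate
            return new[] { "a.example", "b.example" };
        });

        Task<string[]> first = Task.Run(() => cache.GetAll());
        loadEntered.Wait();      // the first caller is now parked inside the load

        Task<string[]> second = Task.Run(() => cache.GetAll());

        // The load has not been released, so a correct cache keeps the second
        // caller blocked; a buggy one would return an empty array straight away.
        bool secondBlocked = !second.Wait(TimeSpan.FromMilliseconds(200));

        releaseLoad.Set();       // open the gate and let both callers finish
        Task.WaitAll(first, second);

        return (secondBlocked, second.Result.Length);
    }
}
```

The count check does not depend on timing: with the premature flag the second caller can only ever see the empty array, because the load is still held at the gate when it reads.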

Manual

This is a startup race, so it's awkward to trigger by hand. On a multi-site install with two root nodes, each with its own assigned domain/hostname, restart the app and issue concurrent requests to both hostnames. Before the fix, both can intermittently serve the first root node's content; after the fix they consistently resolve to their own root.

Copilot AI review requested due to automatic review settings June 6, 2026 08:16

claude Bot commented Jun 6, 2026

Claude finished @AndyButland's task in 5m 19s


PR Review

Target: origin/v17/dev · Based on commit: 496dfc7a

Fixes a startup race in DomainCacheService where _initialized was set before LoadDomains() completed, causing concurrent callers to observe an empty domain cache and fall back to the first root node. The fix applies double-checked locking and swaps the rebuilt dictionary in atomically.

  • Modified public API: None — IDomainCacheService interface and DomainCacheService constructor are unchanged
  • Breaking changes: None detected

Suggestions

  • src/Umbraco.PublishedCache.HybridCache/Services/DomainCacheService.cs:26: _domains is mutated by Interlocked.Exchange (from LoadDomains() during a RefreshAll refresh) while concurrent readers access it without any synchronisation. The volatile read of _initialized in InitializeIfMissing() establishes a happens-before guarantee for the initial swap, but subsequent RefreshAll-triggered swaps have no corresponding read fence at the call sites. On x86/x64 this is fine (all reads are acquire-reads). On ARM — where .NET is increasingly deployed — the JIT may not emit LDAR for an ordinary field read without volatile, so a reader thread could theoretically hold a stale reference to the old dictionary. Marking _domains as volatile (or using Volatile.Read/Volatile.Write at the exchange site) is the strictly-correct .NET memory-model fix and has negligible runtime cost at the read sites since every path through GetAll/GetAssigned/HasAssigned already pays the cost of ConcurrentDictionary operations:

    private volatile ConcurrentDictionary<int, Domain> _domains = new();
  • tests/Umbraco.Tests.UnitTests/Umbraco.PublishedCache.HybridCache/DomainCacheServiceTests.cs:80: await Task.Delay(250) relies on wall-clock time to ensure the second task is blocked on _initializationLock before the first load is released. On a heavily loaded CI host the task scheduler may not have given the second task enough CPU time to reach the lock within 250ms, meaning releaseLoad.Set() fires before the second caller is blocked — the test would still pass (the second caller sees the already-populated cache) but would not exercise the concurrent race it claims to cover. The PR author confirms the test fails before the fix, which suggests 250ms is sufficient in practice; this is a minor nit. A fully deterministic approach would gate the second caller with its own ManualResetEventSlim to confirm it is inside GetAll before the load is released, but that would require a test seam in the SUT, which may not be worth the complexity.

  • tests/Umbraco.Tests.UnitTests/Umbraco.PublishedCache.HybridCache/DomainCacheServiceTests.cs: No test covers DomainChangeTypes.RefreshAll — the path that calls LoadDomains() from inside Refresh() via a manually-created outer scope. Particularly worth testing now that LoadDomains() uses Interlocked.Exchange — a RefreshAll payload should replace the entire cache (including dropping domains removed since last load), which is a behaviour change from the previous AddOrUpdate approach.


Approved with Suggestions

The core fix is correct and mirrors the pattern used in the sibling #22393. The double-checked lock is properly guarded by volatile _initialized and _initialized is set only after LoadDomains() completes. The atomic swap via Interlocked.Exchange is a genuine improvement over the previous in-place mutation. Tests are well-structured and the concurrency test demonstrates the regression path. The volatile note above is the only item worth considering before merge.

Copilot AI (Contributor) left a comment

Pull request overview

Fixes a startup race in DomainCacheService where concurrent first access could observe _initialized == true before the domain cache had finished loading, causing multi-site domain resolution to fall back to the first root node after an app restart.

Changes:

  • Makes domain-cache lazy initialization thread-safe via double-checked locking and only sets _initialized after LoadDomains() completes.
  • Rebuilds domains into a fresh ConcurrentDictionary and swaps it in atomically with Interlocked.Exchange to avoid transient partial/empty reads during rebuild.
  • Adds unit tests covering wildcard filtering, concurrent first-access behavior, and single-item refresh add/update semantics.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/Umbraco.PublishedCache.HybridCache/Services/DomainCacheService.cs | Hardens domain-cache initialization and rebuild logic against concurrent access after restart. |
| tests/Umbraco.Tests.UnitTests/Umbraco.PublishedCache.HybridCache/DomainCacheServiceTests.cs | Adds unit coverage for the race condition and refresh/filtering behaviors. |

claude Bot added the area/backend label Jun 6, 2026

kjac (Contributor) commented Jun 7, 2026

@AndyButland this looks really good. Makes absolute sense.

I wonder if it'd be meaningful to introduce a notification handler for UmbracoApplicationStartedNotification to eager-load the domain cache (call IDomainCacheService.GetAll(true)).

Likely, other caches would also benefit from eager load, so it could be a generic-ish notification handler for that very purpose. Something like:

using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Notifications;
using Umbraco.Cms.Core.PublishedCache;

namespace Umbraco.Cms.Web.UI.Custom;

internal sealed class EagerLoadCriticalRuntimeCachesNotificationHandler : INotificationHandler<UmbracoApplicationStartedNotification>
{
    private readonly IDomainCacheService _domainCacheService;

    public EagerLoadCriticalRuntimeCachesNotificationHandler(IDomainCacheService domainCacheService)
        => _domainCacheService = domainCacheService;

    public void Handle(UmbracoApplicationStartedNotification notification) => _domainCacheService.GetAll(true);
}

This won't remove the race condition, since the app is ready to serve requests at this point, but it would probably lower the chance of it happening. On the flipside, it potentially masks a bug in the system.

Thoughts?

kjac self-requested a review June 7, 2026 06:37

AndyButland (Contributor, Author) commented

Seems a good idea to make sure that the cache is loaded before the first request. But are you suggesting this as well as what I've proposed, or instead of it? Even with this there's still a chance a request could jump in first, so it would be best to have the race condition hardening as well.

kjac (Contributor) commented Jun 7, 2026

Oh, definitely as an addendum to this PR 😄

AndyButland (Contributor, Author) commented

In which case perhaps one to consider in a separate PR to keep this focussed on fixing the race condition? As you say, could be there are others to consider, and we could look at making it part of the existing SeedingNotificationHandler (which is already responsible for seeding document and media caches after startup).

kjac (Contributor) commented Jun 7, 2026

All good for me. I will make a note for a future task 👍

kjac merged commit 4cc4ace into v17/dev Jun 7, 2026
27 of 28 checks passed
kjac deleted the v17/bugfix/domain-cache-init-race branch June 7, 2026 13:35
AndyButland added a commit that referenced this pull request Jun 15, 2026
…t node after restart (#23084)

* Prevent empty domain cache during concurrent initialization.

* Addressed code review comments and added further comment to the code.

* Use Lock object.
AndyButland added a commit that referenced this pull request Jun 15, 2026
…t node after restart (#23084)

* Prevent empty domain cache during concurrent initialization.

* Addressed code review comments and added further comment to the code.

* Use Lock object.

AndyButland (Contributor, Author) commented

Cherry-picked to release/17.5.0 and release/18.0 for 17.5.0 and 18.0.0-rc3 respectively.

iOvergaard added a commit that referenced this pull request Jun 17, 2026
IDomainService.GetAll was removed in #22629; DomainCacheServiceTests was
added later in #23084 against a stale base and still mocked the removed
method, breaking the Release build on release/18.0. Production
DomainCacheService already calls GetAllAsync, so update the four mock
setups to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>