Migrations: Add auto upgrade coordination for load-balanced setups #22815

Merged
nikolajlauridsen merged 9 commits into v17/dev from v17/feature/auto-migration-coordination-for-load-balanced-setups
May 18, 2026

@nikolajlauridsen nikolajlauridsen commented May 12, 2026

Summary

In load-balanced deployments with UpgradeUnattended: true, every server previously entered RunMigrationsAsync() concurrently. Schema-altering SQL (ALTER TABLE, CREATE INDEX, etc.) would fail on all-but-one server because the object already existed, causing BootFailed on every server except the first to finish.

  • Introduces IMigrationCoordinator / MigrationCoordinator — a claim-based coordinator that uses an entry in umbracoKeyValue (serialised by WriteLock(Constants.Locks.KeyValues)) to elect exactly one migration leader per upgrade
  • The leader runs the full migration sequence; followers wait until the leader finishes, then perform their own per-server in-memory rebuild (PostRuntimePremigrationsUpgradeNotification) and component initialisation
  • ReleaseLeadership() is called in both the finally block and an ApplicationStopping callback (belt-and-suspenders for graceful Azure SIGTERM restarts)
  • Stale claims (default 2-hour timeout, configurable via UnattendedSettings.MigrationClaimTimeout) are taken over automatically as a last resort for hard crashes
  • Adds 11 unit tests (state machine logic) and 4 integration tests (real DB write/read, stale claim takeover, concurrent race verifying WriteLock serialisation)
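
A minimal sketch of the claim logic described in the bullets above, assuming an in-memory store in place of the umbracoKeyValue row and a plain lock in place of WriteLock(Constants.Locks.KeyValues). InMemoryClaimStore, its members and the way the claim timestamp is tracked are hypothetical names for illustration; the actual MigrationCoordinator may store and compare claims differently.

```csharp
#nullable enable
using System;

// Hypothetical in-memory stand-in for the umbracoKeyValue claim entry.
public sealed class InMemoryClaimStore
{
    private readonly object _writeLock = new(); // plays the role of the DB write lock
    private string? _owner;                     // machine identifier of the current claim holder
    private DateTimeOffset _claimedAt;          // when the current claim was written

    // Returns true when the caller holds the claim after the call.
    public bool TryClaim(string machineId, DateTimeOffset now, TimeSpan claimTimeout)
    {
        lock (_writeLock)
        {
            bool free = _owner is null;
            bool sameMachine = _owner == machineId;                 // reclaim after a restart
            bool stale = !free && now - _claimedAt >= claimTimeout; // last resort for hard crashes

            if (!free && !sameMachine && !stale)
            {
                return false; // another server is the leader; the caller becomes a follower
            }

            _owner = machineId;
            _claimedAt = now;
            return true;
        }
    }

    public void Release(string machineId)
    {
        lock (_writeLock)
        {
            // Only the current holder may clear the claim.
            if (_owner == machineId)
            {
                _owner = null;
            }
        }
    }
}

public static class ClaimDemo
{
    public static void Main()
    {
        var store = new InMemoryClaimStore();
        DateTimeOffset t0 = DateTimeOffset.UtcNow;
        TimeSpan timeout = TimeSpan.FromHours(2); // mirrors the default MigrationClaimTimeout

        Console.WriteLine(store.TryClaim("A", t0, timeout));               // True: no claim yet
        Console.WriteLine(store.TryClaim("B", t0.AddMinutes(1), timeout)); // False: A holds a fresh claim
        Console.WriteLine(store.TryClaim("B", t0.AddHours(3), timeout));   // True: stale takeover
    }
}
```

The point of the sketch is that the read, the decision and the write all happen under one lock, which is what makes "exactly one leader" hold when two servers race.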

Test plan

First add some mock migrations to UmbracoPlan, PreMigrationPlan, etc., so that there is a pending upgrade. Then:

  1. Point two Umbraco instances at the same database with a pending migration and UpgradeUnattended: true
  2. Give each instance a distinct identity by setting a different Umbraco:CMS:Hosting:SiteName in each instance's appsettings.json. This is what IMachineInfoFactory.GetMachineIdentifier() uses to distinguish servers on the same machine
  3. Important: start each instance with Ctrl+F5 in Rider (Run without debugging), not F5. Stopping a debug session started with F5 sends SIGKILL, which bypasses graceful shutdown and prevents the ApplicationStopping callback from releasing the migration leadership claim
  4. Start both instances simultaneously and verify: only one runs migrations, the other waits and initialises normally, and both reach Run

I tried some different scenarios like:

  • Slow migration, leader stops halfway through -> follower gets elected
  • Happy path with slow migration -> leader finishes, follower starts normally
  • Exception during migration -> leader crashes, follower attempts the migration and also crashes (expected)
  • Leader dies with SIGKILL -> leadership isn't released (expected), but because the leader's identity stays the same, it is re-elected after a reboot

Copilot AI review requested due to automatic review settings May 12, 2026 10:41

claude Bot commented May 12, 2026

Claude finished @nikolajlauridsen's task in 7m 59s


PR Review

Target: origin/main · Based on commit: 762869f1 · Skipped: ~40 noise files (non-English lang files, package-lock.json) out of ~90 total in merge ref; 9 reviewable production/test files

Introduces a claim-based migration coordinator (IMigrationCoordinator / MigrationCoordinator) that serializes schema migrations in load-balanced unattended-upgrade deployments — exactly one server writes migrations while followers poll and then rebuild per-server in-memory state.

  • No breaking changes detected. All new types are internal; UnattendedSettings.MigrationClaimTimeout and Constants.Conventions.Migrations.UpgradeLockKey are purely additive public additions with defaults.

Important

  • src/Umbraco.Infrastructure/Install/MigrationCoordinator.cs:50: TryClaimLeadership is called outside any try/catch in the TryBecomeLeaderAsync loop. A transient DB error (network blip, lock timeout) would propagate uncaught and permanently abort the coordination — the while loop exits and TryBecomeLeaderAsync throws. In contrast, DetermineRuntimeLevel() on line 57 is wrapped in a try-catch that logs and retries. For consistency and resilience the claim attempt should also be retried on error:
    try
    {
        if (TryClaimLeadership(machineIdentifier))
        {
            _logger.LogInformation("This server claimed migration leadership.");
            return true;
        }
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Failed to check migration leadership claim; will retry.");
    }

Suggestions

  • src/Umbraco.Infrastructure/Install/MigrationCoordinator.cs:84 / UnattendedUpgradeBackgroundService.cs:77: ReleaseLeadership() can be called concurrently from two paths — the ApplicationStopping callback (line 79 of the service) and the finally block (line 101). _leaderClaim is read-check-then-nulled without synchronization. The DB operations are idempotent (the WriteLock serializes them) but the check-and-null is a data race on a singleton. Using Interlocked.Exchange makes the "release exactly once" intent explicit and safe:

    public void ReleaseLeadership()
    {
        string? claim = Interlocked.Exchange(ref _leaderClaim, null);
        if (claim is null) return;
    
        using ICoreScope scope = _scopeProvider.CreateCoreScope();
        scope.WriteLock(Constants.Locks.KeyValues);
        string? current = _keyValueService.GetValue(Constants.Conventions.Migrations.UpgradeLockKey);
        if (current == claim)
            _keyValueService.SetValue(Constants.Conventions.Migrations.UpgradeLockKey, string.Empty);
        scope.Complete();
    }

    Note: _leaderClaim must be a field (not a property) so that it can be passed by ref to Interlocked.Exchange. It does not need to be volatile, since Interlocked operations are atomic and act as full memory barriers.

  • tests/Umbraco.Tests.UnitTests/Umbraco.Infrastructure/Install/MigrationCoordinatorTests.cs:43 and tests/Umbraco.Tests.Integration/Umbraco.Infrastructure/Install/MigrationCoordinatorTests.cs:49: Test methods use the MethodUnderTest_WhenCondition_ExpectedBehavior naming convention; the project convention (per coding-preferences.md) is Can_/Cannot_ with PascalCase underscores (e.g., Can_ClaimLeadership_WhenNoClaimExists). Not a correctness issue but inconsistent with the rest of the test suite.
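
To illustrate the Interlocked.Exchange suggestion above, here is a small self-contained sketch of the "release exactly once" behaviour. ClaimHolder, DbClears and the "server-a" claim value are hypothetical; the counter stands in for the scoped WriteLock and SetValue call, so this shows only the in-memory race, not the database side.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class ClaimHolder
{
    private string? _leaderClaim = "server-a"; // hypothetical claim value
    private int _dbClears;                     // counts how often the simulated DB clear ran

    public int DbClears => _dbClears;

    public void ReleaseLeadership()
    {
        // Atomically read the claim and null the field; only one caller sees a non-null value.
        string? claim = Interlocked.Exchange(ref _leaderClaim, null);
        if (claim is null)
        {
            return; // someone else already released (or we never held the claim)
        }

        // Stand-in for clearing the claim in the key/value store under the write lock.
        Interlocked.Increment(ref _dbClears);
    }
}

public static class ReleaseDemo
{
    public static void Main()
    {
        var holder = new ClaimHolder();

        // Simulate the finally block and the ApplicationStopping callback racing each other.
        Parallel.Invoke(holder.ReleaseLeadership, holder.ReleaseLeadership);

        Console.WriteLine(holder.DbClears); // 1
    }
}
```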


Approved with Suggestions for improvement

Good overall implementation — the design is sound (WriteLock-serialized claim store, stale-takeover fallback, belt-and-suspenders release), the test coverage is solid (state machine unit tests + real-DB integration tests including the concurrent-race test), and no breaking changes are introduced. The single Important item (unhandled transient DB error in the claim loop) should be addressed before merge; the suggestions are optional improvements.


Copilot AI left a comment

Pull request overview

Adds cross-server coordination to unattended upgrades so only one instance performs schema/database migrations in load-balanced deployments, while other instances wait and then complete per-server initialization.

Changes:

  • Introduces IMigrationCoordinator/MigrationCoordinator to elect a single migration leader via umbracoKeyValue under a write lock.
  • Updates UnattendedUpgradeBackgroundService to run migrations only on the elected leader; followers wait and then rebuild in-memory state via PostRuntimePremigrationsUpgradeNotification.
  • Adds new configuration (UnattendedSettings.MigrationClaimTimeout) plus unit + integration tests for the coordinator state machine and DB-backed concurrency behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/Umbraco.Tests.UnitTests/Umbraco.Infrastructure/Install/UnattendedUpgradeBackgroundServiceTests.cs | Updates test wiring to inject a coordinator and keep existing tests on the “leader” path. |
| tests/Umbraco.Tests.UnitTests/Umbraco.Infrastructure/Install/MigrationCoordinatorTests.cs | Adds unit tests for claim acquisition, follower polling, cancellation, and releasing leadership. |
| tests/Umbraco.Tests.Integration/Umbraco.Infrastructure/Install/MigrationCoordinatorTests.cs | Adds integration tests verifying DB persistence, stale-claim takeover, and concurrent race behavior. |
| src/Umbraco.Infrastructure/Install/UnattendedUpgradeBackgroundService.cs | Uses the coordinator to ensure only one server runs migrations; followers perform post-premigration rebuild. |
| src/Umbraco.Infrastructure/Install/MigrationCoordinator.cs | Implements DB-backed claim logic using umbracoKeyValue and WriteLock(Constants.Locks.KeyValues). |
| src/Umbraco.Infrastructure/Install/IMigrationCoordinator.cs | Introduces the coordinator abstraction used by the hosted service. |
| src/Umbraco.Infrastructure/DependencyInjection/UmbracoBuilder.CoreServices.cs | Registers the coordinator as a singleton and keeps the hosted service registration. |
| src/Umbraco.Core/Constants-Conventions.cs | Adds a dedicated key constant for the migration leadership claim. |
| src/Umbraco.Core/Configuration/Models/UnattendedSettings.cs | Adds MigrationClaimTimeout configuration option with a 2-hour default. |

@claude claude Bot added the area/backend label May 12, 2026
@nikolajlauridsen nikolajlauridsen changed the base branch from main to v17/dev May 13, 2026 07:04
@nikolajlauridsen nikolajlauridsen changed the base branch from v17/dev to main May 13, 2026 07:04
nikolajlauridsen and others added 5 commits May 13, 2026 11:39
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…attendedUpgradeBackgroundService

Ensures DB exceptions thrown during migration coordination set BootFailed
rather than faulting the background service silently.
@nikolajlauridsen nikolajlauridsen force-pushed the v17/feature/auto-migration-coordination-for-load-balanced-setups branch from d4db11d to bf2ec47 Compare May 13, 2026 09:41
@nikolajlauridsen nikolajlauridsen changed the base branch from main to v17/dev May 13, 2026 09:41

@AndyButland AndyButland left a comment

Looks like a solid solution to me @nikolajlauridsen. Here are a few minor code comments for consideration. In the meantime I'll see if I can set up a test locally, once the build has completed.

nikolajlauridsen and others added 2 commits May 13, 2026 12:37
@nikolajlauridsen
Contributor Author

Very good feedback, thanks @AndyButland, it's much appreciated. I've gone ahead and fixed them all 😄

@AndyButland
Contributor

Manual testing

I've manually tested the MigrationCoordinator against two Umbraco sites sharing a single SQL Server database, using a slow custom package migration to make the leader/follower roles observable. The coordinator's mutual-exclusion guarantee held, the follower correctly polled while the leader worked, and I've identified one log-line correctness issue that might be worth a small follow-up tweak.

Setup

Two test sites side-by-side simulating a load-balanced pair on the same dev machine, both running the preview build that contains the coordinator (Umbraco.Cms version 17.6.0--rc.preview.18.gb0cbb97).

Shared configuration:

  • Single SQL Server database Umbraco17UpgradeTest on .\SQLEXPRESS, referenced by both sites' appsettings.json.

  • Umbraco:CMS:Unattended:UpgradeUnattended: true in both sites.

  • Distinct Umbraco:CMS:Hosting:SiteName per site (Test17A / Test17B) so IMachineInfoFactory.GetMachineIdentifier() returns a unique value per instance. Without this both processes share a machine identifier, which triggers the same-machine reclaim shortcut in TryClaimLeadership.

  • Distinct ports per site.

  • Debug-level override scoped to the coordinator only, so the polling cadence is visible without flooding the rest of the log:

    "Serilog": {
      "MinimumLevel": {
        "Default": "Information",
        "Override": {
          "Umbraco.Cms.Infrastructure.Install.MigrationCoordinator": "Debug"
        }
      }
    }

Custom migration plan, registered identically in both projects (same plan name + same step alias, so both sites detect it as pending and coordinate on the same state row):

using Umbraco.Cms.Core.Packaging;
using Umbraco.Cms.Infrastructure.Migrations;

public class SlowTestMigrationPlan : PackageMigrationPlan
{
    public SlowTestMigrationPlan() : base("SlowTestPlan") { }

    protected override void DefinePlan()
        => From(string.Empty).To<SlowMigration>("slow-migration-applied-v1");
}

public class SlowMigration : AsyncMigrationBase
{
    public SlowMigration(IMigrationContext context) : base(context) { }

    protected override Task MigrateAsync()
        => Task.Delay(TimeSpan.FromSeconds(30));
}

To re-run the test against an existing database, reset the relevant umbracoKeyValue rows:

DELETE FROM umbracoKeyValue
WHERE [key] IN ('Umbraco.Core.Upgrader.Lock', 'Umbraco.Core.Upgrader.State+SlowTestPlan');

Start both sites in separate terminals (dotnet run started within ~1 second of each other) and observe each site's console.

Test results

Site A — leader

[12:54:50 INF] Acquiring MainDom.
[12:54:50 INF] Acquired MainDom.
[12:54:52 INF] Unattended upgrade background service started.
[12:54:52 INF] This server claimed migration leadership.
[12:54:52 INF] Starting package migration for SlowTestPlan [Timing ac259d4]
[12:54:52 INF] Starting 'SlowTestPlan'...
[12:54:52 INF] At origin
[12:54:52 INF] Execute SlowMigration
[12:55:22 INF] At slow-migration-applied-v1
[12:55:22 INF] Done
[12:55:22 INF] Package migration completed for SlowTestPlan (30033ms) [Timing ac259d4]
[12:55:23 INF] Unattended upgrade completed successfully.

Site A claimed the lock immediately, held it for the full 30 s while SlowMigration ran, and released on completion. The migration took 30,033 ms — i.e. the real work.

Site B — follower (polled, then claimed an already-released lock)

[12:54:53 INF] Acquiring MainDom.
[12:54:53 INF] Acquired MainDom.
[12:54:54 INF] Unattended upgrade background service started.
[12:54:55 DBG] Waiting for migration leader to finish...    ← T+0
[12:55:00 DBG] Waiting for migration leader to finish...    ← T+5
[12:55:05 DBG] Waiting for migration leader to finish...    ← T+10
[12:55:10 DBG] Waiting for migration leader to finish...    ← T+15
[12:55:15 DBG] Waiting for migration leader to finish...    ← T+20
[12:55:20 DBG] Waiting for migration leader to finish...    ← T+25
[12:55:25 INF] This server claimed migration leadership.    ← T+30
[12:55:25 INF] Starting package migration for SlowTestPlan [Timing 43b670a]
[12:55:25 INF] Starting 'SlowTestPlan'...
[12:55:25 INF] At slow-migration-applied-v1
[12:55:25 INF] Done
[12:55:25 INF] Package migration completed for SlowTestPlan (6ms) [Timing 43b670a]
[12:55:25 INF] Unattended upgrade completed successfully.

Site B polled six times at exactly the configured 5-second interval (T+0 through T+25), then on the iteration immediately following Site A's release (T+30) it logged This server claimed migration leadership. and ran SlowTestPlan to completion in 6 ms — i.e. a no-op, because plan state was already slow-migration-applied-v1.

Interpretation

The safety guarantee held. Site A held the lock for the entire 30 seconds; Site B genuinely waited; only Site A actually executed SlowMigration. The 6 ms vs 30,033 ms gap between the two migration runs is the unambiguous signal that mutual exclusion worked.

There is one log-line correctness issue. Site B logged This server claimed migration leadership. even though it had nothing to migrate. This happens because of the iteration order in TryBecomeLeaderAsync:

  1. Try to claim the lock.
  2. If that succeeded, return true immediately.
  3. Otherwise re-check RuntimeLevel and either return false as follower, or sleep and retry.

When the leader releases right before the follower's next poll, the follower's TryClaimLeadership succeeds (lock key is empty) and returns true before step 3 ever runs. The follower never gets a chance to notice that the work is already done — it claims an empty lock, logs as leader, and then RunMigrationsAsync no-ops because plan state is already advanced.

This isn't a functional bug — the system still ends up correct, and the no-op RunMigrationsAsync is harmless. But the log line is misleading: it claims B was the leader when in reality A did all the work.
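
The iteration order described above can be reproduced in a small model. This is only a sketch under stated assumptions: the tryClaim and determineLevel delegates are hypothetical stand-ins for TryClaimLeadership and the DetermineRuntimeLevel re-check, Level is a simplified two-value enum rather than the real RuntimeLevel, and cancellation and error handling are reduced to the bare minimum.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum Level { Upgrading, Run } // simplified stand-in, not the real RuntimeLevel

public static class LeaderLoop
{
    public static async Task<bool> TryBecomeLeaderAsync(
        Func<bool> tryClaim,
        Func<Level> determineLevel,
        TimeSpan pollInterval,
        CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            if (tryClaim())                    // step 1: try to claim the lock
            {
                return true;                   // step 2: leader, level is not re-checked
            }

            if (determineLevel() == Level.Run) // step 3: the leader has already finished
            {
                return false;                  // proceed as follower
            }

            await Task.Delay(pollInterval, cancellationToken); // sleep, then retry
        }

        return false;
    }
}

public static class LoopDemo
{
    public static async Task Main()
    {
        // The lock is already free and the DB is already at Run, as in Site B's final poll.
        bool leader = await LeaderLoop.TryBecomeLeaderAsync(
            tryClaim: () => true,
            determineLevel: () => Level.Run,
            pollInterval: TimeSpan.FromMilliseconds(10),
            cancellationToken: CancellationToken.None);

        Console.WriteLine(leader); // True, although there is nothing left to migrate
    }
}
```

Because the claim is attempted before the level check, a free lock always wins, which matches the Site B log.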

Suggested follow-up

After a successful claim, re-check the runtime level. If the database is already at Run, release the just-acquired claim and log/return as follower:

if (TryClaimLeadership(machineIdentifier))
{
    // We won the claim — but the previous leader may have released between our boot-time
    // DetermineRuntimeLevel and our claim, in which case the DB is already at the target version.
    try { _runtimeState.DetermineRuntimeLevel(); }
    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { return false; }
    catch (Exception ex) { _logger.LogWarning(ex, "Could not re-determine runtime level after claiming; will proceed as leader."); }

    if (_runtimeState.Level == RuntimeLevel.Run)
    {
        ReleaseLeadership();
        _logger.LogInformation("Migrations completed by another server; proceeding as follower.");
        return false;
    }

    _logger.LogInformation("This server claimed migration leadership.");
    return true;
}

With this in place the second log block above would end with Migrations completed by another server; proceeding as follower. instead of the misleading leader line, and the no-op RunMigrationsAsync call would be skipped entirely.

What's confirmed working

  • Distinct SiteName produces distinct machine identifiers and avoids the same-machine reclaim shortcut.
  • The polling loop runs at the configured 5-second interval and respects cancellation.
  • WriteLock(KeyValues) correctly serialises the read-then-write across two SQL Server connections.
  • ReleaseLeadership clears the lock key on the leader and only the leader.
  • The follower's wait time is bounded by the leader's actual migration duration (here ~30 s), not by the stale-claim timeout.


@AndyButland AndyButland left a comment

The updates all look good @nikolajlauridsen and I've completed testing now. I've shared the details above so you can see what I did to verify, and for the benefit of anyone else who looks at testing this in future.

Other than one possible improvement around the logging, it all looks to work as expected. I'll hold the "approve" for you to consider that, but if you feel it's not a concern we could just leave it and move on. It's not critical or necessarily wrong; I just found it a bit confusing when I first saw it.

@nikolajlauridsen
Contributor Author

Hey @AndyButland, thank you so much for the thorough review and testing. It's highly appreciated.

I've added the fix you suggested. I agree that it's confusing, and this is a much better experience. I've also run the manual tests again and added a test covering it, so I'll go ahead and consider this as approved 👍

…meLevel check

The winner now calls DetermineRuntimeLevel() once from the post-claim check
and must see Upgrading; the loser polls twice before seeing Run. Transition
the mock on the second call instead of the first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nikolajlauridsen nikolajlauridsen merged commit 80e2764 into v17/dev May 18, 2026
26 of 27 checks passed
@nikolajlauridsen nikolajlauridsen deleted the v17/feature/auto-migration-coordination-for-load-balanced-setups branch May 18, 2026 10:06
idseefeld pushed a commit to idseefeld/Umbraco-CMS that referenced this pull request May 19, 2026
…mbraco#22815)

* Add auto upgrade coordination for load balanced setups

* Add tests

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix(infrastructure): move TryBecomeLeaderAsync inside try/catch in UnattendedUpgradeBackgroundService

Ensures DB exceptions thrown during migration coordination set BootFailed
rather than faulting the background service silently.

* Fix feedback

* Update src/Umbraco.Infrastructure/Install/MigrationCoordinator.cs

Co-authored-by: Andy Butland <abutland73@gmail.com>

* Recheck state

* fix(tests): update concurrent race test for post-claim DetermineRuntimeLevel check

The winner now calls DetermineRuntimeLevel() once from the post-claim check
and must see Upgrading; the loser polls twice before seeing Run. Transition
the mock on the second call instead of the first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Butland <abutland73@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
nikolajlauridsen added a commit that referenced this pull request May 20, 2026
…22815)

This was referenced May 22, 2026