Migrations: Add auto upgrade coordination for load-balanced setups#22815
Conversation
|
Claude finished @nikolajlauridsen's task in 7m 59s —— View job PR ReviewTarget: Introduces a claim-based migration coordinator (
Important
Suggestions
Approved with Suggestions for improvementGood overall implementation — the design is sound (WriteLock-serialized claim store, stale-takeover fallback, belt-and-suspenders release), the test coverage is solid (state machine unit tests + real-DB integration tests including the concurrent-race test), and no breaking changes are introduced. The single Important item (unhandled transient DB error in the claim loop) should be addressed before merge; the suggestions are optional improvements. |
There was a problem hiding this comment.
Pull request overview
Adds cross-server coordination to unattended upgrades so only one instance performs schema/database migrations in load-balanced deployments, while other instances wait and then complete per-server initialization.
Changes:
- Introduces
IMigrationCoordinator/MigrationCoordinatorto elect a single migration leader viaumbracoKeyValueunder a write lock. - Updates
UnattendedUpgradeBackgroundServiceto run migrations only on the elected leader; followers wait and then rebuild in-memory state viaPostRuntimePremigrationsUpgradeNotification. - Adds new configuration (
UnattendedSettings.MigrationClaimTimeout) plus unit + integration tests for the coordinator state machine and DB-backed concurrency behavior.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/Umbraco.Tests.UnitTests/Umbraco.Infrastructure/Install/UnattendedUpgradeBackgroundServiceTests.cs | Updates test wiring to inject a coordinator and keep existing tests on the “leader” path. |
| tests/Umbraco.Tests.UnitTests/Umbraco.Infrastructure/Install/MigrationCoordinatorTests.cs | Adds unit tests for claim acquisition, follower polling, cancellation, and releasing leadership. |
| tests/Umbraco.Tests.Integration/Umbraco.Infrastructure/Install/MigrationCoordinatorTests.cs | Adds integration tests verifying DB persistence, stale-claim takeover, and concurrent race behavior. |
| src/Umbraco.Infrastructure/Install/UnattendedUpgradeBackgroundService.cs | Uses the coordinator to ensure only one server runs migrations; followers perform post-premigration rebuild. |
| src/Umbraco.Infrastructure/Install/MigrationCoordinator.cs | Implements DB-backed claim logic using umbracoKeyValue and WriteLock(Constants.Locks.KeyValues). |
| src/Umbraco.Infrastructure/Install/IMigrationCoordinator.cs | Introduces the coordinator abstraction used by the hosted service. |
| src/Umbraco.Infrastructure/DependencyInjection/UmbracoBuilder.CoreServices.cs | Registers the coordinator as a singleton and keeps the hosted service registration. |
| src/Umbraco.Core/Constants-Conventions.cs | Adds a dedicated key constant for the migration leadership claim. |
| src/Umbraco.Core/Configuration/Models/UnattendedSettings.cs | Adds MigrationClaimTimeout configuration option with a 2-hour default. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…attendedUpgradeBackgroundService Ensures DB exceptions thrown during migration coordination set BootFailed rather than faulting the background service silently.
d4db11d to
bf2ec47
Compare
AndyButland
left a comment
There was a problem hiding this comment.
Looks a solid solution to me @nikolajlauridsen. Here's a few minor code comments for consideration. Meantime I'll take a look at seeing if I can setup a test locally, once the build has completed.
Co-authored-by: Andy Butland <abutland73@gmail.com>
|
Very good feedback, thanks @AndyButland, it's much appreciated, I've gone ahead and fixed them all 😄 |
Manual testingI've manually tested the SetupTwo test sites side-by-side simulating a load-balanced pair on the same dev machine, both running the preview build that contains the coordinator ( Shared configuration:
Custom migration plan, registered identically in both projects (same plan name + same step alias, so both sites detect it as pending and coordinate on the same state row): using Umbraco.Cms.Core.Packaging;
using Umbraco.Cms.Infrastructure.Migrations;
public class SlowTestMigrationPlan : PackageMigrationPlan
{
public SlowTestMigrationPlan() : base("SlowTestPlan") { }
protected override void DefinePlan()
=> From(string.Empty).To<SlowMigration>("slow-migration-applied-v1");
}
public class SlowMigration : AsyncMigrationBase
{
public SlowMigration(IMigrationContext context) : base(context) { }
protected override Task MigrateAsync()
=> Task.Delay(TimeSpan.FromSeconds(30));
}To re-run the test against an existing database, reset the relevant DELETE FROM umbracoKeyValue
WHERE [key] IN ('Umbraco.Core.Upgrader.Lock', 'Umbraco.Core.Upgrader.State+SlowTestPlan');Start both sites in separate terminals ( Test resultsSite A — leaderSite A claimed the lock immediately, held it for the full 30 s while Site B — follower (polled, then claimed an already-released lock)Site B polled six times at exactly the configured 5-second interval (T+0 through T+25), then on the iteration immediately following Site A's release (T+30) it logged InterpretationThe safety guarantee held. Site A held the lock for the entire 30 seconds; Site B genuinely waited; only Site A actually executed There is one log-line correctness issue. Site B logged
When the leader releases right before the follower's next poll, the follower's This isn't a functional bug — the system still ends up correct, and the no-op Suggested follow-upAfter a successful claim, re-check the runtime level. If the database is already at if (TryClaimLeadership(machineIdentifier))
{
// We won the claim — but the previous leader may have released between our boot-time
// DetermineRuntimeLevel and our claim, in which case the DB is already at the target version.
try { _runtimeState.DetermineRuntimeLevel(); }
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { return false; }
catch (Exception ex) { _logger.LogWarning(ex, "Could not re-determine runtime level after claiming; will proceed as leader."); }
if (_runtimeState.Level == RuntimeLevel.Run)
{
ReleaseLeadership();
_logger.LogInformation("Migrations completed by another server; proceeding as follower.");
return false;
}
_logger.LogInformation("This server claimed migration leadership.");
return true;
}With this in place the second log block above would end with What's confirmed working
|
AndyButland
left a comment
There was a problem hiding this comment.
The updates all look good @nikolajlauridsen and I've completed testing now. I've shared the details above so you can see what I've done to verify, and also if anyone else will look at testing in future.
Other than one possible improvement around the logging it all looks to work as expected. I'll hold the "approve" for you to consider that, but if you feel it's not a concern we could just leave that and move on - it's not critical or necessarily wrong, just I found it a bit confusing when I first saw it.
|
Hey @AndyButland, thank you so much for the thorough review and testing. It's highly appreciated. I've added the fix you suggested. I agree that it's confusing, and this is a much better experience. I've also run the manual tests again and added a test covering it, so I'll go ahead and consider this as approved 👍 |
…meLevel check The winner now calls DetermineRuntimeLevel() once from the post-claim check and must see Upgrading; the loser polls twice before seeing Run. Transition the mock on the second call instead of the first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mbraco#22815) * Add auto upgrade coordination for load balanced setups * Add tests * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix(infrastructure): move TryBecomeLeaderAsync inside try/catch in UnattendedUpgradeBackgroundService Ensures DB exceptions thrown during migration coordination set BootFailed rather than faulting the background service silently. * Fix feedback * Update src/Umbraco.Infrastructure/Install/MigrationCoordinator.cs Co-authored-by: Andy Butland <abutland73@gmail.com> * Recheck state * fix(tests): update concurrent race test for post-claim DetermineRuntimeLevel check The winner now calls DetermineRuntimeLevel() once from the post-claim check and must see Upgrading; the loser polls twice before seeing Run. Transition the mock on the second call instead of the first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andy Butland <abutland73@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…22815) * Add auto upgrade coordination for load balanced setups * Add tests * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix(infrastructure): move TryBecomeLeaderAsync inside try/catch in UnattendedUpgradeBackgroundService Ensures DB exceptions thrown during migration coordination set BootFailed rather than faulting the background service silently. * Fix feedback * Update src/Umbraco.Infrastructure/Install/MigrationCoordinator.cs Co-authored-by: Andy Butland <abutland73@gmail.com> * Recheck state * fix(tests): update concurrent race test for post-claim DetermineRuntimeLevel check The winner now calls DetermineRuntimeLevel() once from the post-claim check and must see Upgrading; the loser polls twice before seeing Run. Transition the mock on the second call instead of the first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andy Butland <abutland73@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
In load-balanced deployments with
UpgradeUnattended: true, every server previously enteredRunMigrationsAsync()concurrently. Schema-altering SQL (ALTER TABLE, CREATE INDEX, etc.) would fail on all-but-one server because the object already existed, causingBootFailedon every server except the first to finish.IMigrationCoordinator/MigrationCoordinator— a claim-based coordinator that uses an entry inumbracoKeyValue(serialised byWriteLock(Constants.Locks.KeyValues)) to elect exactly one migration leader per upgradePostRuntimePremigrationsUpgradeNotification) and component initialisationReleaseLeadership()is called in both thefinallyblock and anApplicationStoppingcallback (belt-and-suspenders for graceful Azure SIGTERM restarts)UnattendedSettings.MigrationClaimTimeout) are taken over automatically as a last resort for hard crashesWriteLockserialisation)Test plan
Add some mock upgrades to UmbracoPlan, PreMigrationPlan, etc...
UpgradeUnattended: trueUmbraco:Hosting:SiteNamein each instance'sappsettings.json— this is whatIMachineInfoFactory.GetMachineIdentifier()uses to distinguish servers on the same machineApplicationStoppingfrom releasing the migration leadership claimRunI tried some different scenarios like: