
Management API: Fix OAuth client registration permanently skipped after transient failure (closes #22356) #22368

Merged
AndyButland merged 3 commits into main from v17/bugfix/22356-oauth-client-registrations on Apr 9, 2026


@AndyButland AndyButland commented Apr 7, 2026

Description

This PR addresses the report in #22356 of being unable to access the Swagger endpoints before a restart after an unattended install.

I haven't been able to replicate the issue with normal use. That said, analysis has uncovered a couple of issues that could be more defensively coded and might be the cause of the problem seen in practice.

Background

PR #22020 introduced RuntimeLevel.Upgrading and moved unattended upgrades to a background service. During the Upgrading phase, the HTTP server is up and accepting requests while migrations run concurrently in the background.

BackOfficeAuthorizationInitializationMiddleware registers OAuth clients on the first backoffice request when RuntimeLevel >= Upgrade. The Upgrading level (4) satisfies this check, so the middleware attempts registration. However, if EnsureBackOfficeApplicationAsync fails during this window (for example, due to database contention with the concurrent migration), two bugs in the middleware make the failure persist until the application restarts:

  1. Host cached before registration: The host was added to _knownHosts before calling EnsureBackOfficeApplicationAsync. On failure, the host remained cached and all subsequent requests skipped registration.

  2. Semaphore leak on failure: The semaphore was released manually without try/finally. If EnsureBackOfficeApplicationAsync threw, the semaphore was never released, risking deadlocks for new hosts.
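The two failure modes above can be reproduced in isolation. The following is a minimal sketch, not the Umbraco source: LegacyInitializer, its members, and the string host key are hypothetical names, and the register delegate stands in for EnsureBackOfficeApplicationAsync. It assumes the old ordering was "cache the host, register, release" as described in the list.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the pre-fix logic described above (not the real middleware).
public sealed class LegacyInitializer
{
    private readonly ConcurrentDictionary<string, bool> _knownHosts = new();
    private readonly SemaphoreSlim _lock = new(1, 1);
    private readonly Func<Task> _register; // stands in for EnsureBackOfficeApplicationAsync

    public LegacyInitializer(Func<Task> register) => _register = register;

    // Exposed only so the effect of a failure can be observed.
    public bool IsHostCached(string host) => _knownHosts.ContainsKey(host);

    public int FreeSlots => _lock.CurrentCount;

    public async Task InitializeAsync(string host)
    {
        if (_knownHosts.ContainsKey(host))
        {
            // A host cached by a failed attempt lands here on every later request.
            return;
        }

        await _lock.WaitAsync();
        _knownHosts.TryAdd(host, true); // Bug 1: cached before registration has succeeded.
        await _register();              // If this throws...
        _lock.Release();                // Bug 2: ...this line never runs.
    }
}
```

After one failed call, IsHostCached returns true for that host and FreeSlots is 0, so the same host never retries and a different host would wait on the semaphore indefinitely.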

Theory as to why this isn't always seen

Triggering the issue requires a backoffice request to arrive during the brief Upgrading window while the background migration service holds database resources. On a local dev machine with a fresh install and no external packages, migrations typically complete near-instantly, making the window very small. It could still be hit if a request is made during boot.

After a restart, RuntimeLevel resolves directly to Run with no concurrent migrations, so registration succeeds — which is why the restart workaround works.

How the fix resolves it

The fix is defensive — even if we can't always reproduce the exact race condition, the middleware should be resilient to transient failures:

  • Host cached only after success: _knownHosts is populated only after EnsureBackOfficeApplicationAsync completes without throwing. If it fails, the next request retries.
  • Semaphore in try/finally: The semaphore is always released, preventing deadlocks.
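As a rough illustration of the two points above, here is a minimal sketch of the retry-safe pattern. It is not the code in this PR: RetryableInitializer and its members are hypothetical, the register delegate stands in for EnsureBackOfficeApplicationAsync, and the second ContainsKey check inside the lock is an assumption added for the sketch.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the fixed pattern (not the real middleware).
public sealed class RetryableInitializer
{
    private readonly ConcurrentDictionary<string, bool> _knownHosts = new();
    private readonly SemaphoreSlim _lock = new(1, 1);
    private readonly Func<CancellationToken, Task> _register;

    public RetryableInitializer(Func<CancellationToken, Task> register) => _register = register;

    // Exposed only so the caching behaviour can be observed.
    public bool IsHostCached(string host) => _knownHosts.ContainsKey(host);

    public int FreeSlots => _lock.CurrentCount;

    public async Task InitializeAsync(string host, CancellationToken ct = default)
    {
        if (_knownHosts.ContainsKey(host))
        {
            return; // Already registered successfully for this host.
        }

        await _lock.WaitAsync(ct);
        try
        {
            // Assumption: re-check, since another request may have finished while we waited.
            if (_knownHosts.ContainsKey(host))
            {
                return;
            }

            await _register(ct);
            _knownHosts.TryAdd(host, true); // Reached only when registration did not throw.
        }
        finally
        {
            _lock.Release(); // Always runs, so a failure cannot leave the semaphore held.
        }
    }
}
```

With a delegate that throws on its first call and succeeds afterwards, the first InitializeAsync call propagates the exception without caching the host, the second call registers and caches it, and later calls return immediately.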

Test plan

Automated

Unit tests added to BackOfficeAuthorizationInitializationMiddlewareTests should pass.

Manual

As mentioned, I've not been able to replicate with normal use, but have used this method to deterministically reproduce the race condition: add a temporary debug hack that makes the first EnsureBackOfficeApplicationAsync call throw, then test with the old and new middleware code.

1. Add debug hack (temporary — revert before merging)

In src/Umbraco.Cms.Api.Management/Security/BackOfficeApplicationManager.cs, add a fail-once counter at the top of EnsureBackOfficeApplicationAsync:

// DEBUG: Remove before merging.
private static int _debugCallCount;

public async Task EnsureBackOfficeApplicationAsync(
    IEnumerable<Uri> backOfficeHosts, CancellationToken cancellationToken = default)
{
    // --- START DEBUG BLOCK (remove before merging) ---
    if (Interlocked.Increment(ref _debugCallCount) == 1)
    {
        _logger.LogWarning(
            "=== DEBUG: Simulating transient failure (attempt #{Count}) ===",
            _debugCallCount);
        await Task.Delay(100, cancellationToken);
        throw new Exception("DEBUG: Simulated database contention during upgrade");
    }
    _logger.LogWarning(
        "=== DEBUG: Registration attempt #{Count} — proceeding normally ===",
        _debugCallCount);
    // --- END DEBUG BLOCK ---

    // ... rest of method unchanged ...

2a. Confirm the bug (old middleware code on main)

Check out the original InitializeBackOfficeAuthorizationOnceAsync from main.

Delete the existing tokens and applications:

  DELETE FROM umbracoOpenIddictTokens;
  DELETE FROM umbracoOpenIddictAuthorizations;
  DELETE FROM umbracoOpenIddictApplications WHERE ClientId IN ('umbraco-swagger', 'umbraco-postman');

Run dotnet run --project src/Umbraco.Web.UI, navigate to https://localhost:44339/umbraco.

  • Log shows: Simulating transient failure (attempt #1)
  • Refresh — no second log line. The host was cached before the throw, so the middleware skips retry.

Try to authorize via the Swagger UI and the result will be:

error:invalid_request
error_description:The specified 'client_id' is invalid.
error_uri:https://documentation.openiddict.com/errors/ID2052

2b. Confirm the fix (this PR)

Switch to the fixed InitializeBackOfficeAuthorizationOnceAsync from this branch:

Delete the tokens and applications again.

Run dotnet run --project src/Umbraco.Web.UI, navigate to https://localhost:44339/umbraco.

  • Log shows two lines: Simulating transient failure (attempt #1) then Registration attempt #2 — proceeding normally.

Try to authorize via the Swagger UI and the result should be successful.

3. Cleanup

Remove the debug code.

Copilot AI review requested due to automatic review settings April 7, 2026 14:19
@AndyButland AndyButland marked this pull request as draft April 7, 2026 14:21

Copilot AI left a comment


Pull request overview

This PR hardens BackOfficeAuthorizationInitializationMiddleware to ensure OAuth client registration isn’t permanently skipped after a transient failure during RuntimeLevel.Upgrading, and adds unit tests to prevent regressions.

Changes:

  • Release the first-request semaphore via try/finally to avoid deadlocks after exceptions.
  • Only add hosts to _knownHosts after successful EnsureBackOfficeApplicationAsync, so transient failures are retried.
  • Add unit tests covering retry-on-failure, caching-on-success, semaphore release, and runtime-level guard behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
src/Umbraco.Cms.Api.Management/Middleware/BackOfficeAuthorizationInitializationMiddleware.cs | Makes registration retryable after failures and ensures semaphore is always released.
tests/Umbraco.Tests.UnitTests/Umbraco.Cms.Api.Management/Middleware/BackOfficeAuthorizationInitializationMiddlewareTests.cs | Adds regression tests for retry, caching, guard clause, and semaphore-release behavior.

@AndyButland AndyButland marked this pull request as ready for review April 8, 2026 13:50
@hifi-phil

I was seeing this on install when I created a new instance of Umbraco with the Clean starter kit installed and ran everything at once. It may be that this brief window is extended when trying to run everything at once: install, migration, site setup.


@Migaroez Migaroez left a comment


Haven't been able to reproduce the race condition outside of forcing it through code as documented. Will give it a go on a slower PC later today. Either way, I don't think the code changes will do any harm, and they definitely improve code health and startup behaviour.

@AndyButland AndyButland merged commit 8d25312 into main Apr 9, 2026
26 of 27 checks passed
@AndyButland AndyButland deleted the v17/bugfix/22356-oauth-client-registrations branch April 9, 2026 12:00
AndyButland added a commit that referenced this pull request Apr 9, 2026
…er transient failure (closes #22356) (#22368)

* Prevent OAuth client registration from being permanently skipped after transient failure.

* Addressed code review feedback.