Management API: Fix OAuth client registration permanently skipped after transient failure (closes #22356)#22368
Conversation
…r transient failure.
There was a problem hiding this comment.
Pull request overview
This PR hardens BackOfficeAuthorizationInitializationMiddleware to ensure OAuth client registration isn’t permanently skipped after a transient failure during RuntimeLevel.Upgrading, and adds unit tests to prevent regressions.
Changes:
- Release the first-request semaphore via
try/finallyto avoid deadlocks after exceptions. - Only add hosts to
_knownHostsafter successfulEnsureBackOfficeApplicationAsync, so transient failures are retried. - Add unit tests covering retry-on-failure, caching-on-success, semaphore release, and runtime-level guard behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/Umbraco.Cms.Api.Management/Middleware/BackOfficeAuthorizationInitializationMiddleware.cs |
Makes registration retryable after failures and ensures semaphore is always released. |
tests/Umbraco.Tests.UnitTests/Umbraco.Cms.Api.Management/Middleware/BackOfficeAuthorizationInitializationMiddlewareTests.cs |
Adds regression tests for retry, caching, guard clause, and semaphore-release behavior. |
...mbraco.Cms.Api.Management/Middleware/BackOfficeAuthorizationInitializationMiddlewareTests.cs
Outdated
Show resolved
Hide resolved
...mbraco.Cms.Api.Management/Middleware/BackOfficeAuthorizationInitializationMiddlewareTests.cs
Show resolved
Hide resolved
src/Umbraco.Cms.Api.Management/Middleware/BackOfficeAuthorizationInitializationMiddleware.cs
Show resolved
Hide resolved
|
I was seeing this on install when I created a new intense of Umbraco with Clean starter kit installed and run everything at once. it maybe be this brief window is extended when trying to run everything at once; install, migration, site setup. |
Migaroez
left a comment
There was a problem hiding this comment.
Haven't been able to reproduce the race condition outside of forcing it trough code as documented. Will give it a go on a slower pc later today. Either way, I don't think the code changes will do any harm and definetly improve code health and startup behaviour
Description
This PR addresses the report in #22356 of being unable to access the Swagger endpoints before a restart after an unattended install.
I haven't been able to replicate the issue with normal use. That said, analysis has uncovered a couple of issues that could be more defensively coded and might be the cause of finding the problem in practice.
Background
PR #22020 introduced
RuntimeLevel.Upgradingand moved unattended upgrades to a background service. During theUpgradingphase, the HTTP server is up and accepting requests while migrations run concurrently in the background.BackOfficeAuthorizationInitializationMiddlewareregisters OAuth clients on the first backoffice request whenRuntimeLevel >= Upgrade. TheUpgradinglevel (4) satisfies this check, so the middleware attempts registration. However, ifEnsureBackOfficeApplicationAsyncfails during this window — for example due to database contention with the concurrent migration — two bugs in the middleware make the failure permanent (or at least until a restart):Host cached before registration: The host was added to
_knownHostsbefore callingEnsureBackOfficeApplicationAsync. On failure, the host remained cached and all subsequent requests skipped registration.Semaphore leak on failure: The semaphore was released manually without
try/finally. IfEnsureBackOfficeApplicationAsyncthrew, the semaphore was never released, risking deadlocks for new hosts.Theory as to why this isn't always seen
Triggering the issue requires a backoffice request to arrive during the brief
Upgradingwindow while the background migration service holds database resources. On a local dev machine with a fresh install and no external packages, migrations typically complete near-instantly, making the window very small. But it could be hit when if a request is made during boot.After a restart,
RuntimeLevelresolves directly toRunwith no concurrent migrations, so registration succeeds — which is why the restart workaround works.How the fix resolves it
The fix is defensive — even if we can't always reproduce the exact race condition, the middleware should be resilient to transient failures:
_knownHostsis populated only afterEnsureBackOfficeApplicationAsynccompletes without throwing. If it fails, the next request retries.Test plan
Automated
Unit tests added to
BackOfficeAuthorizationInitializationMiddlewareTestsshould pass.Manual
As mentioned, I've not been able to replicate with normal use, but have used this method to deterministically reproduce the race condition: add a temporary debug hack that makes the first
EnsureBackOfficeApplicationAsynccall throw, then test with the old and new middleware code.1. Add debug hack (temporary — revert before merging)
In
src/Umbraco.Cms.Api.Management/Security/BackOfficeApplicationManager.cs, add a fail-once counter at the top ofEnsureBackOfficeApplicationAsync:2a. Confirm the bug (old middleware code on
main)Check out the original
InitializeBackOfficeAuthorizationOnceAsyncfrommainDelete the existing tokens and applications:
Run
dotnet run --project src/Umbraco.Web.UI, navigate tohttps://localhost:44339/umbraco.Simulating transient failure (attempt #1)Try to authorize via the Swagger UI and the result will be:
2b. Confirm the fix (this PR)
Switch to the fixed
InitializeBackOfficeAuthorizationOnceAsyncfrom this branch:Delete the tokens and applications again.
Run
dotnet run --project src/Umbraco.Web.UI, navigate tohttps://localhost:44339/umbraco.Simulating transient failure (attempt #1)thenRegistration attempt #2 — proceeding normally.Try to authorize via the Swagger UI and the result should be successful.
3. Cleanup
Remove the debug code.