Merged
48 commits
73b51bb
Models, service, repository and migration for external members.
AndyButland Mar 14, 2026
bd4823c
Integrate identity for external members in MemberUserStore.
AndyButland Mar 14, 2026
541aaf5
When autolinking external member, skip member type.
AndyButland Mar 14, 2026
08ec405
Populate profile.
AndyButland Mar 14, 2026
1bfde7f
Revoke member tokens for delivery API for external members.
AndyButland Mar 14, 2026
4c1d6cf
Audit notification handling.
AndyButland Mar 14, 2026
9a667c2
Management API updates for external members.
AndyButland Mar 14, 2026
fca3cd2
Added IMemberFilterService for combined member queries from managemen…
AndyButland Mar 14, 2026
0e8f412
Referenced by member controller with external members.
AndyButland Mar 14, 2026
79d1faf
Guard password reset for external members.
AndyButland Mar 14, 2026
bef6b3e
Remove ExternalMemberSettings.
AndyButland Mar 15, 2026
36fac79
Convert between content and external members.
AndyButland Mar 15, 2026
20815a9
Fixed ambiguous constructor.
AndyButland Mar 15, 2026
4bff61c
Update OpenApi.json.
AndyButland Mar 15, 2026
c062064
Update client SDK.
AndyButland Mar 15, 2026
89c7cac
Backoffice ui for external members.
AndyButland Mar 15, 2026
d332e5a
Refactor member collection retrieval to use presentation factory.
AndyButland Mar 16, 2026
683ad94
Merge branch 'main' into v17/feature/12741-lightweight-external-members
AndyButland Mar 16, 2026
a97ae44
Fixes from testing.
AndyButland Mar 16, 2026
1b1e61b
Merge branch 'main' into v17/feature/12741-lightweight-external-members
AndyButland Mar 17, 2026
2cc2b89
Fix icon display on member picker.
AndyButland Mar 17, 2026
c898c83
Add external member support to member picker value converter.
AndyButland Mar 17, 2026
706a773
Delete fix, sync data fix, Examine indexing, member collection defaul…
AndyButland Mar 17, 2026
c0afe24
Add cache refreshers for external members.
AndyButland Mar 17, 2026
9d0f86a
Remove unused "fast path" for just updating login properties.
AndyButland Mar 17, 2026
3b1d1ab
Addressed code review feedback.
AndyButland Mar 17, 2026
dbc5866
Further integration tests.
AndyButland Mar 17, 2026
0ea3943
Fixed failing unit test.
AndyButland Mar 17, 2026
b622cd2
Merge branch 'main' into v17/feature/12741-lightweight-external-members
AndyButland Apr 10, 2026
d851aab
Update typed client.
AndyButland Apr 10, 2026
8c639dc
Addressed code review feedback.
AndyButland Apr 10, 2026
5a20907
Merge branch 'main' into v17/feature/12741-lightweight-external-members
AndyButland Apr 14, 2026
d263fa9
Early return to reduce nesting in ReferencedByMemberController.
AndyButland Apr 14, 2026
0a0b6fc
Introduce MemberPresentationService and MemberReferenceService to mov…
AndyButland Apr 14, 2026
8a1f237
Test for and fix SQLite deadlock related to cross-store uniqueness ch…
AndyButland Apr 14, 2026
ca58635
Additional fix for the "content" member creation.
AndyButland Apr 14, 2026
9cd9c63
Defer external member Examine indexing via the background task queue.
AndyButland Apr 18, 2026
4460bee
Add update date to external member record (aligning with content memb…
AndyButland Apr 18, 2026
4e0fbc4
Add TreatLoginAsMemberUpdate config so member re-index can be skipped…
AndyButland Apr 18, 2026
6b4127c
Merge branch 'main' into v17/feature/12741-lightweight-external-members
AndyButland Apr 19, 2026
f827c8b
Add logging to help verify the indexing path chosen on login and regi…
AndyButland Apr 19, 2026
9798c11
Move ExternalMemberService into Core to align with MemberService.
AndyButland Apr 19, 2026
6c9be83
Fix deserialization issue with Json payloads.
AndyButland Apr 19, 2026
4842884
Display of external member profile data in backoffice.
AndyButland Apr 19, 2026
b4866bf
Merge branch 'main' into v17/feature/12741-lightweight-external-members
AndyButland Apr 19, 2026
adfb4a8
Fixed breaking change.
AndyButland Apr 19, 2026
8fdbce2
Merge branch 'main' into v17/feature/12741-lightweight-external-members
AndyButland Apr 20, 2026
9d3a804
Consider existing behaviour of bumping update date on login to be a b…
AndyButland Apr 20, 2026
4 changes: 3 additions & 1 deletion CLAUDE.md
@@ -227,9 +227,11 @@ Project ownership is distributed across teams. Check individual project director

1. **Layered Architecture with Dependency Inversion**
- Core defines contracts (interfaces)
- Infrastructure implements contracts
- Infrastructure implements contracts that need Infrastructure-owned machinery
- Web/APIs consume implementations via DI

**Where service implementations live**: Services whose dependencies are satisfiable from Core interfaces alone (repositories, scope, config, other Core services) live in `Umbraco.Core/Services/` — this covers the majority of domain services (`MemberService`, `ContentService`, `MediaService`, `ContentTypeService`, `EntityService`, `AuditService`, `ExternalMemberService`, etc.). Service implementations only live in `Umbraco.Infrastructure/Services/Implement/` when they genuinely need Infrastructure concerns — Examine indexes (`ContentSearchService`, `MediaSearchService`, `IndexedEntitySearchService`), log files (`LogViewerRepository`), packaging internals (`PackagingService`), webhook firing (`WebhookFiringService`), distributed-job coordination (`DistributedJobService`). When adding a new service, default to Core and only move to Infrastructure if a concrete dependency forces it.

2. **Interface-First Design**
- All services defined as interfaces in Core
- Enables testing, polymorphism, extensibility
244 changes: 244 additions & 0 deletions research-load-balanced-distributed-jobs.md
@@ -0,0 +1,244 @@
# Research: IDistributedBackgroundJob Write Lock Timeout in Load-Balanced Setup

**Issue**: [#22113](https://github.com/umbraco/Umbraco-CMS/issues/22113)
**Error**: `Failed to acquire write lock for id: -347`
**Lock -347**: `Constants.Locks.DistributedJobs` (all distributed background jobs)

---

## Summary

The root cause is most likely **SQL Server page-level lock contention** on the `umbracoLock` table, caused by long-running content operations (inside the user's distributed job) holding REPEATABLEREAD locks on one row (e.g., `-333` ContentTree) which block write access to *all other rows on the same data page* (including `-347` DistributedJobs).

This is exacerbated by:
1. **Nested scope transaction sharing** - the user's outer scope holds the transaction (and all locks) open for the entire job duration
2. **Small table, single page** - all ~18 lock rows fit on one 8KB SQL Server data page
3. **5-second write lock timeout** - the default is too short when contention exists
4. **Backoffice activity** adding further lock pressure on the same table

---

## Detailed Analysis

### The Lock Table Problem

The `umbracoLock` table has approximately 18 rows (IDs -331 through -348). In SQL Server, a standard data page is 8KB. These 18 small rows (each just `id INT`, `name NVARCHAR`, `value INT`) **all fit on a single data page**.

SQL Server's lock granularity decisions:
- For small tables, the query optimizer may choose **page-level locks** instead of row-level locks
- The `WITH (REPEATABLEREAD)` table hint in the locking SQL means locks are held until the **end of the transaction**
- Without an explicit `ROWLOCK` hint, SQL Server decides the granularity

**Read lock SQL** (from `SqlServerDistributedLockingMechanism.cs:147`):
```sql
SELECT value FROM umbracoLock WITH (REPEATABLEREAD) WHERE id=@id
```

**Write lock SQL** (from `SqlServerDistributedLockingMechanism.cs:182-183`):
```sql
UPDATE umbracoLock WITH (REPEATABLEREAD) SET value = (CASE WHEN (value=1) THEN -1 ELSE 1 END) WHERE id=@id
```

Neither uses a `ROWLOCK` hint, so SQL Server is free to use page-level locking.
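
Whether the table really sits on a single page, and whether page locks are permitted on its indexes at all, can be checked directly. The following is a minimal sketch, assuming the table is named `umbracoLock` in the current database and the caller has `VIEW DATABASE STATE` permission. It only reads catalog and DMV data:

```sql
-- One row per index (or heap) on umbracoLock.
-- in_row_data_page_count = 1 would support the "single data page" claim.
-- allow_page_locks = 1 means SQL Server is permitted to take page locks here.
SELECT i.name AS index_name,
       i.type_desc,
       i.allow_row_locks,
       i.allow_page_locks,
       s.row_count,
       s.in_row_data_page_count
FROM sys.indexes AS i
JOIN sys.dm_db_partition_stats AS s
    ON s.object_id = i.object_id
   AND s.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('umbracoLock');
```

A page count of 1 together with `allow_page_locks = 1` is consistent with the hypothesis but does not prove that page locks are actually being taken. That needs the lock inspection described under Verification Steps.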

### The Reproduction Scenario

Under this hypothesis, the following sequence produces the error:

**Server A** (running the user's distributed job):

1. `DistributedBackgroundJobHostedService` calls `TryTakeRunnableAsync()`
2. `TryTakeRunnableAsync` acquires `EagerWriteLock(-347)`, marks the "Clean Up Your Room" job as running, commits scope, **releases lock -347** -- this is fine
3. The user's `ExecuteAsync()` runs:
```csharp
using ICoreScope scope = _scopeProvider.CreateCoreScope(); // ROOT scope, starts transaction

_contentService.CountChildren(...);   // Creates NESTED scope, acquires ReadLock(-333)
_contentService.RecycleBinSmells();   // Creates NESTED scope, acquires ReadLock(-333)
_contentService.EmptyRecycleBin(...); // Creates NESTED scope, acquires WriteLock(-333)

scope.Complete(); // Transaction commits HERE, all locks released HERE
```

4. **Critical**: All nested scopes share the root scope's database/transaction (confirmed in `Scope.cs:350-360`). The `ReadLock(-333)` acquired by `CountChildren` is held until the ROOT scope disposes. If `EmptyRecycleBin` takes 30+ seconds (many items), the locks on row -333 are held for 30+ seconds.

5. With page-level locking, the shared (S) lock on row -333's **page** also covers row -347. This S lock blocks any exclusive (X) lock requests on the same page.

**Server B** (polling for jobs every 5 seconds):

6. `TryTakeRunnableAsync()` tries `EagerWriteLock(-347)`:
```sql
SET LOCK_TIMEOUT 5000;
UPDATE umbracoLock WITH (REPEATABLEREAD) SET value = ... WHERE id=-347
```
7. This UPDATE needs an exclusive (X) lock on row -347. But the page containing -347 has a shared (S) lock held by Server A's long-running transaction.
8. Server B **blocks for 5 seconds**, then gets SQL error 1222 (lock timeout)
9. This becomes: `DistributedWriteLockTimeoutException` → **"Failed to acquire write lock for id: -347"**
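
The two-server sequence above can be approximated by hand with two query windows against a test database. This is a sketch only: the statements mirror the lock SQL quoted earlier, while the session split and the manual "leave it open" step stand in for Server A's long-running job.

```sql
-- Session 1 (stands in for Server A): take the read lock and keep the transaction open.
BEGIN TRANSACTION;
SELECT value FROM umbracoLock WITH (REPEATABLEREAD) WHERE id = -333;
-- Do NOT commit yet; switch to session 2.

-- Session 2 (stands in for Server B): attempt the write lock with the 5-second timeout.
SET LOCK_TIMEOUT 5000;
BEGIN TRANSACTION;
UPDATE umbracoLock WITH (REPEATABLEREAD)
SET value = (CASE WHEN (value = 1) THEN -1 ELSE 1 END)
WHERE id = -347;
-- Undo the toggle (or clean up after a timeout) so the test leaves no trace.
IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

-- Session 1 again: release the read lock.
IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
```

If session 2 fails with error 1222 after roughly five seconds, the two rows are contending even though they have different IDs. If it completes immediately, SQL Server took row-level locks in that run. That result is not conclusive, because the granularity chosen on an idle test database may differ from what happens under production load.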

### Why Backoffice Login Triggers It

When users log into the backoffice and interact with content:

- **Listing content**: `ContentService.GetById/GetChildren` → `ReadLock(-333)`
- **Saving content**: `ContentService.Save` → `WriteLock(-333)`
- **Deleting content**: `ContentService.Delete/MoveToRecycleBin` → `WriteLock(-333)`
- **Publishing**: `ContentService.Publish` → `WriteLock(-333)`

Each of these acquires locks on the `umbracoLock` table. In load-balanced setups, backoffice web requests on *any server* add page-level lock contention on the same data page as -347. The more backoffice activity, the higher the probability that some transaction is holding a page lock that blocks -347 acquisition.

### Why It "Disables the Server Until Restart"

The `DistributedBackgroundJobHostedService` catches exceptions and continues (line 80). However:

1. Every 5 seconds, `TryTakeRunnableAsync` fails with the lock timeout
2. The error is logged each time, creating a flood of error logs
3. **No distributed jobs run on the affected server** because `TryTakeRunnableAsync` always times out
4. The user's custom job that causes the contention (on the other server) eventually finishes, but by then ongoing backoffice operations may generate enough contention of their own to sustain the problem
5. The server appears "disabled" because its distributed job processing is effectively blocked

The server doesn't truly need a restart to recover, but the sustained contention from backoffice operations can make it *appear* permanently broken. A restart clears all in-flight transactions and ambient scopes, resolving the immediate contention.

---

## Contributing Factors

### 1. No `ROWLOCK` Hint

The distributed locking SQL uses `WITH (REPEATABLEREAD)` but not `WITH (ROWLOCK, REPEATABLEREAD)`. Adding `ROWLOCK` would force SQL Server to use row-level locks, preventing cross-row contention on the same page.

**File**: `src/Umbraco.Cms.Persistence.SqlServer/Services/SqlServerDistributedLockingMechanism.cs`
- Line 147 (read lock): `SELECT value FROM umbracoLock WITH (REPEATABLEREAD) WHERE id=@id`
- Lines 182-183 (write lock): `UPDATE umbracoLock WITH (REPEATABLEREAD) SET value = ... WHERE id=@id`

### 2. Short Default Write Lock Timeout

**File**: `src/Umbraco.Core/Configuration/Models/GlobalSettings.cs`

The default write lock timeout is **5 seconds** (`DistributedLockingWriteLockDefaultTimeout`). In a load-balanced setup with active backoffice use, this is easily exceeded during page-level lock contention.

### 3. User's Outer Scope Prolongs Lock Duration

The user's code wraps multiple ContentService calls in a single scope:

```csharp
using ICoreScope scope = _scopeProvider.CreateCoreScope();
_contentService.CountChildren(...); // ReadLock(-333) acquired, held by root transaction
_contentService.RecycleBinSmells(); // ReadLock(-333)
_contentService.EmptyRecycleBin(...); // WriteLock(-333), potentially slow
scope.Complete(); // ALL locks released here
```

The nested scopes created by ContentService methods all share the root scope's transaction (`Scope.cs:350-360`). This means the ReadLock from `CountChildren` is held for the entire duration of `EmptyRecycleBin`.

### 4. `Task.Run` in User Code

The user wraps their code in `Task.Run()`:
```csharp
public Task ExecuteAsync()
{
    return Task.Run(() => { ... });
}

While this doesn't directly cause the lock issue, `Task.Run` moves execution to a thread pool thread. This is unnecessary (the hosted service already runs on a background thread) and could cause issues with scope ambient context if the async context doesn't flow properly.

---

## Potential Fixes

### Fix 1: Add `ROWLOCK` Hint (Framework Fix - Recommended)

Add `ROWLOCK` to the SQL statements in `SqlServerDistributedLockingMechanism`:

```sql
-- Read lock
SELECT value FROM umbracoLock WITH (ROWLOCK, REPEATABLEREAD) WHERE id=@id

-- Write lock
UPDATE umbracoLock WITH (ROWLOCK, REPEATABLEREAD) SET value = ... WHERE id=@id
```

This forces SQL Server to use row-level locks, preventing cross-row contention within the same page. Row-level locks on id=-333 would NOT block row-level locks on id=-347.

**Impact**: Minimal. Row-level locks cost slightly more memory (lock manager overhead), but the `umbracoLock` table is tiny. An explicit `ROWLOCK` hint is a common choice for small tables whose rows must be lockable independently of one another.

The same fix should also be applied to the EF Core SQL Server locking mechanism:
- `src/Umbraco.Cms.Persistence.EFCore/Locking/SqlServerEFCoreDistributedLockingMechanism.cs`

### Fix 2: Separate Lock Tables (Framework Fix - More Invasive)

Move distributed job locks to a separate table (`umbracoDistributedJobLock`) so they can never share a page with content tree locks. This is more invasive but eliminates the problem entirely regardless of SQL Server lock granularity decisions.

### Fix 3: Increase Write Lock Timeout (User Workaround)

```json
{
  "Umbraco": {
    "CMS": {
      "Global": {
        "DistributedLockingWriteLockDefaultTimeout": "00:00:30"
      }
    }
  }
}
```

Increasing to 30 seconds gives more time for the contending transaction to complete. This is a workaround, not a fix - it trades timeout frequency for longer blocking delays.

### Fix 4: User Code Improvement (User Workaround)

The user should avoid wrapping multiple ContentService calls in a single outer scope. Each ContentService method already manages its own scope:

```csharp
public Task ExecuteAsync()
{
    // NO outer scope needed - each ContentService method creates its own scope
    int numberOfThingsInBin = _contentService.CountChildren(Constants.System.RecycleBinContent);
    _logger.LogInformation("You have {Count} items to clean", numberOfThingsInBin);

    if (_contentService.RecycleBinSmells())
    {
        _contentService.EmptyRecycleBin(userId: -1);
    }

    return Task.CompletedTask;
}

This reduces lock hold duration because each ContentService call acquires and releases its locks independently. The `CountChildren` ReadLock(-333) is released before `EmptyRecycleBin` starts.

Also: remove the `Task.Run` wrapper - it's unnecessary since the hosted service already runs on a background thread.

---

## Key Code References

| File | Purpose |
|------|---------|
| `src/Umbraco.Infrastructure/BackgroundJobs/DistributedBackgroundJobHostedService.cs` | Timer loop, calls TryTake → Execute → Finish |
| `src/Umbraco.Infrastructure/Services/Implement/DistributedJobService.cs` | Acquires WriteLock(-347) in TryTakeRunnableAsync (line 68) and FinishAsync (line 105) |
| `src/Umbraco.Cms.Persistence.SqlServer/Services/SqlServerDistributedLockingMechanism.cs` | SQL Server lock SQL (lines 147, 182-183) - missing ROWLOCK hint |
| `src/Umbraco.Core/Persistence/Constants-Locks.cs` | Lock ID definitions (-331 through -348) |
| `src/Umbraco.Infrastructure/Scoping/Scope.cs:350-360` | Nested scopes share parent's Database/transaction |
| `src/Umbraco.Core/Services/ContentService.cs` | EmptyRecycleBin acquires WriteLock(-333), CountChildren/RecycleBinSmells acquire ReadLock(-333) |
| `src/Umbraco.Core/Configuration/Models/GlobalSettings.cs` | Default lock timeout: 5 seconds for writes |

---

## Verification Steps

To confirm this hypothesis:

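1. **Inspect locks with `sys.dm_tran_locks`**: During reproduction, list the locks held on the `umbracoLock` table. For `PAGE`, `KEY`, and `RID` resources, `resource_associated_entity_id` holds the HoBT ID rather than the object ID, so the query joins to `sys.partitions` to include them:
```sql
SELECT l.*
FROM sys.dm_tran_locks AS l
LEFT JOIN sys.partitions AS p ON p.hobt_id = l.resource_associated_entity_id
WHERE l.resource_database_id = DB_ID()
  AND (l.resource_associated_entity_id = OBJECT_ID('umbracoLock') -- OBJECT-level locks
       OR p.object_id = OBJECT_ID('umbracoLock'))                 -- PAGE/KEY/RID locks (keyed by hobt_id)
ORDER BY l.request_mode, l.resource_type
```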

2. **Check lock granularity**: Look for `resource_type = 'PAGE'` entries, which would confirm page-level locking.

3. **Test with ROWLOCK**: Temporarily modify the SQL to include `ROWLOCK` hint and verify the issue disappears.

4. **Test without outer scope**: Have the user remove the wrapping `CreateCoreScope()` call and verify the issue is mitigated (shorter individual lock durations).
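
For step 3, one way to test the page-lock hypothesis without touching the framework's SQL might be to disallow page locks on the table's indexes temporarily. This is a sketch for a test environment only, and it assumes `umbracoLock` has at least one index (for example its primary key). It is a diagnostic aid, not one of the fixes proposed in this document.

```sql
-- Temporarily prevent SQL Server from taking page-level locks on umbracoLock.
ALTER INDEX ALL ON umbracoLock SET (ALLOW_PAGE_LOCKS = OFF);

-- ... reproduce the scenario and observe whether the -347 timeouts stop ...

-- Restore the default afterwards.
ALTER INDEX ALL ON umbracoLock SET (ALLOW_PAGE_LOCKS = ON);
```

If the timeouts disappear while page locks are disallowed and return once they are re-enabled, that points to page-level contention as the cause and suggests the `ROWLOCK` hint would have the same effect.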