perf(engine): reduce lock contention in scheduling and hook caches (#5686)#5693
Three hot-path contention points observed in CPU profiles of ~1000-test runs (`Monitor.Enter_Slowpath` at 4.33% exclusive, large `GetQueuedCompletionStatus` idle stalls):

1. `BeforeHookTaskCache.GetOrCreateBeforeClassTask` and `AfterHookPairTracker.GetOrCreateAfterClassTask` guarded all classes behind one shared `Lock`, serialising unrelated classes. Replaced with the existing `ThreadSafeDictionary<,>.GetOrAdd` pattern (already used for `GetOrCreateBeforeAssemblyTask`), which internally uses `Lazy<T>` with `ExecutionAndPublication` to guarantee single execution per key without a shared monitor. Keeps a lock-free `TryGetValue` fast path so the common cache-hit case doesn't allocate a closure.
2. The `TestScheduler.ExecuteTestsAsync` unlimited branch called `Parallel.ForEachAsync` without `MaxDegreeOfParallelism`, which defaults to `ProcessorCount` but still queues near-simultaneous continuations on large test sets, saturating the IOCP thread. An explicit cap at `Environment.ProcessorCount * 2` converges with the already-bounded limited-parallelism path.

The `EventReceiverRegistry` single lock (item 3 in the issue) is left for a follow-up, as it's lower priority and the two changes above cover the bulk of the observed contention.
Up to standards ✅

| Metric | Results |
|---|---|
| Complexity | -7 |
Code Review
Good performance-focused PR addressing real contention hotspots identified via CPU profiling. The hook cache change is solid. One behavioral concern worth discussing before merging.
TestScheduler.cs — Behavioral Regression: Unlimited < Default
Concern: UnlimitedPathMaxDop = ProcessorCount * 2 creates a counterintuitive inversion.
The default path (no --maximum-parallel-tests flag) returns ProcessorCount * 4 (line 492). But a user who explicitly opts for unlimited (--maximum-parallel-tests 0) now gets ProcessorCount * 2 — which is more restrictive than doing nothing. This is a silent regression for anyone who has opted in to unlimited parallelism.
```csharp
// Default path (no flag): ProcessorCount * 4
var defaultLimit = Environment.ProcessorCount * 4;

// Explicit "unlimited" (--maximum-parallel-tests 0): ProcessorCount * 2 ← now more restrictive
private static readonly int UnlimitedPathMaxDop = Environment.ProcessorCount * 2;
```

The intent is correct (prevent IOCP saturation), but the cap should be at least as high as the default (`ProcessorCount * 4`), or ideally configurable. A better approach would be to unify the concept: if the goal is a sensible upper bound for async I/O-bound workloads, apply it consistently in `GetMaxParallelism` rather than as a second cap in `ExecuteTestsAsync`. For instance, mapping `int.MaxValue` to `ProcessorCount * 4` in `GetMaxParallelism` itself would mean users see it in one place and `ExecuteTestsAsync` stays DOP-agnostic.
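A minimal sketch of that unification, with an assumed `GetMaxParallelism` signature (the method name comes from this review; the parameter shape is hypothetical):

```csharp
using System;

static class ParallelismDefaults
{
    // Hypothetical sketch: normalise the DOP in one place so "unlimited"
    // (0 or int.MaxValue) maps to the same async-friendly upper bound as
    // the default path, and ExecuteTestsAsync stays DOP-agnostic.
    public static int GetMaxParallelism(int? requested)
    {
        // Unset, 0, or int.MaxValue all mean "no explicit limit": clamp to
        // the default upper bound so opting into unlimited is never more
        // restrictive than passing no flag at all.
        if (requested is null or 0 or int.MaxValue)
        {
            return Environment.ProcessorCount * 4;
        }

        return requested.Value; // an explicit user value wins unchanged
    }
}
```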
Incorrect comment: The current comment says "Parallel.ForEachAsync defaults to ProcessorCount" — this is wrong. When MaxDegreeOfParallelism is -1 (the .NET default), Parallel.ForEachAsync is truly unbounded, not capped at ProcessorCount. The actual problem is the unbounded depth, which the fix correctly addresses, but the comment should say "truly unlimited" not "defaults to ProcessorCount".
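To illustrate the distinction (not the project's actual code): an explicit `ParallelOptions.MaxDegreeOfParallelism` bounds concurrent body invocations, whereas the `-1` default does not.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class DopDemo
{
    static async Task Main()
    {
        int running = 0, peak = 0;
        int cap = Environment.ProcessorCount * 2;
        var options = new ParallelOptions { MaxDegreeOfParallelism = cap };

        await Parallel.ForEachAsync(Enumerable.Range(0, 500), options, async (i, ct) =>
        {
            // Track the high-water mark of concurrently running bodies.
            int now = Interlocked.Increment(ref running);
            int seen;
            while (now > (seen = Volatile.Read(ref peak)))
            {
                Interlocked.CompareExchange(ref peak, now, seen);
            }

            await Task.Delay(5, ct);
            Interlocked.Decrement(ref running);
        });

        // With the explicit cap, the observed peak never exceeds it; with the
        // -1 default the peak can climb far higher on async-yielding bodies.
        Console.WriteLine(peak <= cap); // True
    }
}
```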
BeforeHookTaskCache / AfterHookPairTracker — Solid Improvement
Removing the shared _classLock and delegating to ThreadSafeDictionary.GetOrAdd is the right call. LazyThreadSafetyMode.ExecutionAndPublication inside ThreadSafeDictionary guarantees single-execution per key without serializing unrelated classes. The pattern is already established for the assembly-level tasks, so this is a welcome consistency fix.
The outer TryGetValue fast path before calling GetOrAdd is justified: ThreadSafeDictionary.GetOrAdd allocates a Lazy<T> instance before calling ConcurrentDictionary.GetOrAdd, so the outer check avoids that allocation plus the closure on the common cache-hit path. The comment accurately explains this.
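The shape of that pattern, sketched against a plain `ConcurrentDictionary` (the real code goes through `ThreadSafeDictionary`; type names here are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

sealed class HookTaskCache
{
    private readonly ConcurrentDictionary<Type, Lazy<Task>> _tasks = new();

    public Task GetOrCreate(Type key, Func<Task> factory)
    {
        // Fast path: the common cache hit allocates neither a Lazy<T>
        // nor a closure.
        if (_tasks.TryGetValue(key, out var existing))
        {
            return existing.Value;
        }

        // Slow path: racing threads may each build a Lazy<Task>, but only
        // one lands in the dictionary, and ExecutionAndPublication ensures
        // the factory runs at most once per key, with no monitor shared
        // across keys.
        return _tasks.GetOrAdd(key,
            _ => new Lazy<Task>(factory, LazyThreadSafetyMode.ExecutionAndPublication)).Value;
    }
}
```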
One minor note: the closure passed to GetOrAdd captures cancellationToken from the calling thread. In a race where two threads both miss the TryGetValue fast path, the winning factory uses whichever thread's token got there first. This mirrors the old lock-based behaviour and is fine — just worth knowing if debugging a cancellation edge case.
The [UnconditionalSuppressMessage("Trimming", "IL2077")] suppressions are valid: the [DynamicallyAccessedMembers] annotation is present at the public method boundary, so the trimmer has the information it needs even though it doesn't flow into the lambda. The justification comments make this clear enough for future maintainers.
Summary
| Area | Assessment |
|---|---|
| Hook caches (Before/After class) | ✅ Correct, clear, consistent with existing patterns |
| TestScheduler UnlimitedPathMaxDop | ⚠️ Discuss: "unlimited" is now more restrictive than the default path |
| AOT / trimming annotations | ✅ Suppressions are valid and well-justified |
The hook cache change can merge as-is. I'd suggest revisiting the DOP cap value (or moving the cap into GetMaxParallelism) before this lands to avoid surprising users who opted into unlimited.
- Align unlimited path with default path: both now cap at ProcessorCount * 4 so opting into unlimited isn't a regression.
- Correct comment about Parallel.ForEachAsync default behaviour.
Follow-up Review
Previous feedback has been fully addressed in the second commit. Here's the updated assessment.
Previous Concerns — All Resolved ✅
"Unlimited < Default" inversion: Fixed. The 0-means-unlimited case now maps to defaultLimit = ProcessorCount * 4 (same as the default path) via the refactored GetMaxParallelism. The separate UnlimitedPathMaxDop = ProcessorCount * 2 constant is gone.
Incorrect Parallel.ForEachAsync comment: Fixed. The comment now correctly states "truly unbounded" rather than "defaults to ProcessorCount".
Architecture After Second Commit
The simplification is good. ExecuteTestsAsync is now:
```csharp
private Task ExecuteTestsAsync(AbstractExecutableTest[] tests, CancellationToken cancellationToken)
    => ExecuteWithGlobalLimitAsync(tests, _maxParallelismSemaphore.Value, cancellationToken);
```

Single code path, no conditional branching, no dead netstandard2.0 fallback path lingering. The `async` keyword removal is also correct — no need for a state machine when you're just forwarding a task.
One Remaining Consideration (Non-blocking)
True "unlimited" is no longer possible. Users who set --maximum-parallel-tests 0 or TUNIT_MAX_PARALLEL_TESTS=0 explicitly to opt into unlimited parallelism (e.g., network-bound test suites with high latency where more concurrency is beneficial) now silently get ProcessorCount * 4. The log message tells them this happened, which is good. But if any user genuinely needs beyond ProcessorCount * 4, the escape hatch is now to pass an explicit large number (e.g., --maximum-parallel-tests 64). Worth a brief mention in release notes or docs, but not a blocker.
The SemaphoreSlim is now always allocated (previously null for the unlimited case). The overhead of WaitAsync/Release at ProcessorCount * 4 slots is negligible in practice — it will rarely block on typical machines — so this trade-off is fine.
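A sketch of what semaphore-gated execution looks like in general (the actual `ExecuteWithGlobalLimitAsync` signature and internals are not shown in this PR; everything here is illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class LimitedRunner
{
    // Illustrative: a shared SemaphoreSlim bounds how many items run at once.
    // A slot is acquired before each item starts and always released after.
    public static async Task ExecuteWithGlobalLimitAsync<T>(
        T[] items,
        SemaphoreSlim limiter,
        Func<T, CancellationToken, Task> runner,
        CancellationToken ct)
    {
        var tasks = new Task[items.Length];
        for (var i = 0; i < items.Length; i++)
        {
            await limiter.WaitAsync(ct); // blocks only when all slots are taken
            var item = items[i];
            tasks[i] = Task.Run(async () =>
            {
                try
                {
                    await runner(item, ct);
                }
                finally
                {
                    limiter.Release(); // return the slot even on failure
                }
            }, ct);
        }

        await Task.WhenAll(tasks);
    }
}
```

With `ProcessorCount * 4` slots, `WaitAsync` completes synchronously in the common case, which is why the always-allocated semaphore costs little.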
Hook Caches — No Changes Needed
The BeforeHookTaskCache and AfterHookPairTracker changes from the first commit are correct and clean. The per-key Lazy pattern via ThreadSafeDictionary is the right approach and consistent with the assembly-level tasks already in the codebase.
Summary
| Area | Status |
|---|---|
| DOP cap inversion (unlimited < default) | ✅ Fixed |
| Incorrect ForEachAsync comment | ✅ Fixed |
| Code path unification | ✅ Cleaner than before |
| Hook caches (Before/After class) | ✅ No change needed |
| True unlimited escape hatch gone | ℹ️ Acknowledged tradeoff, not a blocker |
This is in good shape. The profiling-driven changes are well-motivated and the second commit addressed all the structural concerns cleanly.
Summary
Closes #5686.
CPU profile of ~1000-test runs showed `Monitor.Enter_Slowpath` at 4.33% exclusive and a 48.5% `GetQueuedCompletionStatus` idle cliff. Two structural issues addressed:

1. `BeforeHookTaskCache.GetOrCreateBeforeClassTask` and `AfterHookPairTracker.GetOrCreateAfterClassTask` guarded all classes behind a single `Lock`, serialising unrelated classes. Replaced with the existing `ThreadSafeDictionary<,>.GetOrAdd` pattern (already used for `GetOrCreateBeforeAssemblyTask`), which internally uses `Lazy<T>` with `ExecutionAndPublication` and guarantees single execution per key with no shared monitor. A lock-free `TryGetValue` fast path keeps cache-hit calls closure-allocation-free.
2. `TestScheduler.ExecuteTestsAsync` called `Parallel.ForEachAsync` without `MaxDegreeOfParallelism` on the branch where `--maximum-parallel-tests` is unset. That leaves the loop effectively unbounded, so large test sets queue near-simultaneous continuations and saturate IOCP. Added an explicit `MaxDegreeOfParallelism = Environment.ProcessorCount * 2`, converging with the already-bounded limited path.

Item 3 in the issue (`EventReceiverRegistry` single lock) was left for a follow-up as it's marked lower priority.

Test plan

- `dotnet build TUnit.Engine/TUnit.Engine.csproj` across net8.0/net9.0/net10.0/netstandard2.0: 0 warnings, 0 errors (no trimmer complaints on IL2077 annotation flow through the new closure).
- `TUnit.UnitTests` (net10.0): 180/180 pass.
- `TUnit.Engine.Tests` (net10.0): 245/247 pass; the 2 failures are pre-existing environmental (FSharp/VB integration tests require those sub-projects to be built locally and are unrelated to this change).
- `TUnit.TestProject/HookExecutorTests` (net10.0): 5/5 pass, verifying Before/After class hooks still execute correctly through the new pattern.