Not up to standards ⛔🔴

Issues
| Category | Results |
|---|---|
| Code Style | 2 minor |

Metrics 🟢
| Metric | Results |
|---|---|
| Complexity | 7 |
Review: fix(engine): allow keyed NotInParallel tests to run alongside unconstrained tests (#5700)
The fix is correct and well-motivated. The scheduler previously treated Parallel and KeyedNotInParallel as sequential phases, which was an observable bug: an unconstrained [Test] would always drain fully before any [NotInParallel("key")] test started, even though the keyed constraint only governs tests sharing that key. The regression test clearly demonstrates the failure, and the wall-clock evidence (6.4s → 4.3s) is a nice bonus.
A few observations worth discussing:
1. Timing-based regression test is fragile
The Repro5700 test asserts _test1ObservedTest2 > 0 by relying on Test1 running for 3 s (60 × 50 ms) while Test2 runs twice in serial (2 × 1 s = 2 s). This works on a normally loaded machine, but:
- On a heavily loaded CI runner, a slow OS scheduler might delay Test2's first invocation past Test1's completion, making the assertion flaky.
- The test tracks a counter (`_test1ObservedTest2 > 0`) rather than a flag, which accumulates noise and doesn't add much beyond a simple `bool _overlapped = false`.
A more robust approach would be to use a SemaphoreSlim or TaskCompletionSource as a rendezvous: Test1 sets a signal once it is active, Test2 waits for it before sleeping, or vice-versa — proving overlap without depending on wall-clock timing at all. The static fields also need volatile or to be accessed exclusively through Interlocked to avoid data races on _test2ConcurrentViolations and _test1ObservedTest2 (they are read without Volatile.Read in AssertParallelismShape, though on current x86/ARM the memory model would likely not bite you in practice).
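As a hedged sketch of that rendezvous idea (assuming .NET 6+ `Task.WaitAsync` and TUnit-style attributes; class, field, and method names here are illustrative, not the PR's actual code), each side could signal its own `TaskCompletionSource` and await the other's:

```csharp
// Illustrative rendezvous: overlap is proven by mutual signalling,
// with no dependence on wall-clock timing. All names are hypothetical.
public class RendezvousRepro
{
    private static readonly TaskCompletionSource Test1Live =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
    private static readonly TaskCompletionSource Test2Live =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    [Test]
    public async Task Test1_Unconstrained()
    {
        Test1Live.TrySetResult();                                 // "I am active"
        // Times out only if the scheduler never overlaps us with Test2.
        await Test2Live.Task.WaitAsync(TimeSpan.FromSeconds(30));
    }

    [Test, NotInParallel("key")]
    public async Task Test2_Keyed()
    {
        Test2Live.TrySetResult();
        await Test1Live.Task.WaitAsync(TimeSpan.FromSeconds(30));
    }
}
```

Under the broken sequential scheduler, Test1 would drain first and time out waiting for Test2's signal, so the failure mode is deterministic rather than load-dependent.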
2. concurrentPhases.ToArray() allocation inside the hot path
WaitForTasksWithFailFastHandling already accepts IEnumerable<Task>, so the ToArray() call is only needed because the signature forces it. A small nit, but since this is called every time tests are scheduled you could either:
// Option A: pass the list directly
await WaitForTasksWithFailFastHandling(concurrentPhases, cancellationToken).ConfigureAwait(false);
…by changing the signature to IEnumerable<Task>, or
// Option B: use a fixed-size span/array from the start given the capacity is known (at most 2)
Task? parallelTask = null, keyedTask = null;
…and construct only what you need. The List<Task>(2) + ToArray() dance allocates twice for a maximum of two elements. A pair of nullable Task? fields would be zero-allocation here.
3. Asymmetric concurrentPhases vs tasks naming
The ConstrainedParallelGroups loop (line 226–237) uses the same pattern with a local tasks list and ToArray(). Now that ExecuteGroupedTestsAsync has two separate code sites doing the same thing, this is a good moment to extract a small helper:
private async Task RunConcurrentlyAsync(Task? a, Task? b, CancellationToken ct)
{
if (a is null && b is null) return;
if (a is null) { await b!.ConfigureAwait(false); return; }
if (b is null) { await a.ConfigureAwait(false); return; }
await WaitForTasksWithFailFastHandling([a, b], ct).ConfigureAwait(false);
}
This removes the List<Task> allocations at both call sites and makes the intent obvious.
4. Missing coverage for the symmetric case
The PR tests that an unconstrained test (Parallel bucket) can run alongside a KeyedNotInParallel test. But the symmetric scenario — a KeyedNotInParallel test from key A running alongside a KeyedNotInParallel test from key B — is not covered. Both end up in the same concurrentPhases bucket and are dispatched together to ConstraintKeyScheduler, which serializes only within the same key. A test asserting that two differently-keyed [NotInParallel] tests overlap each other would round out the coverage.
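A sketch of that symmetric coverage (hypothetical names; same rendezvous pattern, assuming TUnit-style attributes) could look like:

```csharp
// Two differently-keyed [NotInParallel] tests must overlap: each signals
// its own TaskCompletionSource and awaits the other's. Names are illustrative.
public class CrossKeyOverlapRepro
{
    private static readonly TaskCompletionSource KeyALive =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
    private static readonly TaskCompletionSource KeyBLive =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    [Test, NotInParallel("A")]
    public async Task KeyA_Overlaps_KeyB()
    {
        KeyALive.TrySetResult();
        await KeyBLive.Task.WaitAsync(TimeSpan.FromSeconds(30));
    }

    [Test, NotInParallel("B")]
    public async Task KeyB_Overlaps_KeyA()
    {
        KeyBLive.TrySetResult();
        await KeyALive.Task.WaitAsync(TimeSpan.FromSeconds(30));
    }
}
```

If ConstraintKeyScheduler ever serialized across keys, both tests would deadlock against each other's signal and time out, making the regression visible immediately.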
5. _test1Active field is tracked but never asserted
_test1Active is incremented/decremented in Test1_RunsAlongsideKeyedTest2 but is not checked in AssertParallelismShape. If the field was meant to be a guard (e.g., to detect if Test1 exited early), it should be asserted; otherwise it can be removed to reduce noise.
Summary
The core scheduling change is correct and the trade-offs (ParallelGroups still serialize cross-group, global NotInParallel still drains last) are sound. The main concerns are the fragility of the timing-based regression test and two small allocation opportunities in the hot path. None of these block the fix, but the test fragility in particular is worth addressing before this lands on a slow CI machine.
Code Review: fix(engine): allow keyed NotInParallel tests to run alongside unconstrained tests
Summary
The fix is correct and well-motivated. Moving KeyedNotInParallel execution to run concurrently with the Parallel bucket (via Task.WhenAll through WaitForTasksWithFailFastHandling) properly reflects the semantics: a [NotInParallel("key")] test only conflicts with other tests sharing that key, so it should not be blocked by unrelated unconstrained tests draining first.
The two concurrent tasks share no mutable state — ExecuteTestsAsync uses Parallel.ForEachAsync with a global DOP cap, while ConstraintKeyScheduler.ExecuteTestsWithConstraintsAsync protects its own lockedKeys/waitingTestIndex under its own lockObject. The .AsTask() call on the returned ValueTask is also correct (required before adding to the task list).
Issues
1. Repro5700 is timing-sensitive — add [Retry(3)] (HIGH)
Test1 polls _test2Active every 50 ms for 3 seconds. The assertion _test1ObservedTest2 > 0 relies on Test2 starting its first invocation while Test1 is still running. On a loaded CI runner this margin could be violated. Looking at KeyedNotInParallelTests and NotInParallelMixedTests in the project — both carry [Retry(3)] for exactly this reason. Repro5700 has no [Retry] guard, making it a CI flakiness risk.
Better approach: Add [Retry(3)] at minimum. A structurally correct alternative is a rendezvous mechanism (SemaphoreSlim, ManualResetEventSlim, or TaskCompletionSource) so Test1 signals it is active and Test2 waits for that signal, proving overlap deterministically rather than by wall-clock coincidence.
2. Inconsistent Volatile.Read in AssertParallelismShape (MEDIUM)
The test carefully uses Volatile.Read(ref _test2Active) inside the polling loop (line 35), but reads _test1ObservedTest2 (line 68) and _test2ConcurrentViolations (line 67) in the [After(Class)] hook without Volatile.Read. On the .NET memory model the [After(Class)] happens-after all test tasks, so this is safe in practice, but it is formally inconsistent with the rest of the class. Worth fixing for correctness-by-inspection:
Assert.That(Volatile.Read(ref _test2ConcurrentViolations), Is.EqualTo(0), ...);
Assert.That(Volatile.Read(ref _test1ObservedTest2), Is.GreaterThan(0), ...);
3. Pre-existing tasks.ToArray() inconsistency at line 237 (LOW)
The new code at line ~198 correctly passes concurrentPhases (a List<Task>) directly to WaitForTasksWithFailFastHandling(IEnumerable<Task>, ...) — no unnecessary allocation. However, the unchanged ConstrainedParallelGroups block at line ~237 still does tasks.ToArray() before passing to the same method. The PR is a good opportunity to fix this asymmetry and remove the redundant copy:
// Before
await WaitForTasksWithFailFastHandling(tasks.ToArray(), cancellationToken);
// After
await WaitForTasksWithFailFastHandling(tasks, cancellationToken);
4. Missing cross-key [NotInParallel] concurrency test (LOW)
Repro5700 validates Parallel ↔ KeyedNotInParallel overlap. It does not verify that [NotInParallel("A")] tests run concurrently with [NotInParallel("B")] tests (different keys, no shared constraint). This is logically guaranteed by ConstraintKeyScheduler's design, but a targeted test would complete the cross-bucket regression coverage introduced by this PR.
5. Dynamic queue gap (pre-existing, worth noting)
ProcessDynamicTestQueueAsync handles Parallel and NotInParallel dynamic tests but silently ignores groupedDynamicTests.KeyedNotInParallel. This predates this PR, but since this PR introduces the concurrent-phase model for KeyedNotInParallel, it is a good time to flag this for a follow-up issue.
Previous Review
The prior claude[bot] review raised 5 observations. None have been addressed yet (same commit f2f6147). Note that observation #2 in that review (about ToArray() allocation) was incorrect as applied to the new code — the PR correctly avoids ToArray() in the new path.
Verdict
The core fix is correct and safe to merge. The primary blocker before landing is adding [Retry(3)] to Repro5700 to match the established pattern for timing-sensitive scheduler tests in this project. The other items are lower priority but worth addressing in this PR or in a follow-up.
Code Review: fix(engine): allow keyed NotInParallel tests to run alongside unconstrained tests (#5700)
Summary
The fix is correct, minimal, and well-tested. The root cause — sequential phase draining in ExecuteGroupedTestsAsync — is addressed cleanly by firing both the unconstrained and keyed-NotInParallel tasks concurrently and awaiting them together.
What's Good
Correct diagnosis and fix. The bug was that ExecuteGroupedTestsAsync fully awaited the Parallel bucket before starting KeyedNotInParallel. The refactor to RunPhasesConcurrentlyAsync is the minimal, right-sized fix.
Excellent test design. Using TaskCompletionSource rendezvous patterns in Repro5700 and CrossKeyOverlap5700 instead of wall-clock sleeps makes the tests deterministic on slow CI. Both tests fail fast and deterministically on the broken scheduler. This is exactly how concurrent scheduling tests should be written.
The RunPhasesConcurrentlyAsync helper is clean. The null-guard short-circuit paths avoid unnecessary Task.WhenAll allocation when one bucket is empty — a minor but correct performance consideration.
Issues Worth Addressing
1. Dynamic tests with KeyedNotInParallel are silently dropped — ProcessDynamicTestQueueAsync
File: TestScheduler.cs lines 274–284 and 305–313
ProcessDynamicTestQueueAsync only handles groupedDynamicTests.Parallel and groupedDynamicTests.NotInParallel:
if (groupedDynamicTests.Parallel.Length > 0)
await ExecuteTestsAsync(groupedDynamicTests.Parallel, cancellationToken).ConfigureAwait(false);
if (groupedDynamicTests.NotInParallel.Length > 0)
await ExecuteSequentiallyAsync(groupedDynamicTests.NotInParallel, cancellationToken).ConfigureAwait(false);
There's no handling for groupedDynamicTests.KeyedNotInParallel. Any dynamically-generated test with [NotInParallel("key")] would be silently skipped — it won't execute at all.
This is a pre-existing bug, not introduced by this PR. But since the PR is auditing the scheduling phases, it's worth surfacing: ProcessDynamicTestQueueAsync should either add the missing KeyedNotInParallel handling (mirroring ExecuteGroupedTestsAsync) or add a comment acknowledging the limitation.
Suggestion:
if (groupedDynamicTests.KeyedNotInParallel.Length > 0)
{
await _constraintKeyScheduler
.ExecuteTestsWithConstraintsAsync(groupedDynamicTests.KeyedNotInParallel, cancellationToken)
.ConfigureAwait(false);
}
2. RunPhasesConcurrentlyAsync — misleading cancellationToken parameter
File: TestScheduler.cs lines 396–401
private Task RunPhasesConcurrentlyAsync(Task? a, Task? b, CancellationToken cancellationToken)
{
if (a is null) return b ?? Task.CompletedTask;
if (b is null) return a;
return WaitForTasksWithFailFastHandling([a, b], cancellationToken);
}
The fast-path returns (a is null / b is null) don't use cancellationToken at all — the token is already baked into the tasks by the time they arrive. Inside WaitForTasksWithFailFastHandling, the token is only checked after an exception to detect fail-fast mode, not to cancel the wait itself.
This is functionally fine, but the signature implies the token can influence what happens — it can't. Either drop the parameter (and pass the token only into WaitForTasksWithFailFastHandling at the call site), or add a brief comment explaining why it's only used for fail-fast detection:
// cancellationToken is checked only for fail-fast exception propagation inside
// WaitForTasksWithFailFastHandling; tasks are already started and carry their own token.3. ParallelGroups sequential loop — worth documenting
File: TestScheduler.cs lines 198–215
The ParallelGroups loop still processes each group sequentially (await ExecuteTestsAsync(...)). This is intentional (cross-group exclusion), but the intent differs from the two newly-concurrent phases above it. A short comment here would prevent a future reader from assuming it's another instance of the pre-fix bug:
// Each parallel group runs to completion before the next — tests within a group
// are bucketed together precisely because they must not overlap.
foreach (var group in groupedTests.ParallelGroups) { ... }
Minor Observation
CloudShop timeout bump (25s → 60s) — unrelated to the scheduling fix. Bumping timeouts to suppress flakiness without understanding the root cause tends to mask real problems. If Step3_Worker_Fulfills_Order was intermittently hitting 25s, that may indicate an ordering/async issue in the test itself rather than a slow CI box. Worth investigating separately rather than silently widening the window.
Verdict
The core fix is solid. The two real concerns are:
- ProcessDynamicTestQueueAsync silently drops KeyedNotInParallel dynamic tests (pre-existing gap, worth fixing in a follow-up if not now).
- RunPhasesConcurrentlyAsync CancellationToken semantics are slightly misleading (low risk, easy to clarify with a comment).
The rendezvous-based tests are the right way to validate concurrent scheduling behavior. Approved conceptually — addressing item 1 would make this a complete fix rather than a partial one.
Address PR #5740 review item: ProcessDynamicTestQueueAsync only handled Parallel and global NotInParallel — dynamically-added tests with [NotInParallel(key)], [ParallelGroup(...)], or constrained-parallel-group attributes were silently dropped. Extract ExecuteAllPhasesAsync from ExecuteGroupedTestsAsync and call it from both the main session path and the dynamic batch loop, so any future bucket added to GroupedTests is automatically respected on both paths. Also: clarify cancellationToken role in RunPhasesConcurrentlyAsync and document the intentional sequential ParallelGroups loop.
Code Review — Round 4 (commit 749b58b)
What's Been Fixed (Previous Reviews Addressed)
All major concerns from the three prior claude[bot] reviews have been resolved:
- ✅ Timing-based test fragility — `Repro5700` now uses `TaskCompletionSource` rendezvous instead of wall-clock polling. This is the correct approach.
- ✅ Missing cross-key coverage — `CrossKeyOverlap5700` covers the symmetric `[NotInParallel("A")]` ↔ `[NotInParallel("B")]` scenario.
- ✅ Dynamic queue gap — `ExecuteDynamicBatchAsync` routes dynamic tests through `ExecuteAllPhasesAsync`, so dynamically-generated `[NotInParallel("key")]` tests now execute correctly instead of being silently dropped.
- ✅ `List<Task>` + `ToArray()` allocation — eliminated at both call sites by `RunPhasesConcurrentlyAsync`.
- ✅ `RunPhasesConcurrentlyAsync` cancellation token comment — added; explains why the fast-path returns don't inspect the token.
- ✅ `ParallelGroups` sequential-by-design comment — added.
Remaining Concern
CloudShop timeout bump (25s → 60s) — no root cause identified
File: examples/CloudShop/CloudShop.Tests/Tests/Orders/OrderWorkflowTests.cs
The comment says "Generous timeout so the assertion doesn't trip on a busy CI runner where RabbitMQ delivery + Worker processing can lag behind under concurrent load." But the scheduling change here introduces more concurrency, not less — unconstrained and keyed-NotInParallel tests now run simultaneously where they previously ran sequentially. If Step3_Worker_Fulfills_Order is in a group with tests that now run concurrently against it, that could explain a real regression; but the comment doesn't establish that causal chain.
Three possible cases:
- Pre-existing flakiness, unrelated to this PR — the bump is papering over a known fragility.
- Real regression — the concurrent phase execution causes resource contention in CI (e.g., shared Docker/RabbitMQ) that inflates processing time.
- Coincidence — slow CI run happened to reveal an edge case during review.
Why this matters: masking CI flakiness with a generous timeout is a pattern that tends to compound. 60s for a polling assertion is very wide. If this is a real regression from the scheduling change, it should be understood and addressed. If it's pre-existing, it should be filed as a separate issue rather than silently widened.
Suggestion: Either trace why this test now needs 60s (and document it in the PR), or revert the timeout and open a separate issue for the flakiness. A comment in the test explaining the failure mode would help future readers.
Overall Assessment
The core scheduling fix is correct, well-reasoned, and now backed by deterministic regression tests — the right shape for a scheduling correctness bug. The extraction of ExecuteAllPhasesAsync and ExecuteDynamicBatchAsync makes the phase pipeline reusable for dynamic tests without duplication. The RunPhasesConcurrentlyAsync helper is clean and allocation-conscious.
The CloudShop timeout is the one unresolved item. It's not a blocker if the cause is understood and accepted, but it should be explicitly acknowledged rather than left ambiguous.
OptOut_DoesNotTag_AspNetCoreSpans regressed under #5740's keyed/Parallel overlap. Each TracerProvider's AddAspNetCoreInstrumentation subscribes to the process-global Microsoft.AspNetCore ActivitySource, so a parallel auto-OTel WebApplicationTest's correlation processor stamps spans the OptOut exporter then captures. Keyed serialization could only enumerate sibling auto-wire tests; switch OptOut to global [NotInParallel] until per-factory filtering lands. Also bump CloudShop Step3 timeout to 120s — Worker fulfillment via RabbitMQ lags further under the new keyed/Parallel concurrency.
Code Review — Round 5 (commit efbf7b1)
Previous Review Items — All Resolved
All concerns from rounds 1–4 have been addressed:
- ✅ Timing-based test fragility — `Repro5700` uses `TaskCompletionSource` rendezvous. Deterministic.
- ✅ Missing cross-key coverage — `CrossKeyOverlap5700` covers `[NotInParallel("A")]` ↔ `[NotInParallel("B")]` overlap.
- ✅ Dynamic queue gap — `ExecuteDynamicBatchAsync` routes dynamic tests through `ExecuteAllPhasesAsync`, so keyed dynamic tests no longer silently drop.
- ✅ `List<Task>` + `ToArray()` allocation — eliminated by `RunPhasesConcurrentlyAsync`.
- ✅ CancellationToken semantics comment — added and explains the fast-path correctly.
- ✅ `ParallelGroups` sequential-by-design comment — added.
New in this commit (efbf7b1)
OTel Isolation Redesign — Correct
The two OTel test classes have asymmetric assertion semantics:
- `AutoConfigureOpenTelemetryTests` asserts tag presence — a foreign span that also carries a `TestId` tag does not produce a false negative. No isolation needed. ✅
- `AutoConfigureOpenTelemetryOptOutTests` asserts tag absence — a parallel factory's correlation processor stamps spans before they reach this exporter, producing false negatives. Global `[NotInParallel]` is the right constraint until per-factory filtering lands. ✅
The docstrings explain both decisions clearly. The global drain for OptOut is a performance tax but unavoidable given the process-global ActivitySource architecture. Good call.
CloudShop Timeout: 25s → 60s → 120s — Still Unresolved (and getting worse)
File: examples/CloudShop/CloudShop.Tests/Tests/Orders/OrderWorkflowTests.cs
The commit message at least names the cause: "Worker fulfillment via RabbitMQ lags further under the new keyed/Parallel concurrency." The causal chain is now documented, which is progress from round 4.
However, the symptom is worsening: the timeout has inflated 4.8× across this PR. A 120-second polling window for a RabbitMQ delivery + Worker processing cycle is a strong signal that the increased concurrency is saturating shared CI resources (Docker, RabbitMQ throughput, CPU). This has two practical risks:
- CI suite wall-clock time — a test that can legitimately wait 120 seconds adds 2 minutes to every CI run where the slow path is hit, multiplied across parallel workers.
- Masked regression — if the test regresses for a real reason (e.g., a broken message handler), the first symptom is a 120-second hang rather than a quick fail.
Better approach: Investigate whether the CloudShop integration tests should be isolated from the scheduler-level concurrency increase. Options:
- Add `[NotInParallel]` (or a shared key) to the CloudShop test class so it does not compete with CPU/IO-heavy suite tests.
- Run the CloudShop E2E suite as a separate CI step with its own resource budget.
- At minimum, document in the test file that the 120s is a scheduling-contention accommodation and open a tracking issue.
120 seconds is too wide to leave undocumented. The current comment only says "busy CI runner" — it should name the root cause (keyed/Parallel concurrency increase) and link the PR.
Verdict
The scheduling fix is correct, well-tested, and all structural concerns from prior reviews are resolved. The OTel isolation logic is sound. The one remaining concern is the CloudShop timeout, which has been widened further rather than root-caused. It is not a correctness blocker, but I'd recommend either isolating the CloudShop tests from the concurrent scheduler or opening a tracking issue before this merges so the timeout doesn't silently grow further.
Force-pushed efbf7b1 to 8ee609f.
Code Review — Round 6 (commit efbf7b1 → current)
All concerns from rounds 1–5 have been addressed. This is a well-executed fix.
Core fix (TestScheduler.cs) — ✅
The architectural approach is correct: Parallel and KeyedNotInParallel buckets now start concurrently, ParallelGroups still serialize cross-group, and global NotInParallel still drains last. The phase ordering matches the documented semantics in parallelism.md.
RunPhasesConcurrentlyAsync is clean — null-guard short-circuits avoid allocating a Task[] when one bucket is empty, and the token-forwarding comment accurately describes why the parameter exists. No issues here.
Dynamic test routing (ExecuteDynamicBatchAsync) — ✅
Routing dynamic tests through ExecuteAllPhasesAsync was the right call. The pre-existing silent-drop for KeyedNotInParallel dynamic tests is now fixed as a side-effect. The recursive structure is safe: nested ExecuteAllPhasesAsync calls do not touch _dynamicTestQueue.Complete(), so the queue lifecycle remains owned by ExecuteGroupedTestsAsync.
Tests (Repro5700, CrossKeyOverlap5700) — ✅
Both rendezvous tests are deterministic and will fail fast on the bug rather than depending on wall-clock timing. AssertKeyedSerialization correctly verifies the constraint still serializes within its key. The engine-level wrapper tests in KeyedNotInParallelCrossPhaseTests give CI-level regression coverage.
OTel isolation redesign — ✅
The asymmetric handling (presence assertion vs. absence assertion) is architecturally correct. Removing the keyed constraint from AutoConfigureOpenTelemetryTests and promoting AutoConfigureOpenTelemetryOptOutTests to global [NotInParallel] is the right trade-off given the process-global ActivityListener model.
CloudShop timeout (25 s → 120 s) — remaining concern
The comment now explains the proximate cause (RabbitMQ delivery + Worker processing lag under increased concurrent load). That is appreciated. However, a 4.8× timeout inflation is still a yellow flag worth tracking:
- The increased concurrency from this scheduler fix is by design — CloudShop's integration tests are now exposed to more parallelism. If 120 s is needed to survive that, it suggests either (a) the Worker/RabbitMQ stack has a saturation point that CI regularly hits, or (b) the test and the service under test were implicitly depending on the old serialized scheduler.
- A test with a 120 s ceiling that times out on CI doesn't give a useful signal — it just eats 2 minutes before failing.
Suggestion: open a follow-up issue (or add a // TODO(#XXXX) to the comment) to investigate whether Step3_Worker_Fulfills_Order should be isolated from the concurrent scheduler (e.g. via [NotInParallel] on that test or its class) or whether the Worker/RabbitMQ health-check polling interval should be tightened. The comment as-is explains the symptom but doesn't give the next person a path forward.
This is a low-priority blocker — the fix itself is correct and safe to merge. The CloudShop timeout is a quality-of-life follow-up, not a correctness issue.
Verdict: ✅ Ready to merge, with the CloudShop timeout tracked as a follow-up.
…rently Discussion #5700: with `[Test] T1` and `[Test, NotInParallel("k")] T2`, T1 was expected to run alongside T2 (per docs/execution/parallelism.md: "Tests with no common key may still run concurrently"), but in practice T1 always finished before T2 started. Cause: TestScheduler.ExecuteGroupedTestsAsync awaited each grouping bucket in sequence. The Parallel bucket (no constraints) drained fully before the KeyedNotInParallel bucket began, so unconstrained tests never overlapped keyed ones. Run those two buckets concurrently — they share no synchronization state. ParallelGroups still serialize (cross-group exclusion) and the global NotInParallel bucket still drains last (must run alone). Adds a regression test (Bugs/5700) that asserts Test1 observes Test2 active during its execution, plus an engine-level wrapper test.
…overage
- Repro5700 now uses TaskCompletionSource rendezvous instead of a 3s wall-clock sample window so the assertion is deterministic on slow CI runners. Test1 signals once it is live and waits for any Test2 invocation; each Test2 invocation does the inverse. Either side timing out flags the bug.
- New CrossKeyOverlap5700 fixture covers the symmetric scenario (two [NotInParallel] tests with different keys must overlap).
- Extract `RunPhasesConcurrentlyAsync(Task?, Task?, CancellationToken)` helper in TestScheduler. Unifies the new Parallel + KeyedNotInParallel call site with the existing ConstrainedParallelGroups loop and removes the per-phase `List<Task>` + `ToArray()` dance (zero allocation in the common single-phase case).
- Bump CloudShop OrderWorkflow Step3 WaitsFor timeout from 25s → 60s. The worker eventual-consistency window was already borderline; under the new cross-phase concurrency the RabbitMQ + worker pipeline shares CPU with unconstrained tests and 25s no longer fits a busy CI runner.
Force-pushed 8ee609f to 670d82c.
Round 7 — Approval
All concerns from rounds 1–6 are resolved. The implementation is correct, well-tested, and production-ready.
What's right
Core fix (TestScheduler.cs) — Clean and correct. ExecuteAllPhasesAsync properly runs the Parallel and KeyedNotInParallel buckets concurrently while keeping ParallelGroups cross-group serialization and global NotInParallel drain-last semantics intact. The code comment precisely documents why sequential ordering was the bug (#5700). ✓
RunPhasesConcurrentlyAsync — Null-guard short-circuits avoid Task[] allocation on common single-phase paths. The comment explaining the CancellationToken forwarding intent is exactly the kind of non-obvious note worth keeping. ✓
ExecuteDynamicBatchAsync — Routing dynamic tests through ExecuteAllPhasesAsync instead of two bare if-blocks fixes the pre-existing silent-drop for [NotInParallel(key)] dynamically-generated tests. Both call sites now share one implementation. ✓
Regression tests — Repro5700 uses a TaskCompletionSource rendezvous (no wall-clock polling), CrossKeyOverlap5700 covers the symmetric different-keys scenario, and KeyedNotInParallelCrossPhaseTests wraps both at the engine level with 30-second hard timeouts. Deterministic and correct. ✓
OTel isolation (AutoConfigureOpenTelemetryTests) — The rationale split is accurate: presence-assertion class needs no key, absence-assertion class correctly uses keyless [NotInParallel] with a thorough doc-comment explaining the process-global ActivitySource architecture. ✓
One remaining nit (non-blocking)
The CloudShop timeout comment explains the cause well:
// Generous timeout so the assertion doesn't trip on a busy CI runner where
// RabbitMQ delivery + Worker processing can lag behind under concurrent load.
timeout: TimeSpan.FromSeconds(120),
But it doesn't give the next reader a path forward — a 4.8× inflation from 25 s → 120 s is worth tracking. Adding a // TODO(#XXXX): investigate whether [NotInParallel] on this class (or a dedicated CI step) can bring the timeout back down pointing to a follow-up issue would close the loop. This is a minor nit and should not block the merge.
Verdict: Approve. The bug fix is correct, the tests are deterministic, and all prior review items are resolved.
- AutoConfigureOpenTelemetryTests: OTel exporter Adds to `_exported` on a background batch thread without locking, so the test's poll loop racing on `List<Activity>` threw "Collection was modified". Wrap in a small LockedActivityList (`ICollection<Activity>`) and iterate `Snapshot()` copies.
- Repro5700 / CrossKeyOverlap5700: subprocess runs every [EngineTest=Pass] test (hundreds) so the 15s rendezvous deadline races MaxDOP saturation on slow macOS runners. Bump to 60s — the bug being guarded would still manifest as the rendezvous never resolving, not as a 60s delay.
Code Review — Round 8 (commit a11e766)
Previous Review Status
Round 7 posted an approval covering commits up through 670d82c. Commit a11e766 (fix(tests): unbreak CI on PR #5740) landed after that approval and hasn't had a dedicated pass. This review covers only the delta.
New in commit a11e766
1. LockedActivityList — Correct
The fix for the Collection was modified race is sound. OpenTelemetry's in-memory exporter calls ICollection<T>.Add on a background batch-processor thread with no synchronisation guarantee, so a plain List<Activity> racing with the test's poll loop was always unsafe. Wrapping every mutation in lock (_items) and exposing a Snapshot() copy for iteration is the minimal, correct solution.
One minor note: GetEnumerator() delegates to ((IEnumerable<Activity>)Snapshot()).GetEnumerator(), which allocates a new array and an enumerator object on every foreach. Since all test-level callers already use Snapshot() directly (and the class is only used in a test), this is not a real concern — just worth noting if the pattern is copied elsewhere.
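If the pattern does get copied elsewhere, a minimal sketch of the shape described above may help — here with a hypothetical generic `LockedList<T>` standing in for the PR's `LockedActivityList`, so it compiles without OpenTelemetry references:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading.Tasks;

sealed class LockedList<T> : ICollection<T>
{
    private readonly List<T> _items = new();

    public void Add(T item) { lock (_items) _items.Add(item); }
    public void Clear() { lock (_items) _items.Clear(); }
    public bool Contains(T item) { lock (_items) return _items.Contains(item); }
    public bool Remove(T item) { lock (_items) return _items.Remove(item); }
    public void CopyTo(T[] array, int index) { lock (_items) _items.CopyTo(array, index); }
    public int Count { get { lock (_items) return _items.Count; } }
    public bool IsReadOnly => false;

    // Iteration always sees a point-in-time copy, so a concurrent Add on the
    // exporter's batch thread can never throw "Collection was modified".
    public T[] Snapshot() { lock (_items) return _items.ToArray(); }

    public IEnumerator<T> GetEnumerator() => ((IEnumerable<T>)Snapshot()).GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

static class Demo
{
    static void Main()
    {
        var list = new LockedList<int>();
        var writer = Task.Run(() => { for (var i = 0; i < 1000; i++) list.Add(i); });
        foreach (var _ in list) { }   // safe even while the writer is still adding
        writer.Wait();
        Console.WriteLine(list.Count);
    }
}
```

Note the allocation cost flagged above: every `foreach` over the wrapper copies the list. Fine in a test; worth replacing with `ConcurrentQueue<T>` or explicit `Snapshot()` calls in hot paths.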
2. Rendezvous timeout 15s → 60s — Correct rationale, but outer CTS mismatch
Bumping the RendezvousTimeout constant in Repro5700 and CrossKeyOverlap5700 from 15s to 60s is justified: the test harness runs the full [EngineTest=Pass] set, and under MaxDOP saturation on a slow runner 15s wasn't enough budget.
However: `KeyedNotInParallelCrossPhaseTests` wraps each inner test with a hard `CancellationTokenSource(TimeSpan.FromSeconds(30))`:

```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
await RunTestsWithFilter("/*/*/Repro5700/*", [...], new RunOptions().WithForcefulCancellationToken(cts.Token));
```

The inner rendezvous now allows up to 60s, but the outer engine-test wrapper cancels after 30s. On a slow runner where the rendezvous needs more than 30s, the subprocess gets force-cancelled and the engine test fails — not because the bug is back, but because the scheduling deadline fired before the rendezvous resolved. The outer CTS should be at least as wide as the inner rendezvous timeout (60s), with some headroom:
```csharp
// Outer budget must exceed the inner RendezvousTimeout (60 s) to avoid
// the subprocess being cancelled before the rendezvous resolves.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(90));
```

This is the one actionable issue in this commit.
Remaining Nit from Round 7 (unchanged)
The CloudShop timeout comment:
```csharp
// Generous timeout so the assertion doesn't trip on a busy CI runner where
// RabbitMQ delivery + Worker processing can lag behind under concurrent load.
timeout: TimeSpan.FromSeconds(120),
```

The cause is documented (keyed/Parallel concurrency increase), but there's still no follow-up path. A `// TODO: investigate [NotInParallel] on this class or a dedicated CI step — see PR #5740` would give the next reader a breadcrumb. Low priority, not a blocker.
Verdict
The LockedActivityList fix is correct and the rendezvous timeout increase is justified. The outer 30s CTS vs inner 60s rendezvous mismatch in KeyedNotInParallelCrossPhaseTests is a real reliability risk that should be fixed before merge — bumping to 90s (or removing the outer CTS and relying solely on the inner rendezvous deadline) resolves it. Everything else from rounds 1–7 remains resolved.
Summary
- With `[Test] T1` and `[Test, NotInParallel("k")] T2`, T1 was expected to run alongside T2 (per docs: "Tests with no common key may still run concurrently"), but T1 always finished before T2 started.
- `TestScheduler.ExecuteGroupedTestsAsync` awaited each grouping bucket sequentially — the unconstrained `Parallel` bucket drained fully before the `KeyedNotInParallel` bucket began. Now those two buckets run concurrently.
- `ParallelGroups`/`ConstrainedParallelGroups` still serialize cross-group, and the global `NotInParallel` bucket still drains last (it must run alone).
- Added a `Bugs/5700` regression that asserts T1 observes T2 active during its execution, plus an engine-level wrapper test.

Test plan

- `Repro5700` passes with the fix; verified it fails without it (the assertion `_test1ObservedTest2 > 0` is 0 on `main`).
- `KeyedNotInParallelTests` (12) still passes — keyed serialization untouched.
- `ConstraintKeyStressTests` (30) still passes — no deadlock under high-contention keys.
- `KeyedNotInParallelCrossPhaseTests` engine wrapper passes (Reflection mode locally; AOT runs in CI).

Closes the discussion.
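The scheduling change described in the summary can be illustrated with a rough, framework-free sketch (assumed simplifications throughout; this is not TUnit's actual scheduler code). The key move is that the unconstrained bucket and the keyed bucket are awaited together, while tests sharing a key still run strictly in order:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class SchedulerShapeSketch
{
    static readonly List<string> Log = new();
    static readonly object Gate = new();

    static Func<Task> MakeTest(string name, int ms) => async () =>
    {
        lock (Gate) Log.Add($"start {name}");
        await Task.Delay(ms);
        lock (Gate) Log.Add($"end {name}");
    };

    // Same key: each test awaits the previous one, so the constraint holds.
    static async Task RunKeyedBucket(IEnumerable<Func<Task>> tests)
    {
        foreach (var test in tests)
            await test();
    }

    static async Task Main()
    {
        var unconstrained = Task.WhenAll(MakeTest("P1", 100)(), MakeTest("P2", 100)());
        var keyed = RunKeyedBucket(new[] { MakeTest("K1", 40), MakeTest("K2", 40) });

        // The fix in one line: both buckets progress concurrently instead of
        // the Parallel bucket draining before the keyed bucket starts.
        await Task.WhenAll(unconstrained, keyed);

        lock (Gate) Console.WriteLine(string.Join(", ", Log));
    }
}
```

In the pre-fix shape, `keyed` would only be constructed (and its first test started) after `unconstrained` completed; here both begin immediately, which is exactly what the `Repro5700` rendezvous asserts.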