
fix(engine): allow keyed NotInParallel tests to run alongside unconstrained tests (#5700)#5740

Merged
thomhurst merged 6 commits into main from fix/5700-keyed-notinparallel-concurrency
Apr 26, 2026

Conversation

@thomhurst
Owner

Summary

  • Discussion #5700: with [Test] T1 and [Test, NotInParallel("k")] T2, T1 was expected to run alongside T2 (per docs: "Tests with no common key may still run concurrently"), but T1 always finished before T2 started.
  • TestScheduler.ExecuteGroupedTestsAsync awaited each grouping bucket sequentially — the unconstrained Parallel bucket drained fully before the KeyedNotInParallel bucket began. Now those two buckets run concurrently. ParallelGroups/ConstrainedParallelGroups still serialize cross-group, and the global NotInParallel bucket still drains last (must run alone).
  • Adds a Bugs/5700 regression that asserts T1 observes T2 active during its execution, plus an engine-level wrapper test.
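In miniature, the change swaps a sequential bucket await for a concurrent one. The stand-in below is a self-contained sketch of that idea, not TUnit's scheduler — `PhaseDemo` and `RunBucketsConcurrentlyAsync` are invented names, and the rendezvous only completes if the two "buckets" genuinely overlap:

```csharp
using System;
using System.Threading.Tasks;

public static class PhaseDemo
{
    // Returns true when both buckets ran concurrently, i.e. the rendezvous completed.
    public static async Task<bool> RunBucketsConcurrentlyAsync()
    {
        var keyedStarted = new TaskCompletionSource();

        // Stand-in for the unconstrained Parallel bucket: it can only finish
        // once it has observed the keyed bucket running.
        async Task ParallelBucket() => await keyedStarted.Task;

        // Stand-in for the KeyedNotInParallel bucket: it announces itself, then works.
        async Task KeyedBucket()
        {
            keyedStarted.SetResult();
            await Task.Delay(50);
        }

        // The fix in miniature: start both buckets and await them together.
        // Draining ParallelBucket() to completion first (the old behavior)
        // would hang, because its signal comes from KeyedBucket.
        var whenAll = Task.WhenAll(ParallelBucket(), KeyedBucket());
        var finished = await Task.WhenAny(whenAll, Task.Delay(5000));
        return finished == whenAll;
    }

    public static async Task Main()
        => Console.WriteLine(await RunBucketsConcurrentlyAsync() ? "buckets overlapped" : "hung");
}
```

Under the pre-fix sequential drain this demo's rendezvous would never complete, which is exactly the observable symptom in #5700.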

Test plan

  • Repro5700 passes with the fix; verified it fails without (assertion _test1ObservedTest2 > 0 is 0 on main).
  • KeyedNotInParallelTests (12) still passes — keyed serialization untouched.
  • ConstraintKeyStressTests (30) still passes — no deadlock under high-contention keys.
  • KeyedNotInParallelCrossPhaseTests engine wrapper passes (Reflection mode locally; AOT runs in CI).
  • Wall-clock evidence: regression class drops 6.4s → 4.3s after fix (T1=3s overlapping T2 serial=2s).

Closes the discussion.

@codacy-production

codacy-production Bot commented Apr 25, 2026

Not up to standards ⛔

🔴 Issues 2 minor

Alerts:
⚠ 2 issues (≤ 0 issues of at least minor severity)

Results:
2 new issues

Category Results
CodeStyle 2 minor

View in Codacy

🟢 Metrics 7 complexity

Metric Results
Complexity 7

View in Codacy


Contributor

claude[bot] left a comment

Review: fix(engine): allow keyed NotInParallel tests to run alongside unconstrained tests (#5700)

The fix is correct and well-motivated. The scheduler previously treated Parallel and KeyedNotInParallel as sequential phases, which was an observable bug: an unconstrained [Test] would always drain fully before any [NotInParallel("key")] test started, even though the keyed constraint only governs tests sharing that key. The regression test clearly demonstrates the failure, and the wall-clock evidence (6.4s → 4.3s) is a nice bonus.

A few observations worth discussing:


1. Timing-based regression test is fragile

The Repro5700 test asserts _test1ObservedTest2 > 0 by relying on Test1 running for 3 s (60 × 50 ms) while Test2 runs twice in serial (2 × 1 s = 2 s). This works on a normally loaded machine, but:

  • On a heavily loaded CI runner, a slow OS scheduler might delay Test2's first invocation past Test1's completion, making the assertion flaky.
  • The test tracks a counter (_test1ObservedTest2 > 0) rather than a flag, which accumulates noise and doesn't add much beyond a simple bool _overlapped = false.

A more robust approach would be a SemaphoreSlim or TaskCompletionSource rendezvous: Test1 signals once it is active and Test2 waits for that signal before sleeping, or vice-versa — proving overlap without depending on wall-clock timing at all. The static fields also need to be read with Volatile.Read or accessed exclusively through Interlocked to avoid data races on _test2ConcurrentViolations and _test1ObservedTest2; AssertParallelismShape currently reads them plainly, though on current x86/ARM the memory model would likely not bite you in practice.
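A hypothetical shape for such a rendezvous test (class name, field names, and the 10-second timeout are illustrative, not the PR's code; the comments mark where TUnit's attributes would go):

```csharp
using System;
using System.Threading.Tasks;

public class RendezvousRepro
{
    private static readonly TaskCompletionSource Test1Live = new();
    private static readonly TaskCompletionSource Test2Live = new();

    // Would carry [Test] in TUnit — the unconstrained side.
    public async Task Test1()
    {
        Test1Live.TrySetResult();
        // Deterministic failure mode: if the scheduler serializes the buckets,
        // this wait times out instead of an assertion being timing-flaky.
        await Test2Live.Task.WaitAsync(TimeSpan.FromSeconds(10));
    }

    // Would carry [Test, NotInParallel("k")] in TUnit — the keyed side.
    public async Task Test2()
    {
        Test2Live.TrySetResult();
        await Test1Live.Task.WaitAsync(TimeSpan.FromSeconds(10));
    }
}
```

Each side announces it is live and then waits to observe the other side live, so overlap is proven structurally rather than sampled by polling.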


2. concurrentPhases.ToArray() allocation inside the hot path

The ToArray() call exists only because WaitForTasksWithFailFastHandling's signature forces an array. A small nit, but since this runs every time tests are scheduled you could either:

// Option A: pass the list directly
await WaitForTasksWithFailFastHandling(concurrentPhases, cancellationToken).ConfigureAwait(false);

…by changing the signature to IEnumerable<Task>, or

// Option B: use a fixed-size span/array from the start given the capacity is known (at most 2)
Task? parallelTask = null, keyedTask = null;

…and construct only what you need. The List<Task>(2) + ToArray() dance allocates twice for a maximum of two elements. A pair of nullable Task? fields would be zero-allocation here.


3. Asymmetric concurrentPhases vs tasks naming

The ConstrainedParallelGroups loop (line 226–237) uses the same pattern with a local tasks list and ToArray(). Now that ExecuteGroupedTestsAsync has two separate code sites doing the same thing, this is a good moment to extract a small helper:

private async Task RunConcurrentlyAsync(Task? a, Task? b, CancellationToken ct)
{
    if (a is null && b is null) return;
    if (a is null) { await b!.ConfigureAwait(false); return; }
    if (b is null) { await a.ConfigureAwait(false); return; }
    await WaitForTasksWithFailFastHandling([a, b], ct).ConfigureAwait(false);
}

This removes the List<Task> allocations at both call sites and makes the intent obvious.


4. Missing coverage for the symmetric case

The PR tests that an unconstrained test (Parallel bucket) can run alongside a KeyedNotInParallel test. But the symmetric scenario — a KeyedNotInParallel test from key A running alongside a KeyedNotInParallel test from key B — is not covered. Both end up in the same concurrentPhases bucket and are dispatched together to ConstraintKeyScheduler, which serializes only within the same key. A test asserting that two differently-keyed [NotInParallel] tests overlap each other would round out the coverage.
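A hypothetical sketch of that missing symmetric test (names and timeout are illustrative; the comments mark where TUnit's attributes would go). Two differently-keyed tests rendezvous with each other, so the test fails fast if the constraint-key scheduler ever serializes across keys:

```csharp
using System;
using System.Threading.Tasks;

public class CrossKeySketch
{
    private static readonly TaskCompletionSource KeyALive = new();
    private static readonly TaskCompletionSource KeyBLive = new();

    // Would carry [Test, NotInParallel("A")] in TUnit.
    public async Task TestA()
    {
        KeyALive.TrySetResult();
        // Times out deterministically if key-B tests never overlap key-A tests.
        await KeyBLive.Task.WaitAsync(TimeSpan.FromSeconds(10));
    }

    // Would carry [Test, NotInParallel("B")] in TUnit.
    public async Task TestB()
    {
        KeyBLive.TrySetResult();
        await KeyALive.Task.WaitAsync(TimeSpan.FromSeconds(10));
    }
}
```

Because both tests land in the same concurrentPhases bucket, this would pin down that ConstraintKeyScheduler serializes only within a key.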


5. _test1Active field is tracked but never asserted

_test1Active is incremented/decremented in Test1_RunsAlongsideKeyedTest2 but is not checked in AssertParallelismShape. If the field was meant to be a guard (e.g., to detect if Test1 exited early), it should be asserted; otherwise it can be removed to reduce noise.


Summary

The core scheduling change is correct and the trade-offs (ParallelGroups still serialize cross-group, global NotInParallel still drains last) are sound. The main concerns are the fragility of the timing-based regression test and two small allocation opportunities in the hot path. None of these block the fix, but the test fragility in particular is worth addressing before this lands on a slow CI machine.

Contributor

claude[bot] left a comment

Code Review: fix(engine): allow keyed NotInParallel tests to run alongside unconstrained tests

Summary

The fix is correct and well-motivated. Moving KeyedNotInParallel execution to run concurrently with the Parallel bucket (via Task.WhenAll through WaitForTasksWithFailFastHandling) properly reflects the semantics: a [NotInParallel("key")] test only conflicts with other tests sharing that key, so it should not be blocked by unrelated unconstrained tests draining first.

The two concurrent tasks share no mutable state — ExecuteTestsAsync uses Parallel.ForEachAsync with a global DOP cap, while ConstraintKeyScheduler.ExecuteTestsWithConstraintsAsync protects its own lockedKeys/waitingTestIndex under its own lockObject. The .AsTask() call on the returned ValueTask is also correct (required before adding to the task list).


Issues

1. Repro5700 is timing-sensitive — add [Retry(3)] (HIGH)

Test1 polls _test2Active every 50 ms for 3 seconds. The assertion _test1ObservedTest2 > 0 relies on Test2 starting its first invocation while Test1 is still running. On a loaded CI runner this margin could be violated. Looking at KeyedNotInParallelTests and NotInParallelMixedTests in the project — both carry [Retry(3)] for exactly this reason. Repro5700 has no [Retry] guard, making it a CI flakiness risk.

Better approach: Add [Retry(3)] at minimum. A structurally correct alternative is a rendezvous mechanism (SemaphoreSlim, ManualResetEventSlim, or TaskCompletionSource) so Test1 signals it is active and Test2 waits for that signal, proving overlap deterministically rather than by wall-clock coincidence.

2. Inconsistent Volatile.Read in AssertParallelismShape (MEDIUM)

The test carefully uses Volatile.Read(ref _test2Active) inside the polling loop (line 35), but reads _test1ObservedTest2 (line 68) and _test2ConcurrentViolations (line 67) in the [After(Class)] hook without Volatile.Read. On the .NET memory model the [After(Class)] happens-after all test tasks, so this is safe in practice, but it is formally inconsistent with the rest of the class. Worth fixing for correctness-by-inspection:

Assert.That(Volatile.Read(ref _test2ConcurrentViolations), Is.EqualTo(0), ...);
Assert.That(Volatile.Read(ref _test1ObservedTest2), Is.GreaterThan(0), ...);

3. Pre-existing tasks.ToArray() inconsistency at line 237 (LOW)

The new code at line ~198 correctly passes concurrentPhases (a List<Task>) directly to WaitForTasksWithFailFastHandling(IEnumerable<Task>, ...) — no unnecessary allocation. However, the unchanged ConstrainedParallelGroups block at line ~237 still does tasks.ToArray() before passing to the same method. The PR is a good opportunity to fix this asymmetry and remove the redundant copy:

// Before
await WaitForTasksWithFailFastHandling(tasks.ToArray(), cancellationToken);

// After
await WaitForTasksWithFailFastHandling(tasks, cancellationToken);

4. Missing cross-key [NotInParallel] concurrency test (LOW)

Repro5700 validates Parallel ↔ KeyedNotInParallel overlap. It does not verify that [NotInParallel("A")] tests run concurrently with [NotInParallel("B")] tests (different keys, no shared constraint). This is logically guaranteed by ConstraintKeyScheduler's design, but a targeted test would complete the cross-bucket regression coverage introduced by this PR.

5. Dynamic queue gap (pre-existing, worth noting)

ProcessDynamicTestQueueAsync handles Parallel and NotInParallel dynamic tests but silently ignores groupedDynamicTests.KeyedNotInParallel. This predates this PR, but since this PR introduces the concurrent-phase model for KeyedNotInParallel, it is a good time to flag this for a follow-up issue.


Previous Review

The prior claude[bot] review raised 5 observations. None have been addressed yet (same commit f2f6147). Note that observation #2 in that review (about ToArray() allocation) was incorrect as applied to the new code — the PR correctly avoids ToArray() in the new path.


Verdict

The core fix is correct and safe to merge. The primary blocker before landing is adding [Retry(3)] to Repro5700 to match the established pattern for timing-sensitive scheduler tests in this project. The other items are lower priority but worth addressing in this PR or in a follow-up.

Contributor

claude[bot] left a comment

Code Review: fix(engine): allow keyed NotInParallel tests to run alongside unconstrained tests (#5700)

Summary

The fix is correct, minimal, and well-tested. The root cause — sequential phase draining in ExecuteGroupedTestsAsync — is addressed cleanly by firing both the unconstrained and keyed-NotInParallel tasks concurrently and awaiting them together.


What's Good

Correct diagnosis and fix. The bug was that ExecuteGroupedTestsAsync fully awaited the Parallel bucket before starting KeyedNotInParallel. The refactor to RunPhasesConcurrentlyAsync is the minimal, right-sized fix.

Excellent test design. Using TaskCompletionSource rendezvous patterns in Repro5700 and CrossKeyOverlap5700 instead of wall-clock sleeps makes the tests deterministic on slow CI. Both tests fail fast and deterministically on the broken scheduler. This is exactly how concurrent scheduling tests should be written.

The RunPhasesConcurrentlyAsync helper is clean. The null-guard short-circuit paths avoid unnecessary Task.WhenAll allocation when one bucket is empty — a minor but correct performance consideration.


Issues Worth Addressing

1. Dynamic tests with KeyedNotInParallel are silently dropped — ProcessDynamicTestQueueAsync

File: TestScheduler.cs lines 274–284 and 305–313

ProcessDynamicTestQueueAsync only handles groupedDynamicTests.Parallel and groupedDynamicTests.NotInParallel:

if (groupedDynamicTests.Parallel.Length > 0)
    await ExecuteTestsAsync(groupedDynamicTests.Parallel, cancellationToken).ConfigureAwait(false);

if (groupedDynamicTests.NotInParallel.Length > 0)
    await ExecuteSequentiallyAsync(groupedDynamicTests.NotInParallel, cancellationToken).ConfigureAwait(false);

There's no handling for groupedDynamicTests.KeyedNotInParallel. Any dynamically-generated test with [NotInParallel("key")] would be silently skipped — it won't execute at all.

This is a pre-existing bug, not introduced by this PR. But since the PR is auditing the scheduling phases, it's worth surfacing: ProcessDynamicTestQueueAsync should either add the missing KeyedNotInParallel handling (mirroring ExecuteGroupedTestsAsync) or add a comment acknowledging the limitation.

Suggestion:

if (groupedDynamicTests.KeyedNotInParallel.Length > 0)
{
    await _constraintKeyScheduler
        .ExecuteTestsWithConstraintsAsync(groupedDynamicTests.KeyedNotInParallel, cancellationToken)
        .ConfigureAwait(false);
}

2. RunPhasesConcurrentlyAsync — misleading cancellationToken parameter

File: TestScheduler.cs lines 396–401

private Task RunPhasesConcurrentlyAsync(Task? a, Task? b, CancellationToken cancellationToken)
{
    if (a is null) return b ?? Task.CompletedTask;
    if (b is null) return a;
    return WaitForTasksWithFailFastHandling([a, b], cancellationToken);
}

The fast-path returns (a is null / b is null) don't use cancellationToken at all — the token is already baked into the tasks by the time they arrive. Inside WaitForTasksWithFailFastHandling, the token is only checked after an exception to detect fail-fast mode, not to cancel the wait itself.

This is functionally fine, but the signature implies the token can influence what happens — it can't. Either drop the parameter (and pass the token only into WaitForTasksWithFailFastHandling at the call site), or add a brief comment explaining why it's only used for fail-fast detection:

// cancellationToken is checked only for fail-fast exception propagation inside
// WaitForTasksWithFailFastHandling; tasks are already started and carry their own token.

3. ParallelGroups sequential loop — worth documenting

File: TestScheduler.cs lines 198–215

The ParallelGroups loop still processes each group sequentially (await ExecuteTestsAsync(...)). This is intentional (cross-group exclusion), but the intent differs from the two newly-concurrent phases above it. A short comment here would prevent a future reader from assuming it's another instance of the pre-fix bug:

// Each parallel group runs to completion before the next — tests within a group
// are bucketed together precisely because they must not overlap.
foreach (var group in groupedTests.ParallelGroups) { ... }

Minor Observation

CloudShop timeout bump (25s → 60s) — unrelated to the scheduling fix. Bumping timeouts to suppress flakiness without understanding the root cause tends to mask real problems. If Step3_Worker_Fulfills_Order was intermittently hitting 25s, that may indicate an ordering/async issue in the test itself rather than a slow CI box. Worth investigating separately rather than silently widening the window.


Verdict

The core fix is solid. The two real concerns are:

  1. ProcessDynamicTestQueueAsync silently drops KeyedNotInParallel dynamic tests (pre-existing gap, worth fixing in a follow-up if not now).
  2. RunPhasesConcurrentlyAsync CancellationToken semantics are slightly misleading (low risk, easy to clarify with a comment).

The rendezvous-based tests are the right way to validate concurrent scheduling behavior. Approved conceptually — addressing item 1 would make this a complete fix rather than a partial one.

thomhurst added a commit that referenced this pull request Apr 25, 2026
Address PR #5740 review item: ProcessDynamicTestQueueAsync only handled
Parallel and global NotInParallel — dynamically-added tests with
[NotInParallel(key)], [ParallelGroup(...)], or constrained-parallel-group
attributes were silently dropped.

Extract ExecuteAllPhasesAsync from ExecuteGroupedTestsAsync and call it from
both the main session path and the dynamic batch loop, so any future bucket
added to GroupedTests is automatically respected on both paths.

Also: clarify cancellationToken role in RunPhasesConcurrentlyAsync and
document the intentional sequential ParallelGroups loop.
Contributor

claude[bot] left a comment

Code Review — Round 4 (commit 749b58b)

What's Been Fixed (Previous Reviews Addressed)

All major concerns from the three prior claude[bot] reviews have been resolved:

  • Timing-based test fragility — Repro5700 now uses TaskCompletionSource rendezvous instead of wall-clock polling. This is the correct approach.
  • Missing cross-key coverage — CrossKeyOverlap5700 covers the symmetric [NotInParallel("A")] / [NotInParallel("B")] scenario.
  • Dynamic queue gap — ExecuteDynamicBatchAsync routes dynamic tests through ExecuteAllPhasesAsync, so dynamically-generated [NotInParallel("key")] tests now execute correctly instead of being silently dropped.
  • List<Task> + ToArray() allocation — eliminated at both call sites by RunPhasesConcurrentlyAsync.
  • RunPhasesConcurrentlyAsync cancellation token comment — added; explains why the fast-path returns don't inspect the token.
  • ParallelGroups sequential-by-design comment — added.

Remaining Concern

CloudShop timeout bump (25s → 60s) — no root cause identified

File: examples/CloudShop/CloudShop.Tests/Tests/Orders/OrderWorkflowTests.cs

The comment says "Generous timeout so the assertion doesn't trip on a busy CI runner where RabbitMQ delivery + Worker processing can lag behind under concurrent load." But the scheduling change here introduces more concurrency, not less — unconstrained and keyed-NotInParallel tests now run simultaneously where they previously ran sequentially. If Step3_Worker_Fulfills_Order is in a group with tests that now run concurrently against it, that could explain a real regression; but the comment doesn't establish that causal chain.

Three possible cases:

  1. Pre-existing flakiness, unrelated to this PR — the bump is papering over a known fragility.
  2. Real regression — the concurrent phase execution causes resource contention in CI (e.g., shared Docker/RabbitMQ) that inflates processing time.
  3. Coincidence — slow CI run happened to reveal an edge case during review.

Why this matters: masking CI flakiness with a generous timeout is a pattern that tends to compound. 60s for a polling assertion is very wide. If this is a real regression from the scheduling change, it should be understood and addressed. If it's pre-existing, it should be filed as a separate issue rather than silently widened.

Suggestion: Either trace why this test now needs 60s (and document it in the PR), or revert the timeout and open a separate issue for the flakiness. A comment in the test explaining the failure mode would help future readers.


Overall Assessment

The core scheduling fix is correct, well-reasoned, and now backed by deterministic regression tests — the right shape for a scheduling correctness bug. The extraction of ExecuteAllPhasesAsync and ExecuteDynamicBatchAsync makes the phase pipeline reusable for dynamic tests without duplication. The RunPhasesConcurrentlyAsync helper is clean and allocation-conscious.

The CloudShop timeout is the one unresolved item. It's not a blocker if the cause is understood and accepted, but it should be explicitly acknowledged rather than left ambiguous.

thomhurst enabled auto-merge (squash) April 25, 2026 10:16
thomhurst added a commit that referenced this pull request Apr 25, 2026
OptOut_DoesNotTag_AspNetCoreSpans regressed under #5740's keyed/Parallel
overlap. Each TracerProvider's AddAspNetCoreInstrumentation subscribes to
the process-global Microsoft.AspNetCore ActivitySource, so a parallel
auto-OTel WebApplicationTest's correlation processor stamps spans the
OptOut exporter then captures. Keyed serialization could only enumerate
sibling auto-wire tests; switch OptOut to global [NotInParallel] until
per-factory filtering lands.

Also bump CloudShop Step3 timeout to 120s — Worker fulfillment via
RabbitMQ lags further under the new keyed/Parallel concurrency.
Contributor

claude[bot] left a comment

Code Review — Round 5 (commit efbf7b1)

Previous Review Items — All Resolved

All concerns from rounds 1–4 have been addressed:

  • Timing-based test fragility — Repro5700 uses TaskCompletionSource rendezvous. Deterministic.
  • Missing cross-key coverage — CrossKeyOverlap5700 covers [NotInParallel("A")] / [NotInParallel("B")] overlap.
  • Dynamic queue gap — ExecuteDynamicBatchAsync routes dynamic tests through ExecuteAllPhasesAsync, so keyed dynamic tests no longer silently drop.
  • List<Task> + ToArray() allocation — eliminated by RunPhasesConcurrentlyAsync.
  • CancellationToken semantics comment — added and explains the fast-path correctly.
  • ParallelGroups sequential-by-design comment — added.

New in this commit (efbf7b1)

OTel Isolation Redesign — Correct

The two OTel test classes have asymmetric assertion semantics:

  • AutoConfigureOpenTelemetryTests asserts tag presence — a foreign span that also carries a TestId tag does not produce a false negative. No isolation needed. ✅
  • AutoConfigureOpenTelemetryOptOutTests asserts tag absence — a parallel factory's correlation processor stamps spans before they reach this exporter, producing false negatives. Global [NotInParallel] is the right constraint until per-factory filtering lands. ✅

The docstrings explain both decisions clearly. The global drain for OptOut is a performance tax but unavoidable given the process-global ActivitySource architecture. Good call.

CloudShop Timeout: 25s → 60s → 120s — Still Unresolved (and getting worse)

File: examples/CloudShop/CloudShop.Tests/Tests/Orders/OrderWorkflowTests.cs

The commit message at least names the cause: "Worker fulfillment via RabbitMQ lags further under the new keyed/Parallel concurrency." The causal chain is now documented, which is progress from round 4.

However, the symptom is worsening: the timeout has inflated 4.8× across this PR. A 120-second polling window for a RabbitMQ delivery + Worker processing cycle is a strong signal that the increased concurrency is saturating shared CI resources (Docker, RabbitMQ throughput, CPU). This has two practical risks:

  1. CI suite wall-clock time — a test that can legitimately wait 120 seconds adds 2 minutes to every CI run where the slow path is hit, multiplied across parallel workers.
  2. Masked regression — if the test regresses for a real reason (e.g., a broken message handler), the first symptom is a 120-second hang rather than a quick fail.

Better approach: Investigate whether the CloudShop integration tests should be isolated from the scheduler-level concurrency increase. Options:

  • Add [NotInParallel] (or a shared key) to the CloudShop test class so it does not compete with CPU/IO-heavy suite tests.
  • Run the CloudShop E2E suite as a separate CI step with its own resource budget.
  • At minimum, document in the test file that the 120s is a scheduling-contention accommodation and open a tracking issue.

120 seconds is too wide to leave undocumented. The current comment only says "busy CI runner" — it should name the root cause (keyed/Parallel concurrency increase) and link the PR.


Verdict

The scheduling fix is correct, well-tested, and all structural concerns from prior reviews are resolved. The OTel isolation logic is sound. The one remaining concern is the CloudShop timeout, which has been widened further rather than root-caused. It is not a correctness blocker, but I'd recommend either isolating the CloudShop tests from the concurrent scheduler or opening a tracking issue before this merges so the timeout doesn't silently grow further.

thomhurst added a commit that referenced this pull request Apr 26, 2026
Address PR #5740 review item: ProcessDynamicTestQueueAsync only handled
Parallel and global NotInParallel — dynamically-added tests with
[NotInParallel(key)], [ParallelGroup(...)], or constrained-parallel-group
attributes were silently dropped.

Extract ExecuteAllPhasesAsync from ExecuteGroupedTestsAsync and call it from
both the main session path and the dynamic batch loop, so any future bucket
added to GroupedTests is automatically respected on both paths.

Also: clarify cancellationToken role in RunPhasesConcurrentlyAsync and
document the intentional sequential ParallelGroups loop.
thomhurst added a commit that referenced this pull request Apr 26, 2026
OptOut_DoesNotTag_AspNetCoreSpans regressed under #5740's keyed/Parallel
overlap. Each TracerProvider's AddAspNetCoreInstrumentation subscribes to
the process-global Microsoft.AspNetCore ActivitySource, so a parallel
auto-OTel WebApplicationTest's correlation processor stamps spans the
OptOut exporter then captures. Keyed serialization could only enumerate
sibling auto-wire tests; switch OptOut to global [NotInParallel] until
per-factory filtering lands.

Also bump CloudShop Step3 timeout to 120s — Worker fulfillment via
RabbitMQ lags further under the new keyed/Parallel concurrency.
thomhurst force-pushed the fix/5700-keyed-notinparallel-concurrency branch from efbf7b1 to 8ee609f on April 26, 2026 11:12
Contributor

claude[bot] left a comment

Code Review — Round 6 (commit efbf7b1 → current)

All concerns from rounds 1–5 have been addressed. This is a well-executed fix.

Core fix (TestScheduler.cs) — ✅

The architectural approach is correct: Parallel and KeyedNotInParallel buckets now start concurrently, ParallelGroups still serialize cross-group, and global NotInParallel still drains last. The phase ordering matches the documented semantics in parallelism.md.

RunPhasesConcurrentlyAsync is clean — null-guard short-circuits avoid allocating a Task[] when one bucket is empty, and the token-forwarding comment accurately describes why the parameter exists. No issues here.

Dynamic test routing (ExecuteDynamicBatchAsync) — ✅

Routing dynamic tests through ExecuteAllPhasesAsync was the right call. The pre-existing silent-drop for KeyedNotInParallel dynamic tests is now fixed as a side-effect. The recursive structure is safe: nested ExecuteAllPhasesAsync calls do not touch _dynamicTestQueue.Complete(), so the queue lifecycle remains owned by ExecuteGroupedTestsAsync.

Tests (Repro5700, CrossKeyOverlap5700) — ✅

Both rendezvous tests are deterministic and will fail fast on the bug rather than depending on wall-clock timing. AssertKeyedSerialization correctly verifies the constraint still serializes within its key. The engine-level wrapper tests in KeyedNotInParallelCrossPhaseTests give CI-level regression coverage.

OTel isolation redesign — ✅

The asymmetric handling (presence assertion vs. absence assertion) is architecturally correct. Removing the keyed constraint from AutoConfigureOpenTelemetryTests and promoting AutoConfigureOpenTelemetryOptOutTests to global [NotInParallel] is the right trade-off given the process-global ActivityListener model.

CloudShop timeout (25 s → 120 s) — remaining concern

The comment now explains the proximate cause (RabbitMQ delivery + Worker processing lag under increased concurrent load). That is appreciated. However, a 4.8× timeout inflation is still a yellow flag worth tracking:

  • The increased concurrency from this scheduler fix is by design — CloudShop's integration tests are now exposed to more parallelism. If 120 s is needed to survive that, it suggests either (a) the Worker/RabbitMQ stack has a saturation point that CI regularly hits, or (b) the test and the service under test were implicitly depending on the old serialized scheduler.
  • A test with a 120 s ceiling that times out on CI doesn't give a useful signal — it just eats 2 minutes before failing.

Suggestion: open a follow-up issue (or add a // TODO(#XXXX) to the comment) to investigate whether Step3_Worker_Fulfills_Order should be isolated from the concurrent scheduler (e.g. via [NotInParallel] on that test or its class) or whether the Worker/RabbitMQ health-check polling interval should be tightened. The comment as-is explains the symptom but doesn't give the next person a path forward.

This is low priority and not a blocker — the fix itself is correct and safe to merge. The CloudShop timeout is a quality-of-life follow-up, not a correctness issue.

Verdict: ✅ Ready to merge, with the CloudShop timeout tracked as a follow-up.

…rently

Discussion #5700: with `[Test] T1` and `[Test, NotInParallel("k")] T2`, T1 was
expected to run alongside T2 (per docs/execution/parallelism.md: "Tests with
no common key may still run concurrently"), but in practice T1 always finished
before T2 started.

Cause: TestScheduler.ExecuteGroupedTestsAsync awaited each grouping bucket in
sequence. The Parallel bucket (no constraints) drained fully before the
KeyedNotInParallel bucket began, so unconstrained tests never overlapped keyed
ones. Run those two buckets concurrently — they share no synchronization state.
ParallelGroups still serialize (cross-group exclusion) and the global
NotInParallel bucket still drains last (must run alone).

Adds a regression test (Bugs/5700) that asserts Test1 observes Test2 active
during its execution, plus an engine-level wrapper test.
…overage

- Repro5700 now uses TaskCompletionSource rendezvous instead of a 3s wall-clock
  sample window so the assertion is deterministic on slow CI runners. Test1
  signals once it is live and waits for any Test2 invocation; each Test2
  invocation does the inverse. Either side timing out flags the bug.
- New CrossKeyOverlap5700 fixture covers the symmetric scenario (two
  [NotInParallel] tests with different keys must overlap).
- Extract `RunPhasesConcurrentlyAsync(Task?, Task?, CancellationToken)` helper
  in TestScheduler. Unifies the new Parallel + KeyedNotInParallel call site
  with the existing ConstrainedParallelGroups loop and removes the per-phase
  `List<Task>` + `ToArray()` dance (zero allocation in the common single-phase
  case).
- Bump CloudShop OrderWorkflow Step3 WaitsFor timeout from 25s → 60s. The
  worker eventual-consistency window was already borderline; under the new
  cross-phase concurrency the RabbitMQ + worker pipeline shares CPU with
  unconstrained tests and 25s no longer fits a busy CI runner.
Address PR #5740 review item: ProcessDynamicTestQueueAsync only handled
Parallel and global NotInParallel — dynamically-added tests with
[NotInParallel(key)], [ParallelGroup(...)], or constrained-parallel-group
attributes were silently dropped.

Extract ExecuteAllPhasesAsync from ExecuteGroupedTestsAsync and call it from
both the main session path and the dynamic batch loop, so any future bucket
added to GroupedTests is automatically respected on both paths.

Also: clarify cancellationToken role in RunPhasesConcurrentlyAsync and
document the intentional sequential ParallelGroups loop.
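A runnable analogue of that refactor, using made-up types rather than TUnit's `GroupedTests`: one executor that both the session path and the dynamic-batch path call, so no bucket kind can be honoured on one path but silently dropped on the other.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var log = new List<string>();

// Illustrative stand-in for draining one bucket of tests.
async Task RunBucketAsync(Dictionary<string, List<string>> buckets, string kind)
{
    if (!buckets.TryGetValue(kind, out var tests)) return; // bucket may be absent
    foreach (var test in tests) { await Task.Yield(); log.Add($"{kind}:{test}"); }
}

// Single phase executor shared by both call sites: Parallel and
// KeyedNotInParallel overlap, global NotInParallel drains last.
async Task ExecuteAllPhasesAsync(Dictionary<string, List<string>> buckets)
{
    await Task.WhenAll(
        RunBucketAsync(buckets, "Parallel"),
        RunBucketAsync(buckets, "KeyedNotInParallel"));
    await RunBucketAsync(buckets, "NotInParallel");
}

// Main session and a later dynamic batch share the same implementation:
await ExecuteAllPhasesAsync(new() { ["Parallel"] = new() { "T1" }, ["KeyedNotInParallel"] = new() { "T2" } });
await ExecuteAllPhasesAsync(new() { ["NotInParallel"] = new() { "Dynamic1" } });
Console.WriteLine(string.Join(",", log));
```

Because the dynamic batch goes through the same `ExecuteAllPhasesAsync`, a dynamically added `[NotInParallel(key)]` test lands in the keyed bucket and is executed rather than dropped.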
OptOut_DoesNotTag_AspNetCoreSpans regressed under #5740's keyed/Parallel
overlap. Each TracerProvider's AddAspNetCoreInstrumentation subscribes to
the process-global Microsoft.AspNetCore ActivitySource, so a parallel
auto-OTel WebApplicationTest's correlation processor stamps spans the
OptOut exporter then captures. Keyed serialization could only fence off
sibling auto-wire tests that share the key, so switch OptOut to global
[NotInParallel] until per-factory filtering lands.

Also bump CloudShop Step3 timeout to 120s — Worker fulfillment via
RabbitMQ lags further under the new keyed/Parallel concurrency.
@thomhurst force-pushed the fix/5700-keyed-notinparallel-concurrency branch from 8ee609f to 670d82c on April 26, 2026 at 11:36
Contributor

@claude (Bot) left a comment

Round 7 — Approval

All concerns from rounds 1–6 are resolved. The implementation is correct, well-tested, and production-ready.

What's right

Core fix (TestScheduler.cs) — Clean and correct. ExecuteAllPhasesAsync properly runs the Parallel and KeyedNotInParallel buckets concurrently while keeping ParallelGroups cross-group serialization and global NotInParallel drain-last semantics intact. The code comment precisely documents why sequential ordering was the bug (#5700). ✓

RunPhasesConcurrentlyAsync — Null-guard short-circuits avoid Task[] allocation on common single-phase paths. The comment explaining the CancellationToken forwarding intent is exactly the kind of non-obvious note worth keeping. ✓

ExecuteDynamicBatchAsync — Routing dynamic tests through ExecuteAllPhasesAsync instead of two bare if-blocks fixes the pre-existing silent-drop for [NotInParallel(key)] dynamically-generated tests. Both call sites now share one implementation. ✓

Regression tests — Repro5700 uses a TaskCompletionSource rendezvous (no wall-clock polling), CrossKeyOverlap5700 covers the symmetric different-keys scenario, and KeyedNotInParallelCrossPhaseTests wraps both at the engine level with 30-second hard timeouts. Deterministic and correct. ✓

OTel isolation (AutoConfigureOpenTelemetryTests) — The rationale split is accurate: presence-assertion class needs no key, absence-assertion class correctly uses keyless [NotInParallel] with a thorough doc-comment explaining the process-global ActivitySource architecture. ✓

One remaining nit (non-blocking)

The CloudShop timeout comment explains the cause well:

// Generous timeout so the assertion doesn't trip on a busy CI runner where
// RabbitMQ delivery + Worker processing can lag behind under concurrent load.
timeout: TimeSpan.FromSeconds(120),

But it doesn't give the next reader a path forward — a 4.8× inflation from 25 s → 120 s is worth tracking. Adding a // TODO(#XXXX): investigate whether [NotInParallel] on this class (or a dedicated CI step) can bring the timeout back down pointing to a follow-up issue would close the loop. This is a minor nit and should not block the merge.


Verdict: Approve. The bug fix is correct, the tests are deterministic, and all prior review items are resolved.

- AutoConfigureOpenTelemetryTests: OTel exporter Adds to _exported on a
  background batch thread without locking, so the test's poll loop racing
  on List<Activity> threw "Collection was modified". Wrap in a small
  LockedActivityList (ICollection<Activity>) and iterate Snapshot() copies.
- Repro5700 / CrossKeyOverlap5700: subprocess runs every [EngineTest=Pass]
  test (hundreds) so the 15s rendezvous deadline races MaxDOP saturation
  on slow macOS runners. Bump to 60s — the bug being guarded would still
  manifest as the rendezvous never resolving, not as a 60s delay.
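The first bullet's fix can be sketched as below. This is a minimal sketch of the LockedActivityList idea (the real test helper may differ): every mutation takes the lock, and readers iterate a Snapshot() copy, so a background exporter thread calling Add can never invalidate an in-flight foreach with "Collection was modified".

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

var list = new LockedActivityList();
list.Add(new Activity("op").Start());
foreach (var activity in list) Console.WriteLine(activity.OperationName); // iterates a copy

// Thread-safe ICollection<Activity> wrapper: lock on every mutation,
// snapshot on every read, so concurrent Add from the exporter's batch
// thread cannot race the test's poll loop.
sealed class LockedActivityList : ICollection<Activity>
{
    private readonly List<Activity> _items = new();

    public void Add(Activity item) { lock (_items) _items.Add(item); }
    public void Clear() { lock (_items) _items.Clear(); }
    public bool Contains(Activity item) { lock (_items) return _items.Contains(item); }
    public void CopyTo(Activity[] array, int index) { lock (_items) _items.CopyTo(array, index); }
    public bool Remove(Activity item) { lock (_items) return _items.Remove(item); }
    public int Count { get { lock (_items) return _items.Count; } }
    public bool IsReadOnly => false;

    public Activity[] Snapshot() { lock (_items) return _items.ToArray(); }

    // Enumerates a copy: allocates per foreach, but is safe under concurrent Add.
    public IEnumerator<Activity> GetEnumerator() => ((IEnumerable<Activity>)Snapshot()).GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

Callers that poll should prefer `Snapshot()` directly, as the later review notes; the `GetEnumerator()` copy is a safety net rather than a hot path.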
Contributor

@claude (Bot) left a comment

Code Review — Round 8 (commit a11e766)

Previous Review Status

Round 7 posted an approval covering commits up through 670d82c. Commit a11e766 (fix(tests): unbreak CI on PR #5740) landed after that approval and hasn't had a dedicated pass. This review covers only the delta.


New in commit a11e766

1. LockedActivityList — Correct

The fix for the Collection was modified race is sound. OpenTelemetry's in-memory exporter calls ICollection<T>.Add on a background batch-processor thread with no synchronisation guarantee, so a plain List<Activity> racing with the test's poll loop was always unsafe. Wrapping every mutation in lock (_items) and exposing a Snapshot() copy for iteration is the minimal, correct solution.

One minor note: GetEnumerator() delegates to ((IEnumerable<Activity>)Snapshot()).GetEnumerator(), which allocates a new array and an enumerator object on every foreach. Since all test-level callers already use Snapshot() directly (and the class is only used in a test), this is not a real concern — just worth noting if the pattern is copied elsewhere.

2. Rendezvous timeout 15s → 60s — Correct rationale, but outer CTS mismatch

Bumping the RendezvousTimeout constant in Repro5700 and CrossKeyOverlap5700 from 15s to 60s is justified: the test harness runs the full [EngineTest=Pass] set, and under MaxDOP saturation on a slow runner 15s wasn't enough budget.

However: KeyedNotInParallelCrossPhaseTests wraps each inner test with a hard CancellationTokenSource(TimeSpan.FromSeconds(30)):

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
await RunTestsWithFilter("/*/*/Repro5700/*", [...], new RunOptions().WithForcefulCancellationToken(cts.Token));

The inner rendezvous now allows up to 60s, but the outer engine-test wrapper cancels after 30s. On a slow runner where the rendezvous needs more than 30s, the subprocess gets force-cancelled and the engine test fails — not because the bug is back, but because the scheduling deadline fired before the rendezvous resolved. The outer CTS should be at least as wide as the inner rendezvous timeout (60s), with some headroom:

// Outer budget must exceed the inner RendezvousTimeout (60 s) to avoid
// the subprocess being cancelled before the rendezvous resolves.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(90));

This is the one actionable issue in this commit.


Remaining Nit from Round 7 (unchanged)

The CloudShop timeout comment:

// Generous timeout so the assertion doesn't trip on a busy CI runner where
// RabbitMQ delivery + Worker processing can lag behind under concurrent load.
timeout: TimeSpan.FromSeconds(120),

The cause is documented (keyed/Parallel concurrency increase), but there's still no follow-up path. A // TODO: investigate [NotInParallel] on this class or a dedicated CI step — see PR #5740 would give the next reader a breadcrumb. Low priority, not a blocker.


Verdict

The LockedActivityList fix is correct and the rendezvous timeout increase is justified. The outer 30s CTS vs inner 60s rendezvous mismatch in KeyedNotInParallelCrossPhaseTests is a real reliability risk that should be fixed before merge — bumping to 90s (or removing the outer CTS and relying solely on the inner rendezvous deadline) resolves it. Everything else from rounds 1–7 remains resolved.

@thomhurst thomhurst merged commit 25ebc55 into main Apr 26, 2026
13 of 16 checks passed
@thomhurst thomhurst deleted the fix/5700-keyed-notinparallel-concurrency branch April 26, 2026 13:41