ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846)#5852
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n introduced by #5617 - Add `Mode != AllVersionsAndDeletes` guard to the StartTime backfill in ChangeFeedProcessorCore.StartAsync() so AVAD push processors no longer issue an explicit StartTime on a null-continuation lease (which the AVAD endpoint 400s). - Preserve the existing LatestVersion backfill behavior introduced by PR #5617 (the fix for #5268). - Add unit tests AC1 (AVAD-negative), AC7 (explicit-LV positive, symmetric companion to AC1), AC8 (AVAD + empty-continuation lease for SE-7 upgrade path). - Add emulator regression test AC5 in CFP/AllVersionsAndDeletes/. - Add changelog entry under Unreleased. Fixes #5846 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- F2 (minor): Adds Known Issues entry pointing at #5846/#5852 to the changelog Known issues table so customers on 3.59.0 / 3.60.0-preview.0 see the cold-start AVAD regression and its workaround. - F3 (suggestion): Widens AC7 timing tolerance from 2s to 5s in ChangeFeedProcessorCoreTests.cs to harden against CI flake under parallel load. The assertion still defends the spec intent (StartTime ~ UtcNow - 1s). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run dotnet-v3-ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…cfp-avad-cold-start-5846 # Conflicts: # changelog.md
|
Is this not covered by the PR-5825 from @ananth7592 ? If so - let's close this as duplicate ? |
|
@kundadebdatta thanks for flagging — the chronology actually runs the other way. #5852 was opened on 2026-05-11 with the AVAD I'd prefer to keep #5852 and have #5825 drop the duplicated
Happy to coordinate with @ananth7592 on rebasing #5825 onto #5852 (or just dropping the |
There was a problem hiding this comment.
Maybe mid-term - what are your thoughts on issuing a changefeed from now to populate continuation for a given partition at the time of lease creation - might help with AVAD cases where document drain start is pinned to lease creation start? Thoughts @xinlian12 @FabianMeiswinkel @kushagraThapar
|
Thanks @jeet1995 for the approval and the forward-looking thought — I really like this direction. A lease-time
Tradeoffs worth flagging for @xinlian12 / @FabianMeiswinkel / @kushagraThapar:
I'd propose tracking this as a follow-up work item — it naturally subsumes reviewer question #3. Happy to open the issue when this hotfix merges so customers on 3.59.0 / 3.60.0-preview.0 aren't blocked on the mid-term design. |
…cfp-avad-cold-start-5846 # Conflicts: # .gitignore # changelog.md
Conflicts resolved: - Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/ChangeFeedProcessorCore.cs: Adopted main's defensive structure from #5825 (explicit AVAD throws for StartFromBeginning/StartTime plus the shouldAnchorStartTime boolean). The Mode != AllVersionsAndDeletes guard ΓÇö the core fix from #5852 ΓÇö is preserved. Extended the comment to cite both #5825 and #5846 so future readers see the LSN-based rationale and the original cold-start regression context. - changelog.md: Kept both Unreleased Bugs Fixed entries (#5852 cold-start regression and #5870 hedging StackOverflow). Validated by building Microsoft.Azure.Cosmos.Tests and running the ChangeFeedProcessorCoreTests filter; all 20 tests pass, including both the PR's regression tests (AC1/AC7/AC8) and main's defensive throw tests added in #5825. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kushagraThapar
left a comment
There was a problem hiding this comment.
First — thank you for the very thorough comment expansion in Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/ChangeFeedProcessorCore.cs:79-87. The expanded "Do not drop the mode guard without re-validating both #5268 and #5846" admonition, with both upstream and downstream PR citations inline, is exactly the kind of in-source documentation the recurring CFP regression chain (#105 → #158 → #2491 → #5807 → #5617 → #5846) has historically lacked. That's the most durable artifact in this diff and I'd love to see this pattern applied elsewhere in the SDK going forward.
I have five things below I'd love your thoughts on — none are blocking, all are framed as questions, but I think they're worth working through together before merge. I'm intentionally not raising the minor / nit-level items I noticed (changelog phrasing polish, two propose-design-doc-extension follow-ups for cosmosdb-design-docs/12-change-feed.md, the stale mergeable: CONFLICTING GitHub flag, and a tangential Java-team consideration) so this comment stays focused.
Major findings
1. PR description / title divergence vs actual diff
Could you please help me understand the framing of the PR description? The Summary section states "The fix is a single-line addition to the existing backfill condition" and shows a code block with the new Mode != ChangeFeedMode.AllVersionsAndDeletes clause — but that clause is already in origin/main as of #5825 (commit d044d4c1f, merged 2026-05-15), which I reviewed and approved. I verified by directly comparing gh api /repos/.../contents/...?ref=main against ?ref=users/ntripician/fix-cfp-avad-cold-start-5846:
$ git show origin/main:.../ChangeFeedProcessorCore.cs | sed -n '88,92p'
bool shouldAnchorStartTime =
!this.changeFeedProcessorOptions.StartFromBeginning
&& this.changeFeedProcessorOptions.StartTime == null
&& string.IsNullOrEmpty(this.changeFeedProcessorOptions.StartContinuation)
&& this.changeFeedProcessorOptions.Mode != ChangeFeedMode.AllVersionsAndDeletes;
The actual ChangeFeedProcessorCore.cs net diff for this PR is +7/-3 lines — all in the comment block. I saw your earlier reply to @kundadebdatta acknowledging the chronology (this PR opened 2026-05-11, #5825 picked the same carve-out up on 2026-05-12), and I agree this PR has genuine value as a docs + test-coverage + customer-changelog layer over the already-shipped fix. Could you please consider updating the PR description Summary and title to reflect that reality? Something like "This PR's scope is (a) defensive comment hardening citing both #5268 and #5846, (b) targeted regression test coverage (AC1, AC5, AC7, AC8) above what #5825 carried, and (c) a customer-facing Bugs Fixed + Known Issues changelog entry for #5846 on shipped 3.59.0 / 3.60.0-preview.0. The functional guard itself landed via #5825" — and a title along the lines of ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846)?
2. AC1 unit test appears to duplicate existing master test (post-rebase collision risk)
In Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/ChangeFeed/ChangeFeedProcessorCoreTests.cs, the new StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode test (AC1, around line 272 on the head branch) appears functionally identical to the pre-existing StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes at line 444 (added by #5825). Same Mock<DocumentServiceLeaseStore> / Mock<DocumentServiceLeaseContainer> / Mock<DocumentServiceLeaseStoreManager> setup, same Mode = AllVersionsAndDeletes options, same Assert.IsNull(options.StartTime). Verified by grep:
$ grep -n "StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes" .../ChangeFeedProcessorCoreTests.cs
276: public async Task StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode()
444: public async Task StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes()
$ git show origin/main:.../ChangeFeedProcessorCoreTests.cs | grep -n "StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes"
273: public async Task StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes()
Rebase-collision risk: once you rebase to resolve the mergeable: CONFLICTING state, both tests will land in the same file with effectively identical bodies — future readers will see two methods asserting the same thing back-to-back and won't know which is canonical. Worth deduplicating before rebase rather than after. Could you please help me understand what new coverage AC1 adds over the existing test? If the intent is to cite #5846 as an additional anti-refactor reference, would it be lighter-weight to either (a) drop AC1 and expand the leading comment / assertion-failure message on the existing test to cite both #5617 and #5846, or (b) consolidate both into a single [DataTestMethod] parameterized by rationale tag (e.g., [DataRow("#5268-AC1-defense")] / [DataRow("#5825-original")]) so the dual-fence intent survives without two near-identical method bodies? Either path would also resolve the merge conflict on this file in passing.
3. AC8 does not exercise the code path it claims to defend
In the same test file, AC8 (StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode_WithEmptyContinuationLease, around line 388) iterates over null and string.Empty for the per-lease DocumentServiceLeaseCore.ContinuationToken, asserting options.StartTime stays null. But the shouldAnchorStartTime decision at ChangeFeedProcessorCore.cs:88-92 reads this.changeFeedProcessorOptions.StartContinuation — an options-level field, not the per-lease ContinuationToken:
$ grep -n "StartContinuation" \
.../ChangeFeedProcessorCore.cs \
.../Configuration/ChangeFeedProcessorOptions.cs
.../ChangeFeedProcessorCore.cs:91: && string.IsNullOrEmpty(this.changeFeedProcessorOptions.StartContinuation)
.../Configuration/ChangeFeedProcessorOptions.cs:46: public string StartContinuation { get; set; }
Both rows of the AC8 loop hit the same code path AC1 already exercises, so the lease-mock variations are functionally inert. The PR description positions AC8 as defending the SE-7 upgrade-from-buggy-SDK lease state, but the assertion doesn't depend on the lease state being set up. I am curious to learn what specific code path AC8's ContinuationToken = null vs "" rows are intended to exercise. If the intent is to defend the SE-7 upgrade-path behavior, would it make sense to either (a) drop AC8 (covered by AC1 functionally + AC5 end-to-end via the emulator), or (b) keep AC8 but add an assertion against the materialized ChangeFeedStartFrom (via partition initializer) so the lease ContinuationToken values actually affect what's tested?
4. AC7 regression-fence reliability — tolerance shape may flake under CI load
For StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit (AC7, around line 370 in the test file), the assertion captures expectedApprox = DateTime.UtcNow.AddSeconds(-1) before StartAsync runs, then compares against the post-StartAsync options.StartTime with a 5-second tolerance:
DateTime expectedApprox = DateTime.UtcNow.AddSeconds(-1);
await processor.StartAsync();
Assert.IsTrue(
Math.Abs((expectedApprox - options.StartTime.Value).TotalSeconds) < 5,
...);Under heavy CI load (mock-heavy harness, lease-store mock initialization, partition manager warm-up), StartAsync could plausibly take >5s, producing a confusing failure unrelated to the AVAD mode guard this test is intended to fence. My worry is that AC7 becomes flaky → gets quarantined → defeats its purpose as a regression fence against a future contributor over-broadening the guard. Could you please consider tightening the assertion to a bracket bounded by start/end-of-call rather than a tolerance? Something like:
DateTime lowerBound = DateTime.UtcNow.AddSeconds(-1);
await processor.StartAsync();
DateTime upperBound = DateTime.UtcNow.AddSeconds(-1).AddMilliseconds(1);
Assert.IsTrue(options.StartTime.Value >= lowerBound, $"StartTime {options.StartTime.Value:O} < lowerBound {lowerBound:O}");
Assert.IsTrue(options.StartTime.Value <= upperBound, $"StartTime {options.StartTime.Value:O} > upperBound {upperBound:O}");This bounds by start/end-of-call, removes the magic 5s number, would fail loud if the guard is over-broadened (StartTime stays null → NRE on .Value), and can't be invalidated by CI slowness.
5. Changelog double-attribution may signal "already fixed in 3.60.0-preview.0" to customers
After this PR merges, changelog.md will carry two #### Bugs Fixed entries for the same Mode != AllVersionsAndDeletes carve-out:
### Unreleased
#### Bugs Fixed
- [5852] ChangeFeedProcessor: Fixes AllVersionsAndDeletes cold-start regression introduced by #5617
### 3.60.0-preview.0 (released 2026-04-24)
#### Bugs Fixed
- [5825] ChangeFeedProcessor: Exempts AllVersionsAndDeletes from implicit StartTime back-off
(not applicable to LSN-based continuation)
PR #5825 actually merged 2026-05-15T23:24:52Z — three weeks AFTER the 3.60.0-preview.0 release date — so the [5825] line is sitting under a release that predates the actual merge. Verified:
$ gh api repos/Azure/azure-cosmos-dotnet-v3/pulls/5825 -q '.merged_at'
2026-05-15T23:24:52Z
$ sed -n '34,42p' changelog.md
### <a name="3.60.0-preview.0"/> [3.60.0-preview.0](...) - 2026-4-24
#### Bugs Fixed
- [5825] ChangeFeedProcessor: Exempts AllVersionsAndDeletes from implicit StartTime back-off (...)
Customer-impact concern: a customer scanning the changelog for the #5846 fix will see the [5825] line under shipped 3.60.0-preview.0 plus this PR's entry under Unreleased, and reasonably conclude the regression was already fixed in 3.60.0-preview.0 — when 3.60.0-preview.0 actually contains the regression. The customer-readable signal should be "regression shipped in 3.59.0 + 3.60.0-preview.0; fix is in the next release". Could you please help me think through whether the changelog can be re-organized to reflect the actual ship semantics? Two paths that come to mind:
- (a) Move the existing
[5825]Bugs Fixed line out of the3.60.0-preview.0block and into Unreleased (since #5825 merged three weeks after that release), then either merge it with this PR's entry or keep them side-by-side with the #5852 line framed as the customer-facing Known-Issues follow-up. - (b) Leave the historical entries alone but rephrase this PR's entry to make the relationship explicit, e.g., "[5852] ChangeFeedProcessor: Records customer-facing Bugs Fixed entry, expanded comment fence, and regression test coverage for the #5846 AllVersionsAndDeletes cold-start regression (fix landed via #5825; affects shipped 3.59.0 + 3.60.0-preview.0)".
Either keeps customers unambiguously informed; the current state may risk signaling "already fixed in 3.60.0-preview.0" because the [5825] line is under that release header.
Keeping this comment focused on the five items above. I noticed a handful of minor / nit-level things (Bugs Fixed entry phrasing polish, two propose-design-doc-extension follow-ups against cosmosdb-design-docs/12-change-feed.md for AVAD endpoint semantics + CFP cold-start matrix per mode, the stale mergeable: CONFLICTING GitHub flag that just needs a branch nudge to recompute, and a tangential Java-team consideration about a missing builder-validation test) but I'm intentionally not raising them here so the conversation stays focused. Happy to dig into any of those separately if useful.
Five non-blocking review findings from @kushagraThapar: * F2 (AC1 duplicates existing test from #5825): Drop the new StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode and expand the leading comment on the existing StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes (added by #5825) to cite both #5268 (why backfill exists) and #5846 (why AVAD is excluded). Forms a dual-fence with the LatestVersion test. * F3 (AC8 does not exercise the path it claims): Drop StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode_WithEmptyContinuationLease. The guard at ChangeFeedProcessorCore.cs reads options-level StartContinuation, not per-lease DocumentServiceLeaseCore.ContinuationToken, so both loop rows hit the same code path the existing AVAD test already covers. AC5 (emulator) + the existing AVAD unit test cover the AVAD code path end-to-end. * F4 (AC7 5-second tolerance may flake under CI load): Replace the Math.Abs(...) < 5 tolerance with start/end-of-call bracket bounds in StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit. The new assertion cannot be invalidated by CI slowness and fails loud if a future contributor over-broadens the AVAD guard (StartTime stays null -> NRE on .Value). * F5 (Changelog double-attribution risk): Rephrase the Unreleased [5852] Bugs Fixed bullet to make explicit that the regression shipped in 3.59.0 + 3.60.0-preview.0, the functional fix landed via #5825, and this PR adds defensive coverage + the customer-facing Known Issues entry. Avoids signaling 'already fixed in 3.60.0-preview.0'. F1 (PR title / description divergence) addressed separately by editing the PR title and body via the GitHub UI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks @kushagraThapar — every one of these landed. Addressed in F1 — PR title / description divergence vs actual diff ✅You're right that the functional F2 — AC1 duplicates existing master test (post-rebase collision risk) ✅You're right, both about the duplication and the post-rebase collision risk. Took your path (a) — dropped F3 — AC8 does not exercise the code path it claims ✅Caught — you're right, F4 — AC7 regression-fence tolerance may flake under CI load ✅Took your bracket pattern verbatim. The assertion in DateTime lowerBound = DateTime.UtcNow.AddSeconds(-1);
await processor.StartAsync();
DateTime upperBound = DateTime.UtcNow.AddSeconds(-1).AddMilliseconds(1);
Assert.IsTrue(options.StartTime.HasValue);
Assert.AreEqual(DateTimeKind.Utc, options.StartTime.Value.Kind);
Assert.IsTrue(
options.StartTime.Value >= lowerBound,
$"StartTime {options.StartTime.Value:O} is earlier than pre-call lowerBound {lowerBound:O}.");
Assert.IsTrue(
options.StartTime.Value <= upperBound,
$"StartTime {options.StartTime.Value:O} is later than post-call upperBound {upperBound:O}.");Cannot be invalidated by CI slowness, fails loud if the guard is over-broadened ( F5 — Changelog double-attribution may signal "already fixed in 3.60.0-preview.0" ✅Good catch. Took your path (b) — left the historical sections alone (less risk than rewriting released-section history) and rephrased this PR's
A customer scanning Unreleased now sees the chronology unambiguously: regression shipped in Re your "minor / nit-level" items you intentionally didn't raise — happy to dig into any of them whenever it's useful (Bugs Fixed phrasing polish, the two |
…5852-ci # Conflicts: # changelog.md
Issue
Fixes #5846
Regressing PR
#5617 (which fixed #5268).
Scope (revised after PR review)
The functional
Mode != ChangeFeedMode.AllVersionsAndDeletescarve-out for theStartTimeback-off landed via #5825 (merged 2026-05-15). This PR opened earlier (2026-05-11) with the same carve-out as its sole purpose, but #5825 picked it up first.After @kushagraThapar's review, this PR's scope is now:
Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/ChangeFeedProcessorCore.cs— the expanded comment block at lines 79-87 cites bothChangeFeedProcessorskips the first change unless start time is specified #5268 (why the back-off exists) and ChangeFeedProcessor: Regression in PR #5617 breaks AllVersionsAndDeletes push processor on cold start #5846 (why AVAD is excluded), with explicit "do not drop the mode guard without re-validating both" admonition.StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit— symmetric companion to the existing AVAD test; over-broadening the new guard regressesChangeFeedProcessorskips the first change unless start time is specified #5268 loudly. Assertion uses start/end-of-call bracket bounds (per F4 review feedback) so it cannot be invalidated by CI slowness.StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes(from ChangeFeed: Adds Full Fidelity Change Feed APIs to GA surface #5825) citing bothChangeFeedProcessorskips the first change unless start time is specified #5268 and ChangeFeedProcessor: Regression in PR #5617 breaks AllVersionsAndDeletes push processor on cold start #5846.TestAllVersionsAndDeletesProcessor_ColdStart_DoesNotFail— emulator-level end-to-end coverage of the AVAD cold-start path (BuilderTests.cs).The actual
ChangeFeedProcessorCore.csnet diff for this PR is+7/-3lines — all in the comment block; the functional guard itself is at line 92 and was merged via #5825.Why decoupled from #5825
#5846 is customer-reported against shipped 3.59.0 and 3.60.0-preview.0. This PR is a cleanly cherry-pickable defensive layer (regression tests + comment fence + customer-facing changelog) that does not depend on the GA-promotion surface in #5825 (contract JSONs, Encryption override, GA-flag flip). Either PR can land independently.
Test plan
StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes(unit, from #5825)StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit(unit)StartAsync_SetsStartTime_WhenNoStartOptionsProvided(unit)StartAsync_DoesNotOverrideExplicitStartTime(unit)StartAsync_DoesNotSetStartTime_WhenStartFromBeginning(unit)StartAsync_ThrowsWhenAllVersionsAndDeletes_WithStartFromBeginning/WithStartTime(unit)TestAllVersionsAndDeletesProcessor_ColdStart_DoesNotFail(emulator)TestWithRunningProcessor_ImmediateWriteAfterStart(emulator)Local validation (post-review-feedback push
f295cc569):dotnet build— 0 errors / 0 warnings on the test project.dotnet test ... --filter ChangeFeedProcessorCoreTests— 18/18 passed.Review status
Phase 5 iteration 2 — review feedback from @kushagraThapar (2026-05-19) addressed in commit
f295cc569:StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode; expanded leading comment on the existingStartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes(from #5825) to cite both #5268 and #5846.StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode_WithEmptyContinuationLease. The guard reads options-levelStartContinuation, not per-leaseContinuationToken, so both rows hit the same code path the existing AVAD test covers. AC5 (emulator) covers the AVAD code path end-to-end.Math.Abs(...) < 5with start/end-of-call bracket bounds: pre-calllowerBound = UtcNow - 1s, post-callupperBound = UtcNow - 1s + 1ms. Cannot be invalidated by CI slowness.[5852]bullet to make explicit that the regression shipped in 3.59.0 + 3.60.0-preview.0, the functional fix landed via #5825, and this PR adds defensive coverage + the customer-facing Known Issues entry. Historical sections left intact.Phase 5 iteration 1 — automated Deep Reviewer (Opus 4.7 High) returned
comment(no critical/major; 2 minor + 1 suggestion):.gitignorechange bundled in branch.coding-harness/to.gitignore. Harness scratch state only; no consumer impact.Known issuestable inchangelog.mdpointing at #5846 (regression) and #5852 (this fix).Cross-SDK Impact
This regression is .NET-only by construction:
ChangeFeedProcessorBuilderhas no analogue of the LatestVersionStartTimebackfill.No cross-SDK action required.
Reviewer questions
Known Issues entry for 3.59.0 / 3.60.0-preview.0?Resolved. Per F2 (iter 1) above.Changelog PR-number placeholder.Resolved. PR number5852substituted inchangelog.md.ChangeFeedProcessorskips the first change unless start time is specified #5268-style first-change-skip race? This fix explicitly does not mitigate that hypothetical race (no evidence one exists). Recommend filing a tracking issue at merge time. Still open — reviewer call. See also @jeet1995's lease-timeChangeFeedStartFrom.Nowprefetch proposal in the thread above.