Skip to content

ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846)#5852

Open
NaluTripician wants to merge 11 commits into
mainfrom
users/ntripician/fix-cfp-avad-cold-start-5846
Open

ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846)#5852
NaluTripician wants to merge 11 commits into
mainfrom
users/ntripician/fix-cfp-avad-cold-start-5846

Conversation

@NaluTripician
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician commented May 11, 2026

Issue

Fixes #5846

Regressing PR

#5617 (which fixed #5268).

Scope (revised after PR review)

The functional Mode != ChangeFeedMode.AllVersionsAndDeletes carve-out for the StartTime back-off landed via #5825 (merged 2026-05-15). This PR opened earlier (2026-05-11) with the same carve-out as its sole purpose, but #5825 picked it up first.

After @kushagraThapar's review, this PR's scope is now:

The actual ChangeFeedProcessorCore.cs net diff for this PR is +7/-3 lines — all in the comment block; the functional guard itself is at line 92 and was merged via #5825.

Why decoupled from #5825

#5846 is customer-reported against shipped 3.59.0 and 3.60.0-preview.0. This PR is a cleanly cherry-pickable defensive layer (regression tests + comment fence + customer-facing changelog) that does not depend on the GA-promotion surface in #5825 (contract JSONs, Encryption override, GA-flag flip). Either PR can land independently.

Test plan

Test Status
StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes (unit, from #5825) enriched leading comment cites both #5268 and #5846
StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit (unit) new — symmetric companion; defends #5268 if a future contributor over-broadens the AVAD guard. Bracket-bound assertion (per F4 review).
StartAsync_SetsStartTime_WhenNoStartOptionsProvided (unit) unchanged — defends #5268 (default-Mode safety net)
StartAsync_DoesNotOverrideExplicitStartTime (unit) unchanged
StartAsync_DoesNotSetStartTime_WhenStartFromBeginning (unit) unchanged
StartAsync_ThrowsWhenAllVersionsAndDeletes_WithStartFromBeginning / WithStartTime (unit) unchanged
TestAllVersionsAndDeletesProcessor_ColdStart_DoesNotFail (emulator) new — defends #5846 end-to-end; observer must receive ≥1 change without HTTP 400
TestWithRunningProcessor_ImmediateWriteAfterStart (emulator) unchanged — defends #5268 end-to-end

Local validation (post-review-feedback push f295cc569):

  • dotnet build0 errors / 0 warnings on the test project.
  • dotnet test ... --filter ChangeFeedProcessorCoreTests18/18 passed.
  • Emulator tests run in CI (no local emulator available).

Review status

Phase 5 iteration 2 — review feedback from @kushagraThapar (2026-05-19) addressed in commit f295cc569:

Finding Severity Decision Resolution
F1 — PR title / description divergence vs actual diff major (non-blocking) fix PR title + description rewritten to position scope as defensive coverage + customer changelog (this rewrite).
F2 — AC1 duplicates existing master test from #5825 major (non-blocking) fix Dropped StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode; expanded leading comment on the existing StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes (from #5825) to cite both #5268 and #5846.
F3 — AC8 does not exercise the path it claims major (non-blocking) fix Dropped StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode_WithEmptyContinuationLease. The guard reads options-level StartContinuation, not per-lease ContinuationToken, so both rows hit the same code path the existing AVAD test covers. AC5 (emulator) covers the AVAD code path end-to-end.
F4 — AC7 5-second tolerance may flake under CI load major (non-blocking) fix Replaced Math.Abs(...) < 5 with start/end-of-call bracket bounds: pre-call lowerBound = UtcNow - 1s, post-call upperBound = UtcNow - 1s + 1ms. Cannot be invalidated by CI slowness.
F5 — Changelog double-attribution risk major (non-blocking) fix (path b) Rephrased the Unreleased [5852] bullet to make explicit that the regression shipped in 3.59.0 + 3.60.0-preview.0, the functional fix landed via #5825, and this PR adds defensive coverage + the customer-facing Known Issues entry. Historical sections left intact.

Phase 5 iteration 1 — automated Deep Reviewer (Opus 4.7 High) returned comment (no critical/major; 2 minor + 1 suggestion):

Finding Severity Decision Resolution
F1 (iter 1) — Unrelated .gitignore change bundled in branch minor skip + note Branch carries a single ancillary commit adding .coding-harness/ to .gitignore. Harness scratch state only; no consumer impact.
F2 (iter 1) — Missing Known Issues entries for 3.59.0 / 3.60.0-preview.0 minor fix Added a row to the existing Known issues table in changelog.md pointing at #5846 (regression) and #5852 (this fix).
F3 (iter 1) — AC7 timing tolerance suggestion superseded Iter 1 widened the tolerance to 5s; iter 2 superseded that with bracket bounds (see F4 above).

Cross-SDK Impact

This regression is .NET-only by construction:

SDK Status Evidence
Java bug_not_present The Java AVAD ChangeFeedProcessorBuilder has no analogue of the LatestVersion StartTime backfill.
Python / Node / Go / Spark connector bug_not_present The regressing PR (#5617) touched only .NET files.

No cross-SDK action required.

Reviewer questions

  1. Known Issues entry for 3.59.0 / 3.60.0-preview.0? Resolved. Per F2 (iter 1) above.
  2. Changelog PR-number placeholder. Resolved. PR number 5852 substituted in changelog.md.
  3. Follow-up issue: investigate whether AVAD has a ChangeFeedProcessor skips the first change unless start time is specified #5268-style first-change-skip race? This fix explicitly does not mitigate that hypothetical race (no evidence one exists). Recommend filing a tracking issue at merge time. Still open — reviewer call. See also @jeet1995's lease-time ChangeFeedStartFrom.Now prefetch proposal in the thread above.

NaluTripician and others added 4 commits May 11, 2026 11:46
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n introduced by #5617

- Add `Mode != AllVersionsAndDeletes` guard to the StartTime backfill in
  ChangeFeedProcessorCore.StartAsync() so AVAD push processors no longer
  issue an explicit StartTime on a null-continuation lease (which the
  AVAD endpoint 400s).
- Preserve the existing LatestVersion backfill behavior introduced by PR
  #5617 (the fix for #5268).
- Add unit tests AC1 (AVAD-negative), AC7 (explicit-LV positive,
  symmetric companion to AC1), AC8 (AVAD + empty-continuation lease
  for SE-7 upgrade path).
- Add emulator regression test AC5 in CFP/AllVersionsAndDeletes/.
- Add changelog entry under Unreleased.

Fixes #5846

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- F2 (minor): Adds Known Issues entry pointing at #5846/#5852 to the
  changelog Known issues table so customers on 3.59.0 / 3.60.0-preview.0
  see the cold-start AVAD regression and its workaround.
- F3 (suggestion): Widens AC7 timing tolerance from 2s to 5s in
  ChangeFeedProcessorCoreTests.cs to harden against CI flake under
  parallel load. The assertion still defends the spec intent
  (StartTime ~ UtcNow - 1s).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

/azp run dotnet-v3-ci

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@kundadebdatta
Copy link
Copy Markdown
Member

Is this not covered by the PR-5825 from @ananth7592 ? If so - let's close this as duplicate ?

@NaluTripician
Copy link
Copy Markdown
Contributor Author

@kundadebdatta thanks for flagging — the chronology actually runs the other way. #5852 was opened on 2026-05-11 with the AVAD StartTime carve-out as its sole purpose; #5825 picked the same carve-out up on 2026-05-12 in commit 8234a986 while addressing its Deep Review comments. So the overlap is real, but #5852 is the originating fix, not the duplicate.

I'd prefer to keep #5852 and have #5825 drop the duplicated ChangeFeedProcessorCore.cs hunk, because:

  1. ChangeFeedProcessor: Regression in PR #5617 breaks AllVersionsAndDeletes push processor on cold start #5846 is customer-reported against shipped 3.59.0 and 3.60.0-preview.0. ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846) #5852 is a 5-file / +249 hotfix that's cleanly cherry-pickable; ChangeFeed: Adds Full Fidelity Change Feed APIs to GA surface #5825 is a 17-file GA-promotion PR (including contract JSONs and the Encryption override that's still in review with @kushagraThapar) — not a hotfix vehicle. Decoupling the regression fix from the GA promotion lets us ship a 3.60.0-preview.1 (and optionally a 3.59.x) without waiting on the GA decision.
  2. Behavior parity: ChangeFeed: Adds Full Fidelity Change Feed APIs to GA surface #5825 also adds two new InvalidOperationException throws in StartAsync for AVAD + StartFromBeginning / + explicit StartTime. Those are net-new validation behavior on top of the regression fix; ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846) #5852 reverts strictly to pre-ChangeFeedProcessor: Fixes first-change skip during initial startup by anchoring StartTime #5617 StartFromNow semantics. Whether we want the extra throws is a separate decision and arguably belongs in the GA-promotion PR or a follow-up — not bundled into a customer hotfix.
  3. Coverage: ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846) #5852 carries the only end-to-end emulator test (TestAllVersionsAndDeletesProcessor_ColdStart_DoesNotFail in BuilderTests.cs), plus AC7 (LatestVersion symmetry — fails if a future contributor over-broadens the guard) and AC8 (upgrade-path leases with null / "" continuation, defending SE-7 in the side-effects log for customers upgrading from the buggy SDK). ChangeFeed: Adds Full Fidelity Change Feed APIs to GA surface #5825's single AVAD unit test doesn't cover any of these.
  4. The inline comment in ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846) #5852 cites both ChangeFeedProcessor skips the first change unless start time is specified #5268 and ChangeFeedProcessor: Regression in PR #5617 breaks AllVersionsAndDeletes push processor on cold start #5846 explicitly as an anti-refactor fence.

Happy to coordinate with @ananth7592 on rebasing #5825 onto #5852 (or just dropping the ChangeFeedProcessorCore.cs hunk from #5825 once #5852 merges) — that way #5825 stays pure GA-surface work.

jeet1995
jeet1995 previously approved these changes May 14, 2026
Copy link
Copy Markdown
Member

@jeet1995 jeet1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mid-term - what are your thoughts on issuing a changefeed from now to populate continuation for a given partition at the time of lease creation - might help with AVAD cases where document drain start is pinned to lease creation start? Thoughts @xinlian12 @FabianMeiswinkel @kushagraThapar

@NaluTripician
Copy link
Copy Markdown
Contributor Author

Thanks @jeet1995 for the approval and the forward-looking thought — I really like this direction.

A lease-time ChangeFeedStartFrom.Now prefetch is strictly cleaner than any StartTime heuristic, for three reasons:

  1. Eliminates the race at the source. Even after this fix, AVAD cold-start has a small window between lease creation and the first changefeed request where writes could land unobserved (the hypothetical SE-3 / open reviewer question Fixed the maxConcurrency parameter summary #3 in the description). Pinning an etag at lease creation collapses that window to zero — the server-side anchor is established before document drain begins.
  2. Symmetric across modes, and removes the heuristic. The same lease-time prefetch fixes ChangeFeedProcessor skips the first change unless start time is specified #5268 (LatestVersion first-change-skip) without needing the StartTime = UtcNow.AddSeconds(-1) backfill at all. Mid-term, this PR's mode-guard and the original backfill both go away — replaced by a precise server-anchored etag persisted in the lease's existing ContinuationToken field.
  3. No clock skew, no magic number. AddSeconds(-1) is a guess about replication / clock skew. A real etag is exact.

Tradeoffs worth flagging for @xinlian12 / @FabianMeiswinkel / @kushagraThapar:

  • Cost. One extra round-trip per partition at lease creation. Amortized over the lease's lifetime it's negligible, but worth confirming for large-partition bootstrap.
  • Compatibility. Leases written by 3.57.0-preview.1 / 3.58.x / 3.59.x won't carry a prefetched token. Need to keep a fallback path (today's ChangeFeedStartFrom.Now for AVAD, the StartTime backfill for LatestVersion) so the upgrade is strictly additive.
  • Where it lives. DocumentServiceLeaseManager.CreateLeaseIfNotExistAsync rather than ChangeFeedProcessorCore.StartAsync — keeps this hotfix's surface unchanged and lets the refactor land cleanly on top.
  • Cross-SDK. Java's CFP doesn't carry the LatestVersion backfill (Cross-SDK Impact table above), so they'd gain race-elimination but not a regression fix. Worth a spec-level alignment discussion.

I'd propose tracking this as a follow-up work item — it naturally subsumes reviewer question #3. Happy to open the issue when this hotfix merges so customers on 3.59.0 / 3.60.0-preview.0 aren't blocked on the mid-term design.

…cfp-avad-cold-start-5846

# Conflicts:
#	.gitignore
#	changelog.md
Conflicts resolved:
- Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/ChangeFeedProcessorCore.cs:
  Adopted main's defensive structure from #5825 (explicit AVAD throws for
  StartFromBeginning/StartTime plus the shouldAnchorStartTime boolean). The
  Mode != AllVersionsAndDeletes guard ΓÇö the core fix from #5852 ΓÇö is
  preserved. Extended the comment to cite both #5825 and #5846 so future
  readers see the LSN-based rationale and the original cold-start
  regression context.
- changelog.md: Kept both Unreleased Bugs Fixed entries (#5852 cold-start
  regression and #5870 hedging StackOverflow).

Validated by building Microsoft.Azure.Cosmos.Tests and running the
ChangeFeedProcessorCoreTests filter; all 20 tests pass, including both the
PR's regression tests (AC1/AC7/AC8) and main's defensive throw tests added
in #5825.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First — thank you for the very thorough comment expansion in Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/ChangeFeedProcessorCore.cs:79-87. The expanded "Do not drop the mode guard without re-validating both #5268 and #5846" admonition, with both upstream and downstream PR citations inline, is exactly the kind of in-source documentation the recurring CFP regression chain (#105#158#2491#5807#5617#5846) has historically lacked. That's the most durable artifact in this diff and I'd love to see this pattern applied elsewhere in the SDK going forward.

I have five things below I'd love your thoughts on — none are blocking, all are framed as questions, but I think they're worth working through together before merge. I'm intentionally not raising the minor / nit-level items I noticed (changelog phrasing polish, two propose-design-doc-extension follow-ups for cosmosdb-design-docs/12-change-feed.md, the stale mergeable: CONFLICTING GitHub flag, and a tangential Java-team consideration) so this comment stays focused.


Major findings

1. PR description / title divergence vs actual diff

Could you please help me understand the framing of the PR description? The Summary section states "The fix is a single-line addition to the existing backfill condition" and shows a code block with the new Mode != ChangeFeedMode.AllVersionsAndDeletes clause — but that clause is already in origin/main as of #5825 (commit d044d4c1f, merged 2026-05-15), which I reviewed and approved. I verified by directly comparing gh api /repos/.../contents/...?ref=main against ?ref=users/ntripician/fix-cfp-avad-cold-start-5846:

$ git show origin/main:.../ChangeFeedProcessorCore.cs | sed -n '88,92p'
                bool shouldAnchorStartTime =
                    !this.changeFeedProcessorOptions.StartFromBeginning
                    && this.changeFeedProcessorOptions.StartTime == null
                    && string.IsNullOrEmpty(this.changeFeedProcessorOptions.StartContinuation)
                    && this.changeFeedProcessorOptions.Mode != ChangeFeedMode.AllVersionsAndDeletes;

The actual ChangeFeedProcessorCore.cs net diff for this PR is +7/-3 lines — all in the comment block. I saw your earlier reply to @kundadebdatta acknowledging the chronology (this PR opened 2026-05-11, #5825 picked the same carve-out up on 2026-05-12), and I agree this PR has genuine value as a docs + test-coverage + customer-changelog layer over the already-shipped fix. Could you please consider updating the PR description Summary and title to reflect that reality? Something like "This PR's scope is (a) defensive comment hardening citing both #5268 and #5846, (b) targeted regression test coverage (AC1, AC5, AC7, AC8) above what #5825 carried, and (c) a customer-facing Bugs Fixed + Known Issues changelog entry for #5846 on shipped 3.59.0 / 3.60.0-preview.0. The functional guard itself landed via #5825" — and a title along the lines of ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846)?


2. AC1 unit test appears to duplicate existing master test (post-rebase collision risk)

In Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/ChangeFeed/ChangeFeedProcessorCoreTests.cs, the new StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode test (AC1, around line 272 on the head branch) appears functionally identical to the pre-existing StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes at line 444 (added by #5825). Same Mock<DocumentServiceLeaseStore> / Mock<DocumentServiceLeaseContainer> / Mock<DocumentServiceLeaseStoreManager> setup, same Mode = AllVersionsAndDeletes options, same Assert.IsNull(options.StartTime). Verified by grep:

$ grep -n "StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes" .../ChangeFeedProcessorCoreTests.cs
276:        public async Task StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode()
444:        public async Task StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes()

$ git show origin/main:.../ChangeFeedProcessorCoreTests.cs | grep -n "StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes"
273:        public async Task StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes()

Rebase-collision risk: once you rebase to resolve the mergeable: CONFLICTING state, both tests will land in the same file with effectively identical bodies — future readers will see two methods asserting the same thing back-to-back and won't know which is canonical. Worth deduplicating before rebase rather than after. Could you please help me understand what new coverage AC1 adds over the existing test? If the intent is to cite #5846 as an additional anti-refactor reference, would it be lighter-weight to either (a) drop AC1 and expand the leading comment / assertion-failure message on the existing test to cite both #5617 and #5846, or (b) consolidate both into a single [DataTestMethod] parameterized by rationale tag (e.g., [DataRow("#5268-AC1-defense")] / [DataRow("#5825-original")]) so the dual-fence intent survives without two near-identical method bodies? Either path would also resolve the merge conflict on this file in passing.


3. AC8 does not exercise the code path it claims to defend

In the same test file, AC8 (StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode_WithEmptyContinuationLease, around line 388) iterates over null and string.Empty for the per-lease DocumentServiceLeaseCore.ContinuationToken, asserting options.StartTime stays null. But the shouldAnchorStartTime decision at ChangeFeedProcessorCore.cs:88-92 reads this.changeFeedProcessorOptions.StartContinuation — an options-level field, not the per-lease ContinuationToken:

$ grep -n "StartContinuation" \
    .../ChangeFeedProcessorCore.cs \
    .../Configuration/ChangeFeedProcessorOptions.cs
.../ChangeFeedProcessorCore.cs:91:                && string.IsNullOrEmpty(this.changeFeedProcessorOptions.StartContinuation)
.../Configuration/ChangeFeedProcessorOptions.cs:46:        public string StartContinuation { get; set; }

Both rows of the AC8 loop hit the same code path AC1 already exercises, so the lease-mock variations are functionally inert. The PR description positions AC8 as defending the SE-7 upgrade-from-buggy-SDK lease state, but the assertion doesn't depend on the lease state being set up. I am curious to learn what specific code path AC8's ContinuationToken = null vs "" rows are intended to exercise. If the intent is to defend the SE-7 upgrade-path behavior, would it make sense to either (a) drop AC8 (covered by AC1 functionally + AC5 end-to-end via the emulator), or (b) keep AC8 but add an assertion against the materialized ChangeFeedStartFrom (via partition initializer) so the lease ContinuationToken values actually affect what's tested?


4. AC7 regression-fence reliability — tolerance shape may flake under CI load

For StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit (AC7, around line 370 in the test file), the assertion captures expectedApprox = DateTime.UtcNow.AddSeconds(-1) before StartAsync runs, then compares against the post-StartAsync options.StartTime with a 5-second tolerance:

DateTime expectedApprox = DateTime.UtcNow.AddSeconds(-1);
await processor.StartAsync();
Assert.IsTrue(
    Math.Abs((expectedApprox - options.StartTime.Value).TotalSeconds) < 5,
    ...);

Under heavy CI load (mock-heavy harness, lease-store mock initialization, partition manager warm-up), StartAsync could plausibly take >5s, producing a confusing failure unrelated to the AVAD mode guard this test is intended to fence. My worry is that AC7 becomes flaky → gets quarantined → defeats its purpose as a regression fence against a future contributor over-broadening the guard. Could you please consider tightening the assertion to a bracket bounded by start/end-of-call rather than a tolerance? Something like:

DateTime lowerBound = DateTime.UtcNow.AddSeconds(-1);
await processor.StartAsync();
DateTime upperBound = DateTime.UtcNow.AddSeconds(-1).AddMilliseconds(1);
Assert.IsTrue(options.StartTime.Value >= lowerBound, $"StartTime {options.StartTime.Value:O} < lowerBound {lowerBound:O}");
Assert.IsTrue(options.StartTime.Value <= upperBound, $"StartTime {options.StartTime.Value:O} > upperBound {upperBound:O}");

This bounds by start/end-of-call, removes the magic 5s number, would fail loud if the guard is over-broadened (StartTime stays null → NRE on .Value), and can't be invalidated by CI slowness.


5. Changelog double-attribution may signal "already fixed in 3.60.0-preview.0" to customers

After this PR merges, changelog.md will carry two #### Bugs Fixed entries for the same Mode != AllVersionsAndDeletes carve-out:

### Unreleased
  #### Bugs Fixed
  - [5852] ChangeFeedProcessor: Fixes AllVersionsAndDeletes cold-start regression introduced by #5617

### 3.60.0-preview.0  (released 2026-04-24)
  #### Bugs Fixed
  - [5825] ChangeFeedProcessor: Exempts AllVersionsAndDeletes from implicit StartTime back-off
          (not applicable to LSN-based continuation)

PR #5825 actually merged 2026-05-15T23:24:52Z — three weeks AFTER the 3.60.0-preview.0 release date — so the [5825] line is sitting under a release that predates the actual merge. Verified:

$ gh api repos/Azure/azure-cosmos-dotnet-v3/pulls/5825 -q '.merged_at'
2026-05-15T23:24:52Z

$ sed -n '34,42p' changelog.md
### <a name="3.60.0-preview.0"/> [3.60.0-preview.0](...) - 2026-4-24
#### Bugs Fixed
- [5825] ChangeFeedProcessor: Exempts AllVersionsAndDeletes from implicit StartTime back-off (...)

Customer-impact concern: a customer scanning the changelog for the #5846 fix will see the [5825] line under shipped 3.60.0-preview.0 plus this PR's entry under Unreleased, and reasonably conclude the regression was already fixed in 3.60.0-preview.0 — when 3.60.0-preview.0 actually contains the regression. The customer-readable signal should be "regression shipped in 3.59.0 + 3.60.0-preview.0; fix is in the next release". Could you please help me think through whether the changelog can be re-organized to reflect the actual ship semantics? Two paths that come to mind:

  • (a) Move the existing [5825] Bugs Fixed line out of the 3.60.0-preview.0 block and into Unreleased (since #5825 merged three weeks after that release), then either merge it with this PR's entry or keep them side-by-side with the #5852 line framed as the customer-facing Known-Issues follow-up.
  • (b) Leave the historical entries alone but rephrase this PR's entry to make the relationship explicit, e.g., "[5852] ChangeFeedProcessor: Records customer-facing Bugs Fixed entry, expanded comment fence, and regression test coverage for the #5846 AllVersionsAndDeletes cold-start regression (fix landed via #5825; affects shipped 3.59.0 + 3.60.0-preview.0)".

Either keeps customers unambiguously informed; the current state may risk signaling "already fixed in 3.60.0-preview.0" because the [5825] line is under that release header.


Keeping this comment focused on the five items above. I noticed a handful of minor / nit-level things (Bugs Fixed entry phrasing polish, two propose-design-doc-extension follow-ups against cosmosdb-design-docs/12-change-feed.md for AVAD endpoint semantics + CFP cold-start matrix per mode, the stale mergeable: CONFLICTING GitHub flag that just needs a branch nudge to recompute, and a tangential Java-team consideration about a missing builder-validation test) but I'm intentionally not raising them here so the conversation stays focused. Happy to dig into any of those separately if useful.

Five non-blocking review findings from @kushagraThapar:

* F2 (AC1 duplicates existing test from #5825): Drop the new
  StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode and
  expand the leading comment on the existing
  StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes (added by
  #5825) to cite both #5268 (why backfill exists) and #5846 (why AVAD
  is excluded). Forms a dual-fence with the LatestVersion test.

* F3 (AC8 does not exercise the path it claims): Drop
  StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode_WithEmptyContinuationLease.
  The guard at ChangeFeedProcessorCore.cs reads options-level
  StartContinuation, not per-lease DocumentServiceLeaseCore.ContinuationToken,
  so both loop rows hit the same code path the existing AVAD test
  already covers. AC5 (emulator) + the existing AVAD unit test cover
  the AVAD code path end-to-end.

* F4 (AC7 5-second tolerance may flake under CI load): Replace the
  Math.Abs(...) < 5 tolerance with start/end-of-call bracket bounds
  in StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit. The new
  assertion cannot be invalidated by CI slowness and fails loud if a
  future contributor over-broadens the AVAD guard (StartTime stays
  null -> NRE on .Value).

* F5 (Changelog double-attribution risk): Rephrase the Unreleased
  [5852] Bugs Fixed bullet to make explicit that the regression
  shipped in 3.59.0 + 3.60.0-preview.0, the functional fix landed via
  #5825, and this PR adds defensive coverage + the customer-facing
  Known Issues entry. Avoids signaling 'already fixed in
  3.60.0-preview.0'.

F1 (PR title / description divergence) addressed separately by
editing the PR title and body via the GitHub UI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician NaluTripician changed the title ChangeFeedProcessor: Fixes AllVersionsAndDeletes cold-start regression introduced by #5617 ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846) May 19, 2026
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Thanks @kushagraThapar — every one of these landed. Addressed in f295cc569 (just pushed) plus a rewrite of the PR title and description. Walking through each:

F1 — PR title / description divergence vs actual diff ✅

You're right that the functional Mode != AllVersionsAndDeletes guard landed via #5825 on 2026-05-15, and the actual diff on this branch (apart from tests + changelog) is +7/-3 comment-only lines in ChangeFeedProcessorCore.cs. Took your proposed framing essentially verbatim — new title: ChangeFeedProcessor: Adds defensive regression coverage and changelog for AllVersionsAndDeletes cold-start fix (#5846). The PR description is fully rewritten to position the scope as (a) defensive comment hardening, (b) targeted regression tests above what #5825 carried, (c) customer-facing changelog + Known Issues entries. Each "fix" row in the new Review status table cites the specific finding (F1–F5) so the chain stays auditable.

F2 — AC1 duplicates existing master test (post-rebase collision risk) ✅

You're right, both about the duplication and the post-rebase collision risk. Took your path (a) — dropped StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode (the AC1 method) and enriched the leading comment on the existing StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletes (from #5825) to cite both #5268 and #5846 explicitly, including the "do not delete either without re-validating both" admonition. The dual-fence intent is preserved as a comment on the existing test paired with the new LatestVersion test, so there are no longer two near-identical method bodies and the merge conflict on this file is resolved in passing.

F3 — AC8 does not exercise the code path it claims ✅

Caught — you're right, shouldAnchorStartTime reads this.changeFeedProcessorOptions.StartContinuation (options-level), not per-lease DocumentServiceLeaseCore.ContinuationToken, so both rows of the loop hit the same code path AC1 already covered. Took your path (a) — dropped StartAsync_DoesNotSetStartTime_WhenAllVersionsAndDeletesMode_WithEmptyContinuationLease entirely. The AVAD cold-start path is now covered by the existing unit test (with the enriched comment fence) and AC5 end-to-end via TestAllVersionsAndDeletesProcessor_ColdStart_DoesNotFail in the emulator suite. The SE-7 upgrade-from-buggy-SDK lease state would need a deeper assertion against materialized ChangeFeedStartFrom (your option b) which isn't a great fit for this defensive layer — happy to revisit if the SE-7 path turns out to need explicit coverage.

F4 — AC7 regression-fence tolerance may flake under CI load ✅

Took your bracket pattern verbatim. The assertion in StartAsync_SetsStartTime_WhenLatestVersionMode_Explicit is now:

DateTime lowerBound = DateTime.UtcNow.AddSeconds(-1);
await processor.StartAsync();
DateTime upperBound = DateTime.UtcNow.AddSeconds(-1).AddMilliseconds(1);

Assert.IsTrue(options.StartTime.HasValue);
Assert.AreEqual(DateTimeKind.Utc, options.StartTime.Value.Kind);
Assert.IsTrue(
    options.StartTime.Value >= lowerBound,
    $"StartTime {options.StartTime.Value:O} is earlier than pre-call lowerBound {lowerBound:O}.");
Assert.IsTrue(
    options.StartTime.Value <= upperBound,
    $"StartTime {options.StartTime.Value:O} is later than post-call upperBound {upperBound:O}.");

Cannot be invalidated by CI slowness, fails loud if the guard is over-broadened (StartTime stays null → NRE on .Value), and the leading comment now explicitly calls out the "fixed-tolerance windows can silently quarantine the regression fence" rationale so a future contributor doesn't reintroduce a magic number.

F5 — Changelog double-attribution may signal "already fixed in 3.60.0-preview.0" ✅

Good catch. Took your path (b) — left the historical sections alone (less risk than rewriting released-section history) and rephrased this PR's [5852] Bugs Fixed bullet under Unreleased to make the ship semantics explicit. New text:

[5852] ChangeFeedProcessor: Fixes AllVersionsAndDeletes cold-start regression introduced by #5617 (affects shipped 3.59.0 and 3.60.0-preview.0; the functional Mode != AllVersionsAndDeletes guard for the StartTime back-off landed via #5825 — this PR adds defensive regression coverage, the in-source comment fence citing both #5268 and #5846, and the customer-facing Known Issues entry below).

A customer scanning Unreleased now sees the chronology unambiguously: regression shipped in 3.59.0 + 3.60.0-preview.0, functional fix landed via #5825, customer-facing entry + regression tests + Known Issues row are this PR.


Re your "minor / nit-level" items you intentionally didn't raise — happy to dig into any of them whenever it's useful (Bugs Fixed phrasing polish, the two propose-design-doc-extension follow-ups for cosmosdb-design-docs/12-change-feed.md on AVAD endpoint semantics + CFP cold-start matrix per mode, the stale mergeable: CONFLICTING flag that the new push should clear, and the Java-team builder-validation consideration). Open one whenever — and thanks again for the level of rigor on the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants