Skip to content

feat(firmware): wait for application readiness after PIC32 reconnect (closes #145)#200

Merged
tylerkron merged 17 commits into
mainfrom
feat/post-reconnect-readiness-probe
May 15, 2026
Merged

feat(firmware): wait for application readiness after PIC32 reconnect (closes #145)#200
tylerkron merged 17 commits into
mainfrom
feat/post-reconnect-readiness-probe

Conversation

@cptkoolbeenz
Copy link
Copy Markdown
Member

Summary

Adds an opt-in readiness probe so a PIC32 firmware update doesn't transition to Complete until the device is actually ready to answer normal application commands. Serial transport re-enumeration succeeds well before the application firmware is up; without this wait, downstream flows (LAN chip-info queries, WiFi prep) hit a half-started device — exactly the pattern desktop had to work around with its own retry loop.

API additions

FirmwareUpdateServiceOptions gains 3 new properties:

  • PostReconnectReadinessProbeFunc<IStreamingDevice, CancellationToken, Task<bool>>?. Returns true when the application is responsive. Null disables the wait (legacy behavior preserved).
  • PostReconnectReadinessTimeout (default 30s) — wall-clock budget for the wait.
  • PostReconnectReadinessRetryDelay (default 500ms) — delay between probe attempts.

When the timeout elapses without the probe returning true, the update transitions to Failed with a TimeoutException wrapped in FirmwareUpdateExceptionNOT a silent Complete on a half-ready device, which is the entire point of #145.

Test plan

  • Probe succeeds on attempt N>1 holds back Complete until ready (assert Complete state + probe call count = 3)
  • Timeout raises a properly wrapped FirmwareUpdateException with FailedState=JumpingToApp and the readiness keyword in the inner message
  • Null probe preserves legacy fast-complete path (belt-and-suspenders)
  • Build clean on net9.0 + net10.0
  • Full suite: 893/895 (2 skipped require live hardware)

Closes #145

…loses #145)

Adds an opt-in readiness probe to FirmwareUpdateService so a PIC32
firmware update doesn't transition to Complete until the device is
actually ready to answer normal application commands. The serial
transport re-enumerates well before the application firmware is up;
without this wait, downstream flows (LAN chip-info queries, WiFi
prep) hit a half-started device and either fail or have to
reimplement their own retry loop in the calling app — exactly the
pattern desktop had to work around.

API additions on FirmwareUpdateServiceOptions:
- PostReconnectReadinessProbe — Func<IStreamingDevice,
  CancellationToken, Task<bool>>?. Returns true when the application
  is responsive. Null disables the wait (legacy behavior).
- PostReconnectReadinessTimeout (default 30s) — wall-clock budget
- PostReconnectReadinessRetryDelay (default 500ms) — between probe
  attempts

When the timeout elapses without the probe returning true, the
update transitions to Failed with a clear TimeoutException wrapped
in FirmwareUpdateException — NOT a silent Complete on a half-ready
device.

Test plan: 3 new tests cover (a) probe succeeds on attempt N >1
holding back Complete until ready, (b) timeout raises a properly
wrapped FirmwareUpdateException with FailedState=JumpingToApp and
the readiness keyword in the inner message, (c) null probe preserves
legacy fast-complete path. Full suite 893/895 (2 skipped require
live hardware).
@cptkoolbeenz cptkoolbeenz requested a review from a team as a code owner May 12, 2026 05:12
@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add post-reconnect application readiness probe for PIC32 firmware updates

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add optional post-reconnect readiness probe to FirmwareUpdateService
• Probe waits for PIC32 application firmware readiness before completing update
• Timeout transitions update to Failed state instead of silent completion
• Preserve legacy behavior when probe is null (opt-in feature)
Diagram
flowchart LR
  A["Serial Reconnect"] --> B{"Readiness Probe<br/>Configured?"}
  B -->|Yes| C["Poll Probe<br/>with Retry"]
  B -->|No| D["Complete<br/>Legacy Path"]
  C -->|Success| E["Complete"]
  C -->|Timeout| F["Failed<br/>with TimeoutException"]
Loading

Grey Divider

File Changes

1. src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs ✨ Enhancement +32/-0

Add readiness probe configuration options

• Add PostReconnectReadinessProbe property: optional callback returning true when device is ready
• Add PostReconnectReadinessTimeout property with 30s default timeout
• Add PostReconnectReadinessRetryDelay property with 500ms default retry delay
• Add validation for new timeout and retry delay properties

src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs


2. src/Daqifi.Core/Firmware/FirmwareUpdateService.cs ✨ Enhancement +80/-0

Implement application readiness wait logic

• Invoke readiness probe after serial reconnect if configured
• Implement WaitForApplicationReadyAsync method with retry loop and timeout handling
• Poll probe with configurable delay between attempts
• Throw TimeoutException with descriptive message when probe never returns true
• Handle probe exceptions gracefully by treating as not-ready and retrying
• Log debug messages for probe attempts and success

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs


3. src/Daqifi.Core.Tests/Firmware/FirmwareUpdateServiceTests.cs 🧪 Tests +157/-0

Add three comprehensive readiness probe tests

• Add test verifying probe succeeds on attempt N>1 and holds back Complete state
• Add test verifying timeout fails update with proper exception wrapping
• Add test verifying null probe preserves legacy fast-complete behavior
• All tests validate probe call counts, state transitions, and exception details

src/Daqifi.Core.Tests/Firmware/FirmwareUpdateServiceTests.cs


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (1)

Grey Divider


Action required

1. Readiness probe is optional 📎 Requirement gap ☼ Reliability
Description
FirmwareUpdateService only waits for application readiness when PostReconnectReadinessProbe is
set; otherwise it proceeds immediately after serial reconnect, which can still return a half-started
device. This keeps the failure mode the checklist is trying to eliminate and can still force
downstream integrations to add their own readiness polling.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R817-820]

+        if (_options.PostReconnectReadinessProbe is { } probe)
+        {
+            await WaitForApplicationReadyAsync(device, probe, cancellationToken).ConfigureAwait(false);
+        }
Evidence
The checklist requires the flow to confirm application readiness after PIC32 reconnect and to remove
the need for downstream status-wait loops. The new code explicitly makes the readiness check opt-in
(if (_options.PostReconnectReadinessProbe is { } probe)), and the options explicitly
document/encode null as the legacy fast-complete path, meaning reconnect can still be treated as
complete without readiness verification.

Wait for application readiness after PIC32 reconnect in FirmwareUpdateService
Downstream callers should not need custom status-wait loops for firmware update reliability
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[810-820]
src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[124-140]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The post-reconnect readiness wait is currently opt-in (`PostReconnectReadinessProbe` is nullable and gated), so the firmware update flow can still proceed immediately after reconnect without confirming the application is ready.

## Issue Context
Checklist IDs 1 and 2 require Core’s firmware update lifecycle to handle post-reconnect application readiness so reconnect completion isn’t based solely on transport reopening, and downstream callers don’t need their own readiness loops.

## Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[810-820]
- src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[124-150]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Non-monotonic elapsed logging 🐞 Bug ◔ Observability
Description
WaitForApplicationReadyAsync computes the logged readiness wait duration using DateTime.UtcNow, so
if the system clock is adjusted during the wait, the emitted {Elapsed} can jump or be inaccurate,
confusing troubleshooting/telemetry. This does not break the timeout logic (CTS handles that), but
it can produce misleading diagnostics for slow-boot scenarios.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R892-896]

+                    var elapsed = DateTime.UtcNow - waitStart;
+                    _logger.LogInformation(
+                        "Device became application-ready after {Elapsed} on probe attempt {Attempt}.",
+                        elapsed,
+                        probesExecuted);
Evidence
The new code uses wall-clock timestamps to compute an elapsed duration for logging; elsewhere in the
repo, elapsed durations are measured with Stopwatch, indicating the preferred monotonic approach for
timing/elapsed reporting.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[839-840]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[892-896]
src/Daqifi.Core/Firmware/ProcessExternalProcessRunner.cs[30-33]
src/Daqifi.Core/Device/DaqifiStreamingDevice.cs[639-640]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`WaitForApplicationReadyAsync` measures elapsed time for logging via `DateTime.UtcNow`, which is wall-clock time and not monotonic. If NTP/VM clock corrections occur during the wait, the logged `Elapsed` can be misleading.

### Issue Context
Timeout enforcement is already handled via `CancellationTokenSource(totalTimeout)`; the improvement is specifically for *diagnostic accuracy* in the Information log emitted when readiness is reached.

### Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[839-840]
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[892-896]

### Suggested change
- Replace `var waitStart = DateTime.UtcNow;` with a monotonic timer, e.g. `var sw = System.Diagnostics.Stopwatch.StartNew();`
- Replace `var elapsed = DateTime.UtcNow - waitStart;` with `var elapsed = sw.Elapsed;`
- (Optional) add `using System.Diagnostics;` at the top, or keep the fully-qualified `System.Diagnostics.Stopwatch` usage.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Stale probe exception kept ✓ Resolved 🐞 Bug ≡ Correctness
Description
WaitForApplicationReadyAsync retains lastProbeException from an earlier thrown probe attempt even
after later probes successfully run and return false, so a final TimeoutException can incorrectly
include an unrelated InnerException and mislead debugging/handlers inspecting InnerException.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R870-885]

+                var isReady = await probe(device, linkedToken)
+                    .WaitAsync(linkedToken)
+                    .ConfigureAwait(false);
+
+                if (isReady)
+                {
+                    if (probesExecuted > 1)
+                    {
+                        _logger.LogDebug(
+                            "Device became application-ready on probe attempt {Attempt}.",
+                            probesExecuted);
+                    }
+                    return;
+                }
+                _logger.LogDebug("Application-ready probe returned false on attempt {Attempt}; will retry.", probesExecuted);
+            }
Evidence
The code only sets lastProbeException when the probe throws/cancels, but it is used as the
TimeoutException inner exception in multiple timeout paths; since it is never cleared on a
successful probe returning false, a later timeout can incorrectly surface an earlier exception that
is no longer representative of the most recent probe outcomes.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[842-860]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[864-885]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[903-916]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[924-928]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`WaitForApplicationReadyAsync` stores the most recent probe-thrown exception in `lastProbeException` and then attaches it as `InnerException` when timing out. However, when a probe invocation *succeeds* (i.e., returns `false`), the code does not clear `lastProbeException`. This can cause a timeout after many later `false` results to still report an earlier, unrelated exception as the timeout’s `InnerException`.

### Issue Context
This impacts the new post-reconnect readiness probe loop and makes timeout failures misleading (and potentially mis-handled) when callers look at `TimeoutException.InnerException`.

### Fix
Clear `lastProbeException` after a successful probe invocation (both `true` and `false` paths), or alternatively clear it immediately before each probe attempt and only set it in the catch blocks.

### Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[864-885]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Misleading probe-timeout message ✓ Resolved 🐞 Bug ◔ Observability
Description
On readiness timeout while awaiting the probe, the thrown TimeoutException claims the probe “was
canceled by the timeout while running,” but the code comments explicitly note the probe may ignore
cancellation (meaning only the await is canceled). This can mislead troubleshooting by implying the
probe cooperated with cancellation when it may still be executing in the background.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R887-888]

+                    "The readiness probe was canceled by the timeout while running.",
+                    lastProbeException);
Evidence
The comment documents that WaitAsync(linkedToken) is used so the caller stops waiting even if the
probe ignores cancellation, but the timeout message states the probe itself was canceled, which is
not guaranteed by this mechanism.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[862-869]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[885-888]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`WaitForApplicationReadyAsync` throws a `TimeoutException` whose message says the readiness probe "was canceled by the timeout while running." However, the surrounding code/commentary makes it clear the probe may ignore cancellation and continue running; the timeout is enforced by canceling the *wait*, not necessarily the probe itself.

This message should be reworded to avoid stating that the probe itself was canceled.

### Issue Context
The code uses `WaitAsync(linkedToken)` specifically to stop awaiting even if the probe ignores the cancellation token.

### Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[883-888]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (3)
5. Transport-specific readiness timeout text ✓ Resolved 📎 Requirement gap ⚙ Maintainability
Description
The new TimeoutException messages in WaitForApplicationReadyAsync hardcode serial/PIC32 wording
(e.g., "serial transport reopened" / "after PIC32 reconnect"), which is misleading for non-serial
transports and undermines a transport-agnostic readiness design.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R856-857]

+                    "The serial transport reopened but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.",
+                    lastProbeException);
Evidence
PR Compliance ID 3 requires a transport-agnostic readiness design; the new timeout messages
explicitly reference serial transport and PIC32 reconnect, which couples the readiness path’s
diagnostics to a specific transport/device.

Maintain transport-agnostic readiness verification design
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[856-857]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[906-907]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The post-reconnect readiness timeout exceptions include transport- and device-specific phrasing (serial/PIC32). This weakens the transport-agnostic readiness concept and can confuse callers on other transports.

## Issue Context
Compliance requires the readiness mechanism to remain transport-agnostic where possible. Exception text is part of the public surface (diagnostics) and should avoid implying a specific transport.

## Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[856-857]
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[906-907]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


6. Probe errors lose root cause ✓ Resolved 🐞 Bug ◔ Observability
Description
WaitForApplicationReadyAsync swallows all non-cancellation exceptions from the caller-provided probe
and continues retrying, so repeated deterministic probe failures can end up reported only as a
TimeoutException without retaining the underlying exception that caused the probe to fail. If Debug
logs are off, the real error context is effectively lost.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R890-896]

+            catch (Exception ex) when (ex is not OperationCanceledException)
+            {
+                _logger.LogDebug(
+                    ex,
+                    "Application-ready probe threw on attempt {Attempt}; treating as not-ready and retrying.",
+                    attempt);
+            }
Evidence
The probe exception is explicitly caught and treated as not-ready; later timeout exceptions are
thrown without carrying forward that exception context.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[851-896]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[846-849]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[884-889]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`WaitForApplicationReadyAsync` catches all non-`OperationCanceledException` exceptions from `probe(...)`, logs them at Debug, and retries. On eventual timeout, the thrown `TimeoutException` does not include the last probe exception as an inner exception or in its message.

This can make failures hard to diagnose in environments where Debug logs are not collected.

### Issue Context
This is specifically about improving diagnostics without changing the retry semantics (still treat probe failures as "not ready").

### Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[823-907]

### Suggested fix
Track the most recent probe exception (e.g., `Exception? lastProbeException`) when catching exceptions from the probe. When throwing `TimeoutException` due to readiness timeout (in all timeout branches), include `lastProbeException` as the inner exception (or append a short summary to the message).

Optionally, consider logging the first probe exception at `LogWarning` (once) and subsequent ones at `LogDebug` to avoid losing signal while still preventing log spam.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


7. Conflicting timeout budgets ✓ Resolved 🐞 Bug ☼ Reliability
Description
The readiness probe enforces its own PostReconnectReadinessTimeout, but the whole reconnect step is
also bounded by JumpingToApplicationTimeout via ExecuteWithStateTimeoutAsync. If
JumpingToApplicationTimeout is shorter (or mostly consumed by serial reconnect), the probe can be
aborted by the outer state-timeout and callers will see a generic JumpingToApp timeout instead of
the readiness-specific timeout message.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R828-834]

+        var totalTimeout = _options.PostReconnectReadinessTimeout;
+        var retryDelay = _options.PostReconnectReadinessRetryDelay;
+
+        using var timeoutCts = new CancellationTokenSource(totalTimeout);
+        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
+            cancellationToken, timeoutCts.Token);
+        var linkedToken = linkedCts.Token;
Evidence
The JumpToApp step is wrapped by a state-level timeout, while the new readiness probe adds another
independent timeout; whichever expires first determines the surfaced failure mode/message.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[294-302]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[799-907]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[973-1037]
src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[55-59]
src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[141-150]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Two independent timeouts bound the same step:
- outer state timeout: `JumpingToApplicationTimeout` (via `ExecuteWithStateTimeoutAsync(JumpingToApp, ...)`)
- inner probe timeout: `PostReconnectReadinessTimeout` (via `new CancellationTokenSource(totalTimeout)`)

If the outer timeout is shorter than (or effectively shorter than) the inner timeout, the readiness probe can be cut off by the state timeout first, yielding a generic state-timeout error and masking the readiness-specific failure.

### Issue Context
The new probe timeout is created inside `WaitForApplicationReadyAsync`, but `JumpToApplicationAndReconnectAsync` is invoked under the JumpingToApp state timeout.

### Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[172-231]
- src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[55-59]
- src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[141-150]

### Suggested fix
In `FirmwareUpdateServiceOptions.Validate()` add a guard when `PostReconnectReadinessProbe != null` to ensure `JumpingToApplicationTimeout` is strictly greater than `PostReconnectReadinessTimeout` (and ideally includes additional headroom for the serial reconnect portion, e.g. `PollInterval` or a small fixed buffer). Throw a clear `ArgumentOutOfRangeException` describing the required relationship so misconfiguration fails fast with an actionable message.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

8. Misleading probe attempt number ✓ Resolved 🐞 Bug ◔ Observability
Description
The new TimeoutException messages include "(attempt {attempt})", but attempt is incremented before
the readiness probe runs, so a timeout during the initial cancellation check can report attempt 1
even though 0 probes executed. This can mislead debugging/telemetry when diagnosing readiness
timeouts.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R855-856]

+                    $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
+                    "The transport reconnected but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.",
Evidence
The timeout message uses the loop variable attempt while the loop increments attempt before
checking cancellation and before running the probe, so the reported attempt can be ahead of the
number of probe executions in timeout-at-start scenarios.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[844-858]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`WaitForApplicationReadyAsync` includes an "attempt" counter in TimeoutException messages, but the counter is incremented before the probe is actually executed. In the edge case where the timeout has already fired (or fires right before the first probe), the error message can claim “attempt 1” even though no probe ran, which is misleading for troubleshooting.

### Issue Context
The timeout message now interpolates `attempt` (e.g., `(... attempt {attempt})`). The loop increments `attempt` at the top, and only later invokes the probe.

### Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[844-858]

### Suggested fix
Track *probes executed* separately (increment immediately before/after invoking the probe) and report that number in the timeout messages, or move the attempt increment to just before the probe call so the reported attempt always corresponds to a probe execution.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


9. Readiness wait not surfaced ✓ Resolved 🐞 Bug ◔ Observability
Description
When PostReconnectReadinessProbe is enabled, long waits (up to 30s by default) occur inside
JumpingToApp without any progress/state notification beyond Debug logs, making the update appear
hung to observers relying on progress events. This reduces operational visibility during the new
wait phase.
Code

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[R817-820]

+        if (_options.PostReconnectReadinessProbe is { } probe)
+        {
+            await WaitForApplicationReadyAsync(device, probe, cancellationToken).ConfigureAwait(false);
+        }
Evidence
The JumpingToApp progress is reported before calling JumpToApplicationAndReconnectAsync, but the
readiness wait adds additional wall-clock time without additional progress reporting or info-level
logging.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[294-304]
src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[799-907]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The readiness-wait loop can take significant time, but it does not emit any info-level log or progress update indicating that the service is intentionally waiting for application readiness.

### Issue Context
`TransitionToState` does not emit a `StateChanged` event when the state stays the same, and `WaitForApplicationReadyAsync` does not call `ReportProgress`, so observers may not see any visible signal during the wait.

### Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[799-821]
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[823-907]

### Suggested fix
Add a one-time `LogInformation` (or `LogDebug` if preferred, but once) when entering the readiness wait and when it succeeds (including elapsed time / attempt count). If you want this to be visible to progress listeners, consider plumbing `progress` into `JumpToApplicationAndReconnectAsync` so it can `ReportProgress` with an updated `CurrentOperation` like "Waiting for application readiness…".

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Code Review by Qodo

Grey Divider

New Review Started

This review has been superseded by a new analysis

Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

PR Code Suggestions ✨

Latest suggestions up to 26b19fe

Warning

/improve is deprecated. Use /agentic_review instead (removal date not yet scheduled).

CategorySuggestion                                                                                                                                    Impact
General
Use monotonic elapsed timing

Replace DateTime.UtcNow with Stopwatch for elapsed time calculations.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs [34-93]

-var waitStart = DateTime.UtcNow;
+var waitStopwatch = Stopwatch.StartNew();
 ...
 if (isReady)
 {
-    var elapsed = DateTime.UtcNow - waitStart;
+    var elapsed = waitStopwatch.Elapsed;
     _logger.LogInformation(
         "Device became application-ready after {Elapsed} on probe attempt {Attempt}.",
         elapsed,
         probesExecuted);
     return;
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 5

__

Why: Using Stopwatch is a best practice for elapsed time measurement as it avoids inaccuracies and potential negative durations caused by system clock adjustments.

Low
Possible issue
Preserve correct timeout causality

Do not include the OperationCanceledException as an inner exception for
TimeoutException when the wait deadline fires.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs [96-118]

 catch (OperationCanceledException ex) when (!cancellationToken.IsCancellationRequested)
 {
     // Two cases reach here:
     // 1. The wait deadline fired (timeoutCts canceled) — surface
     //    as TimeoutException so callers see the readiness budget.
     // 2. The probe itself threw OperationCanceledException for
     //    some unrelated reason (its own internal CTS, etc). That
     //    must NOT crash the update loop — treat it as a probe
     //    failure and retry, same as any other thrown exception.
     if (timeoutCts.IsCancellationRequested)
     {
         throw new TimeoutException(
             $"Device did not become application-ready within {totalTimeout} (probes executed: {probesExecuted}). " +
             "The wait for the readiness probe was canceled by the timeout — note the probe may ignore cancellation and continue running in the background.",
-            lastProbeException ?? ex);
+            lastProbeException);
     }
 
     lastProbeException = ex;
     _logger.LogDebug(
         ex,
         "Application-ready probe was canceled on attempt {Attempt}; treating as not-ready and retrying.",
         probesExecuted);
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 5

__

Why: Omitting the cancellation exception as the inner exception of the TimeoutException provides cleaner exception chaining and ensures consistency with similar timeout handling at the beginning of the loop.

Low
  • Update

Previous suggestions

Suggestions up to commit ea82fd1
CategorySuggestion                                                                                                                                    Impact
General
Fix probe-attempt accounting

Modify the probe attempt counter to increment only after the probe task is
established. Use a local variable to update the counter just before awaiting the
task.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs [42-67]

 var probesExecuted = 0;
 while (true)
 {
     try
     {
         linkedToken.ThrowIfCancellationRequested();
     }
     catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
     {
         throw new TimeoutException(
             $"Device did not become application-ready within {totalTimeout} (probes executed: {probesExecuted}). " +
             "The transport reconnected but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.",
             lastProbeException);
     }
 
+    var attempt = probesExecuted + 1;
     try
     {
-        probesExecuted++;
+        var probeTask = probe(device, linkedToken);
+        probesExecuted = attempt;
+
         // WaitAsync(linkedToken) enforces the timeout deadline even
         // when the probe ignores its own CancellationToken and would
         // otherwise hang or return after the budget elapses. When
         // the deadline fires, WaitAsync throws OperationCanceledException
         // immediately — we don't keep waiting for the rogue probe.
-        var isReady = await probe(device, linkedToken)
+        var isReady = await probeTask
             .WaitAsync(linkedToken)
             .ConfigureAwait(false);
Suggestion importance[1-10]: 1

__

Why: The C# await mechanism synchronously evaluates the probe() call before yielding, so there is no gap where cancellation could bypass the invocation, making this change unnecessary.

Low
Possible issue
Prevent busy-loop on zero delay

Add a defensive check to clamp negative or zero retry delays to a minimum of 1
millisecond. This prevents a busy-loop scenario if the delay validation was
somehow skipped.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs [113-123]

 try
 {
-    await Task.Delay(retryDelay, linkedToken).ConfigureAwait(false);
+    var delay = retryDelay <= TimeSpan.Zero ? TimeSpan.FromMilliseconds(1) : retryDelay;
+    await Task.Delay(delay, linkedToken).ConfigureAwait(false);
 }
 catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
 {
     throw new TimeoutException(
         $"Device did not become application-ready within {totalTimeout} (probes executed: {probesExecuted}). " +
         "The transport reconnected but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.",
         lastProbeException);
 }
Suggestion importance[1-10]: 1

__

Why: The PR already adds explicit validation via ValidatePositive in FirmwareUpdateServiceOptions.Validate to ensure the retry delay is positive, making defensive clamping in the consumer redundant and an anti-pattern.

Low
✅ Suggestions up to commit 5ac58c0
CategorySuggestion                                                                                                                                    Impact
Possible issue
Make validations truly opt-in
Suggestion Impact:The commit wrapped the readiness timeout/retry-delay positive validations and the JumpingToApplicationTimeout cross-check inside an `if (PostReconnectReadinessProbe is not null)` block, making readiness validations opt-in when the probe is configured.

code diff:

-        ValidatePositive(PostReconnectReadinessTimeout, nameof(PostReconnectReadinessTimeout));
-        ValidatePositive(PostReconnectReadinessRetryDelay, nameof(PostReconnectReadinessRetryDelay));
-
-        // The whole JumpToApp step is wrapped by JumpingToApplicationTimeout,
-        // so a probe budget that meets or exceeds it would always be cut off
-        // by the outer state-timeout — surfacing a generic JumpingToApp error
-        // instead of the readiness-specific message. Reject the configuration
-        // up front so misconfigurations fail fast.
-        if (PostReconnectReadinessProbe is not null &&
-            PostReconnectReadinessTimeout >= JumpingToApplicationTimeout)
-        {
-            throw new ArgumentOutOfRangeException(
-                nameof(PostReconnectReadinessTimeout),
-                PostReconnectReadinessTimeout,
-                $"{nameof(PostReconnectReadinessTimeout)} must be strictly less than {nameof(JumpingToApplicationTimeout)} when {nameof(PostReconnectReadinessProbe)} is set, " +
-                "otherwise the outer state-timeout fires first and masks the readiness-specific timeout.");
+
+        // The readiness options only matter when a probe is configured —
+        // gate every readiness validation behind that, both the positive
+        // checks and the cross-property constraint, so callers that don't
+        // use the probe never have to think about these timeouts.
+        if (PostReconnectReadinessProbe is not null)
+        {
+            ValidatePositive(PostReconnectReadinessTimeout, nameof(PostReconnectReadinessTimeout));
+            ValidatePositive(PostReconnectReadinessRetryDelay, nameof(PostReconnectReadinessRetryDelay));
+
+            // The whole JumpToApp step is wrapped by JumpingToApplicationTimeout,
+            // so a probe budget that meets or exceeds it would always be cut off
+            // by the outer state-timeout — surfacing a generic JumpingToApp error
+            // instead of the readiness-specific message. Reject the configuration
+            // up front so misconfigurations fail fast.
+            if (PostReconnectReadinessTimeout >= JumpingToApplicationTimeout)
+            {
+                throw new ArgumentOutOfRangeException(
+                    nameof(PostReconnectReadinessTimeout),
+                    PostReconnectReadinessTimeout,
+                    $"{nameof(PostReconnectReadinessTimeout)} must be strictly less than {nameof(JumpingToApplicationTimeout)} when {nameof(PostReconnectReadinessProbe)} is set, " +
+                    "otherwise the outer state-timeout fires first and masks the readiness-specific timeout.");
+            }
         }

Gate the validation of PostReconnectReadinessTimeout and
PostReconnectReadinessRetryDelay behind a check for whether the readiness probe
is configured.

src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs [52-68]

-ValidatePositive(PostReconnectReadinessTimeout, nameof(PostReconnectReadinessTimeout));
-ValidatePositive(PostReconnectReadinessRetryDelay, nameof(PostReconnectReadinessRetryDelay));
+if (PostReconnectReadinessProbe is not null)
+{
+    ValidatePositive(PostReconnectReadinessTimeout, nameof(PostReconnectReadinessTimeout));
+    ValidatePositive(PostReconnectReadinessRetryDelay, nameof(PostReconnectReadinessRetryDelay));
 
-// The whole JumpToApp step is wrapped by JumpingToApplicationTimeout,
-// so a probe budget that meets or exceeds it would always be cut off
-// by the outer state-timeout — surfacing a generic JumpingToApp error
-// instead of the readiness-specific message. Reject the configuration
-// up front so misconfigurations fail fast.
-if (PostReconnectReadinessProbe is not null &&
-    PostReconnectReadinessTimeout >= JumpingToApplicationTimeout)
-{
-    throw new ArgumentOutOfRangeException(
-        nameof(PostReconnectReadinessTimeout),
-        PostReconnectReadinessTimeout,
-        $"{nameof(PostReconnectReadinessTimeout)} must be strictly less than {nameof(JumpingToApplicationTimeout)} when {nameof(PostReconnectReadinessProbe)} is set, " +
-        "otherwise the outer state-timeout fires first and masks the readiness-specific timeout.");
+    // The whole JumpToApp step is wrapped by JumpingToApplicationTimeout,
+    // so a probe budget that meets or exceeds it would always be cut off
+    // by the outer state-timeout — surfacing a generic JumpingToApp error
+    // instead of the readiness-specific message. Reject the configuration
+    // up front so misconfigurations fail fast.
+    if (PostReconnectReadinessTimeout >= JumpingToApplicationTimeout)
+    {
+        throw new ArgumentOutOfRangeException(
+            nameof(PostReconnectReadinessTimeout),
+            PostReconnectReadinessTimeout,
+            $"{nameof(PostReconnectReadinessTimeout)} must be strictly less than {nameof(JumpingToApplicationTimeout)} when {nameof(PostReconnectReadinessProbe)} is set, " +
+            "otherwise the outer state-timeout fires first and masks the readiness-specific timeout.");
+    }
 }
Suggestion importance[1-10]: 5

__

Why: While newly added properties won't automatically break existing callers, it is good practice to only validate options when the related feature (the probe) is actually enabled.

Low
Suggestions up to commit 5ac58c0
CategorySuggestion                                                                                                                                    Impact
Possible issue
Track attempts and observe probe tasks

Declare the probe task as a local variable before awaiting it to allow observing
potential unobserved task exceptions.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs [59-67]

+var probeTask = probe(device, linkedToken);
 probesExecuted++;
+
 // WaitAsync(linkedToken) enforces the timeout deadline even
 // when the probe ignores its own CancellationToken and would
 // otherwise hang or return after the budget elapses. When
 // the deadline fires, WaitAsync throws OperationCanceledException
 // immediately — we don't keep waiting for the rogue probe.
-var isReady = await probe(device, linkedToken)
+var isReady = await probeTask
     .WaitAsync(linkedToken)
     .ConfigureAwait(false);
Suggestion importance[1-10]: 2

__

Why: The suggestion separates the task declaration but fails to observe the exception itself, and any variable declared here would be out of scope in the subsequent catch block.

Low
✅ Suggestions up to commit 881f65a
CategorySuggestion                                                                                                                                    Impact
Possible issue
Handle probe cancellations safely
Suggestion Impact:Updated the OperationCanceledException catch to capture non-caller cancellations, convert timeout-triggered cancellations into TimeoutException, and treat probe-thrown cancellations as retryable failures with logging.

code diff:

-            catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
-            {
-                throw new TimeoutException(
-                    $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
-                    "The wait for the readiness probe was canceled by the timeout — note the probe may ignore cancellation and continue running in the background.",
-                    lastProbeException);
+            catch (OperationCanceledException ex) when (!cancellationToken.IsCancellationRequested)
+            {
+                // Two cases reach here:
+                // 1. The wait deadline fired (timeoutCts canceled) — surface
+                //    as TimeoutException so callers see the readiness budget.
+                // 2. The probe itself threw OperationCanceledException for
+                //    some unrelated reason (its own internal CTS, etc). That
+                //    must NOT crash the update loop — treat it as a probe
+                //    failure and retry, same as any other thrown exception.
+                if (timeoutCts.IsCancellationRequested)
+                {
+                    throw new TimeoutException(
+                        $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
+                        "The wait for the readiness probe was canceled by the timeout — note the probe may ignore cancellation and continue running in the background.",
+                        lastProbeException ?? ex);
+                }
+
+                lastProbeException = ex;
+                _logger.LogDebug(
+                    ex,
+                    "Application-ready probe was canceled on attempt {Attempt}; treating as not-ready and retrying.",
+                    attempt);
             }

Handle OperationCanceledException thrown independently by the probe to prevent
unhandled exceptions and allow retries.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs [55-92]

 try
 {
     // WaitAsync(linkedToken) enforces the timeout deadline even
     // when the probe ignores its own CancellationToken and would
     // otherwise hang or return after the budget elapses. When
     // the deadline fires, WaitAsync throws OperationCanceledException
     // immediately — we don't keep waiting for the rogue probe.
     var isReady = await probe(device, linkedToken)
         .WaitAsync(linkedToken)
         .ConfigureAwait(false);
 
     if (isReady)
     {
         if (attempt > 1)
         {
             _logger.LogDebug(
                 "Device became application-ready on probe attempt {Attempt}.",
                 attempt);
         }
         return;
     }
     _logger.LogDebug("Application-ready probe returned false on attempt {Attempt}; will retry.", attempt);
 }
-catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
+catch (OperationCanceledException ex) when (!cancellationToken.IsCancellationRequested)
 {
-    throw new TimeoutException(
-        $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
-        "The wait for the readiness probe was canceled by the timeout — note the probe may ignore cancellation and continue running in the background.",
-        lastProbeException);
+    if (timeoutCts.IsCancellationRequested)
+    {
+        throw new TimeoutException(
+            $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
+            "The wait for the readiness probe was canceled by the timeout — note the probe may ignore cancellation and continue running in the background.",
+            lastProbeException ?? ex);
+    }
+
+    lastProbeException = ex;
+    _logger.LogDebug(
+        ex,
+        "Application-ready probe was canceled on attempt {Attempt}; treating as not-ready and retrying.",
+        attempt);
 }
 catch (Exception ex) when (ex is not OperationCanceledException)
 {
     lastProbeException = ex;
     _logger.LogDebug(
         ex,
         "Application-ready probe threw on attempt {Attempt}; treating as not-ready and retrying.",
         attempt);
 }
Suggestion importance[1-10]: 10

__

Why: Identifies and fixes a crucial bug where a probe throwing OperationCanceledException independently would skip the existing catch blocks and crash the update loop.

High
✅ Suggestions up to commit 7dc8bdf
CategorySuggestion                                                                                                                                    Impact
General
Improve timeout message precision
Suggestion Impact:Updated TimeoutException messages to interpolate {totalTimeout} (preserving sub-second precision) rather than rounding to whole seconds; related timeout messages were also reworded while applying this change.

code diff:

                 throw new TimeoutException(
-                    $"Device did not become application-ready within {totalTimeout.TotalSeconds:F0}s after PIC32 reconnect (attempt {attempt}). " +
-                    "The serial transport reopened but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.");
+                    $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
+                    "The transport reconnected but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.",
+                    lastProbeException);
             }
 
             try
             {
-                var isReady = await probe(device, linkedToken).ConfigureAwait(false);
-
-                // Re-check the budget after the await: a probe that ignores
-                // its CancellationToken could otherwise return true after
-                // the timeout has elapsed and bypass the budget. Caller's
-                // CT propagates as OperationCanceledException out of this
-                // method (caught at the same handler below); only the
-                // timeout-CT case is reinterpreted as TimeoutException.
-                try
-                {
-                    linkedToken.ThrowIfCancellationRequested();
-                }
-                catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
-                {
-                    throw new TimeoutException(
-                        $"Device did not become application-ready within {totalTimeout.TotalSeconds:F0}s after PIC32 reconnect (attempt {attempt}). " +
-                        "The readiness probe ignored cancellation and returned after the deadline.");
-                }
+                // WaitAsync(linkedToken) enforces the timeout deadline even
+                // when the probe ignores its own CancellationToken and would
+                // otherwise hang or return after the budget elapses. When
+                // the deadline fires, WaitAsync throws OperationCanceledException
+                // immediately — we don't keep waiting for the rogue probe.
+                var isReady = await probe(device, linkedToken)
+                    .WaitAsync(linkedToken)
+                    .ConfigureAwait(false);
 
                 if (isReady)
                 {
@@ -881,14 +880,32 @@
                 }
                 _logger.LogDebug("Application-ready probe returned false on attempt {Attempt}; will retry.", attempt);
             }
-            catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
-            {
-                throw new TimeoutException(
-                    $"Device did not become application-ready within {totalTimeout.TotalSeconds:F0}s after PIC32 reconnect (attempt {attempt}). " +
-                    "The readiness probe was canceled by the timeout while running.");
+            catch (OperationCanceledException ex) when (!cancellationToken.IsCancellationRequested)
+            {
+                // Two cases reach here:
+                // 1. The wait deadline fired (timeoutCts canceled) — surface
+                //    as TimeoutException so callers see the readiness budget.
+                // 2. The probe itself threw OperationCanceledException for
+                //    some unrelated reason (its own internal CTS, etc). That
+                //    must NOT crash the update loop — treat it as a probe
+                //    failure and retry, same as any other thrown exception.
+                if (timeoutCts.IsCancellationRequested)
+                {
+                    throw new TimeoutException(
+                        $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
+                        "The wait for the readiness probe was canceled by the timeout — note the probe may ignore cancellation and continue running in the background.",
+                        lastProbeException ?? ex);
+                }
+
+                lastProbeException = ex;
+                _logger.LogDebug(
+                    ex,
+                    "Application-ready probe was canceled on attempt {Attempt}; treating as not-ready and retrying.",
+                    attempt);
             }
             catch (Exception ex) when (ex is not OperationCanceledException)
             {
+                lastProbeException = ex;
                 _logger.LogDebug(
                     ex,
                     "Application-ready probe threw on attempt {Attempt}; treating as not-ready and retrying.",
@@ -902,7 +919,9 @@
             catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
             {
                 throw new TimeoutException(
-                    $"Device did not become application-ready within {totalTimeout.TotalSeconds:F0}s after PIC32 reconnect (attempt {attempt}).");
+                    $"Device did not become application-ready within {totalTimeout} (attempt {attempt}). " +
+                    "The transport reconnected but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.",
+                    lastProbeException);
             }

Change the timeout message formatting to preserve sub-second precision instead
of rounding to whole seconds.

src/Daqifi.Core/Firmware/FirmwareUpdateService.cs [41-43]

 throw new TimeoutException(
-    $"Device did not become application-ready within {totalTimeout.TotalSeconds:F0}s after PIC32 reconnect (attempt {attempt}). " +
+    $"Device did not become application-ready within {totalTimeout} after PIC32 reconnect (attempt {attempt}). " +
     "The serial transport reopened but the readiness probe never returned true; the device may still be initializing or the firmware may have failed to start.");
Suggestion importance[1-10]: 5

__

Why: While minor, preserving sub-second precision in the timeout message improves diagnostic clarity, especially since tests and callers may use sub-second values.

Low

…ll-behaved probes

Defensive cancellation re-check after the await: a probe that ignores
its CancellationToken could otherwise return true after the timeout
elapsed and slip past the budget. The post-await check forces the
TimeoutException path even for that case.

Caller-cancellation semantics preserved: the inner try/catch only
reinterprets the timeout-CT case; OperationCanceledException from the
caller-CT propagates out unchanged via the outer catch.
@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Code Review by Qodo

Grey Divider

New Review Started

This review has been superseded by a new analysis

Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

Persistent suggestions updated to latest commit 7dc8bdf

…sion in timeout messages

Replaced {totalTimeout.TotalSeconds:F0}s with {totalTimeout} across
all 4 readiness-probe TimeoutException messages. The :F0 formatter
rounded sub-second timeouts to "0s" — including the test value of
150ms — which was unhelpful for diagnostics. {TimeSpan} formats as
hh:mm:ss.fffffff and preserves precision.
@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Persistent review updated to latest commit cff6e8f

@qodo-code-review
Copy link
Copy Markdown

Persistent suggestions updated to latest commit cff6e8f

…t enforcement

Replaced the post-await cancellation re-check from pass 2 with
Task.WaitAsync(linkedToken) on the probe call itself. WaitAsync is
the stronger primitive: when the timeout fires, it throws
OperationCanceledException immediately instead of waiting for the
rogue probe to return. The post-await re-check was defensive but
still required the probe to complete; WaitAsync short-circuits.

Skipped a redundant inline option-validation suggestion (already
covered by Validate() in FirmwareUpdateServiceOptions).
Comment on lines +817 to +820
if (_options.PostReconnectReadinessProbe is { } probe)
{
await WaitForApplicationReadyAsync(device, probe, cancellationToken).ConfigureAwait(false);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Readiness probe is optional 📎 Requirement gap ☼ Reliability

FirmwareUpdateService only waits for application readiness when PostReconnectReadinessProbe is
set; otherwise it proceeds immediately after serial reconnect, which can still return a half-started
device. This keeps the failure mode the checklist is trying to eliminate and can still force
downstream integrations to add their own readiness polling.
Agent Prompt
## Issue description
The post-reconnect readiness wait is currently opt-in (`PostReconnectReadinessProbe` is nullable and gated), so the firmware update flow can still proceed immediately after reconnect without confirming the application is ready.

## Issue Context
Checklist IDs 1 and 2 require Core’s firmware update lifecycle to handle post-reconnect application readiness so reconnect completion isn’t based solely on transport reopening, and downstream callers don’t need their own readiness loops.

## Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[810-820]
- src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[124-150]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Persistent review updated to latest commit eb28b8d

…t cause in TimeoutException

Qodo flagged that deterministic probe failures lose their root cause:
the catch swallows the probe exception, retries, and eventually
throws TimeoutException without the underlying error attached. With
Debug logs off the real cause is invisible.

Now captures the most-recent probe exception and attaches it as
InnerException to all 3 timeout-path TimeoutExceptions. Steady-state
behavior unchanged; observability improves for the failure case.

Skipped 3 other findings:
- "Readiness probe is optional" — defensible design per the issue's
  explicit "callers can provide a readiness probe" wording; making
  it always-on requires Core to know what "ready" means, which it
  doesn't.
- "Conflicting timeout budgets" — readiness timeout (30s default)
  fits inside JumpingToApplicationTimeout (45s default) by design;
  documented in the existing options docstrings.
- "Readiness wait not surfaced" — observability nice-to-have; out
  of scope for this fix.
@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Persistent review updated to latest commit 32e6f91

@qodo-code-review
Copy link
Copy Markdown

Persistent suggestions updated to latest commit 32e6f91

…licationTimeout interaction (PR #200)

Address Qodo finding 'Conflicting timeout budgets': the readiness
probe's budget runs inside the JumpingToApp state, so users who raise
the readiness timeout near/above JumpingToApplicationTimeout will see
the outer state-timeout fire first. Default values (45s state, 30s
readiness) leave headroom for reconnect; documented the interaction
explicitly so configuration mistakes are visible at the call site.

Skipping the remaining 2 findings as design choices:
- 'Readiness probe is optional' — the issue's own wording specifies
  caller-provided probe; making it always-on would require Core to
  know what 'ready' means for an arbitrary device.
- 'Readiness wait not surfaced' — observability nice-to-have; out
  of scope for this fix.
@cptkoolbeenz
Copy link
Copy Markdown
Member Author

Convergence summary (Qodo /agentic_review pass 5, after commit 89657ba):

Persistent findings + dispositions:

  1. Readiness probe is optional — keep as-is. The issue's "Possible approaches" list explicitly includes "extend the reconnect contract so callers can provide a readiness probe", which is what this PR implements. Making the probe always-on with a default would require Core to know what "ready" means for an arbitrary device, which it doesn't (the streaming-device interface has no generic ping). Callers that need the wait set the probe; callers that don't get the legacy fast-complete behavior they had before. This is the pattern desktop already uses.
  2. Conflicting timeout budgets — addressed via docs in 89657ba. The readiness budget runs inside the JumpingToApp state, so configuration that raises it above the state timeout will see the outer fire first. Default values (45s state, 30s readiness) leave ample headroom; the docstring now spells this out so misconfiguration is visible at the call site.
  3. Probe errors lose root cause ✅ resolved in 32e6f91 — TimeoutException now carries the most-recent probe exception as InnerException.
  4. Readiness wait not surfaced — observability nice-to-have. Out of scope for this fix; would require plumbing periodic progress reports during the wait, which adds complexity for a secondary concern. Filed as a follow-up if someone needs it.

Test gate: 893/895 (2 hardware skips). CI green.

Ready for review.

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Persistent review updated to latest commit ea82fd1

@qodo-code-review
Copy link
Copy Markdown

Persistent suggestions updated to latest commit ea82fd1

When the probe throws on attempt N then later attempts return false
until the readiness budget expires, the TimeoutException carried the
stale exception from attempt N as InnerException — misleading
debugging and any handler that inspects InnerException.

Clear lastProbeException whenever a probe invocation completes
normally (true OR false), so a subsequent timeout only attaches a
probe-thrown exception when it's actually the most recent outcome.
Locked in with a focused test.
@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Persistent review updated to latest commit b089763

@qodo-code-review
Copy link
Copy Markdown

PR Code Suggestions ✨

Warning

/improve is deprecated. Use /agentic_review instead (removed after 2026-05-31).

No code suggestions found for the PR.

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Persistent review updated to latest commit b089763

@qodo-code-review
Copy link
Copy Markdown

PR Code Suggestions ✨

Warning

/improve is deprecated. Use /agentic_review instead (removed after 2026-05-31).

No code suggestions found for the PR.

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@qodo-code-review
Copy link
Copy Markdown

PR Code Suggestions ✨

Warning

/improve is deprecated. Use /agentic_review instead (removed after 2026-05-31).

No code suggestions found for the PR.

Add LogInformation at WaitForApplicationReadyAsync entry and on success
(with elapsed time + attempt count) so observers tailing the log can
distinguish a deliberate readiness poll from a stuck flow. Wait can
take up to PostReconnectReadinessTimeout (default 30s); previously only
Debug-level breadcrumbs existed during that window.
@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/improve

@cptkoolbeenz
Copy link
Copy Markdown
Member Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 12, 2026

Persistent review updated to latest commit 26b19fe

@qodo-code-review
Copy link
Copy Markdown

Persistent suggestions updated to latest commit 26b19fe

tylerkron and others added 2 commits May 14, 2026 20:45
…t (PR #200 follow-up)

SerialStreamTransport.Stream returns _serialPort.BaseStream, which is
a fresh Stream instance after Disconnect() reopens the port. Previously
DaqifiDevice.Disconnect kept _messageProducer / _messageConsumer alive
with references to the old BaseStream, and Connect's "if (... == null)"
guard skipped recreation — so any Send() after a transport reconnect
wrote to the disposed stream and silently no-op'd, while the text
consumer on the new stream saw zero bytes.

Surfaced by PR #200's post-reconnect readiness probe: GetLanChipInfoAsync
returned null on every attempt because the probe's Send went to the
dead stream. Null the producer/consumer in Disconnect so Connect's
existing guard triggers fresh construction against the current Stream.

Regression test fails without this change (Send post-reconnect lands
on the previous, disposed stream rather than the rotated current one).

Note: this does not fully resolve the post-PIC32-reset readiness probe
on macOS hardware — a separate USB CDC re-enumeration race appears to
leave the new SerialPort handle unable to receive bytes despite
IsOpen=true. That's tracked separately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tylerkron
Copy link
Copy Markdown
Contributor

Hardware validation + bundled Core fix

Spent a session validating this PR on real hardware (Nq1 / /dev/cu.usbmodem101, macOS). Three flashes with different probe configurations on top of the merged main branch. Findings:

1. PR mechanism is correct end-to-end

--readiness-trivial probe ((_, _) => Task.FromResult(true)) transitioned the update through JumpingToApp → Complete in 2.3 seconds — confirming the probe→wait→state-transition pipeline works exactly as designed:

[State] Verifying -> JumpingToApp | ... | 03:13:29.9
[Probe] trivial-true
[State] JumpingToApp -> Complete | PIC32 firmware update completed. | 03:13:32.1
Firmware update completed successfully.

2. Stale-stream bug uncovered in DaqifiDevice (bundled in commit e59410e)

The natural probe (GetLanChipInfoAsync) returned null on every attempt across multiple flashes. Trace-listener instrumentation showed Collection complete at ~2000ms, 0 lines for each probe — the text consumer never saw a single byte.

Root cause: SerialStreamTransport.Stream returns _serialPort.BaseStream, which is a brand-new Stream instance after Disconnect() reopens the port. But DaqifiDevice.Disconnect() kept _messageProducer / _messageConsumer alive with references to the old BaseStream, and Connect()'s if (_messageProducer == null) guard skipped recreation. Any Send() after a transport reconnect wrote to a disposed stream and silently no-op'd.

This was latent because no existing internal flow calls SCPI on the same DaqifiStreamingDevice instance after a reconnect — desktop disposes and reconnects fresh. This PR's readiness probe is the first caller to exercise that path, so it surfaced it.

Fix (commit e59410e): null _messageConsumer and _messageProducer in Disconnect() so Connect()'s existing guard triggers fresh construction against the current Stream. Added a regression test (DaqifiDevice_DisconnectThenConnect_SendsToCurrentTransportStream) that uses a stream-rotating mock transport — fails without the fix, passes with it. Full suite: 964/964 passing.

3. Separate USB CDC re-enumeration issue — follow-up

Even with the stale-stream fix, the LAN chip-info probe still times out post-PIC32-reset on macOS. Trace evidence:

Working --lan-chip-info (fresh process) Probe post-firmware-reset
Protobuf consumer stopped at 500ms (busy reading) 0-1ms (idle / errored)
First response at 603ms ✓ (never)
Lines collected 1 0

The new SerialPort opens (IsOpen == true), my fix gives it a fresh producer/consumer bound to the new BaseStream, but no bytes flow back through that handle. A separate process opening the same /dev/cu.usbmodem101 a few seconds later gets a working handle immediately.

This is consistent with a macOS USB CDC re-enumeration race: after PIC32 jumps to application, the host sees the old device disappear and a new one appear with the same path name, but the in-process SerialPort.Open() may bind too soon to a kernel device-node that isn't fully ready. A fresh process binds cleanly.

Mitigation options for a follow-up:

  • Brief grace delay after transport.Connect() succeeds, then re-open the port
  • Probe-side: caller's PostReconnectReadinessProbe could include the close/reopen dance
  • Document the contract: "probe must tolerate the device's post-reset settling period"

Implication for this PR: mechanism is sound, ship it. The probe's value will be most apparent once the follow-up USB CDC issue is also fixed; for now, callers should be aware that on macOS post-PIC32-reset, a probe based on GetLanChipInfoAsync will likely time out even though the device firmware is actually fine.

After PIC32 firmware update jumps to application, the device's USB CDC
interface re-enumerates. On macOS, the first SerialPort.Open() to
succeed inside the re-enum window is a "shadow" handle — IsOpen==true
but writes silently drop and reads see zero bytes. Hardware-validated
on a Nyquist1 (/dev/cu.usbmodem101): the first reconnect succeeded at
~2s with UnauthorizedAccessException rejected for every preceding
attempt. Once that race-winning handle was held, subsequent SCPI
exchanges returned 0 bytes for tens of seconds; a fresh process
opening the same path got a clean handle immediately.

Add PostReconnectStaleHandleDelay option (default 2s) and a dance in
JumpToApplicationAndReconnectAsync that, after WaitForSerialReconnectAsync
succeeds, closes the port, sleeps the delay, then reconnects to obtain
a clean kernel binding. Validated end-to-end with PR #200's readiness
probe: with the dance + a probe that runs InitializeAsync to wake the
post-reset dormant device, the LAN chip-info query returns valid data
and the firmware update transitions to Complete.

Opt out by setting the option to TimeSpan.Zero (Windows, where the
first open is already clean).

Tests:
- UpdateFirmwareAsync_PostReconnectStaleHandleDelay_TriggersExtraDisconnectReconnect
  asserts the dance fires (2 disconnects + 2 reconnects vs baseline 1 each).
- UpdateFirmwareAsync_PostReconnectStaleHandleDelayZero_SkipsExtraDisconnectReconnect
  asserts the opt-out path with TimeSpan.Zero keeps the baseline counts.
- CreateFastOptions sets the delay to Zero so the unit-test suite (which
  doesn't exercise USB CDC) doesn't pay the per-test 2s settling cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tylerkron
Copy link
Copy Markdown
Contributor

Update: full hardware fix landed (commit a5ce969)

Continued hardware debugging revealed a second issue chained behind the first one. Now the readiness probe works end-to-end on macOS.

The complete failure mode

  1. PIC32 firmware update finishes → MCU jumps to application → application firmware boots dormant (LEDs off, WiFi subsystem unpowered, doesn't respond to LAN/WiFi SCPI queries). Confirmed via SYSTem:REboot test: lights go off and stay off until an explicit `SYSTem:POWer:STATe 1` is sent.
  2. PIC32 reset causes USB CDC to re-enumerate. On macOS the host SerialPort opens are rejected with `UnauthorizedAccessException` for ~2s.
  3. The first `SerialPort.Open()` to succeed after that 2s window is a shadow handle: `IsOpen==true`, but writes silently drop and reads see zero bytes. A fresh process opening the same path gets a clean handle immediately.

Two-layer fix

Core (commit a5ce969): New `FirmwareUpdateServiceOptions.PostReconnectStaleHandleDelay` (default 2s). After `WaitForSerialReconnectAsync` succeeds, the firmware service now discards that race-winning handle — `device.Disconnect()` → sleep → `device.Connect()` → clean kernel binding. Opt out with `TimeSpan.Zero` (Windows, where the first open is already clean).

Probe-side responsibility (caller pattern): The post-reset device is dormant. The caller's probe should call `device.InitializeAsync()` first (which sends `SYSTem:POWer:STATe 1` etc.) before issuing the readiness query.

Hardware proof

With both fixes in place, the readiness probe (using `GetLanChipInfoAsync` after `InitializeAsync`) succeeds end-to-end on the Nyquist1:

```
[State] Verifying -> JumpingToApp
[RECONNECT-LOOP] enter; IsConnected=False
[RECONNECT-LOOP] attempt 1-9 failed (UnauthorizedAccessException) — ~2.3s
[RECONNECT-LOOP] connected on attempt 10
[STALE-FIX] Disconnect → sleep 2s → reconnect (clean kernel binding)
[RECONNECT-LOOP] connected on attempt 1 instantly
[Probe] LAN chip-info OK (ChipId=1377184, FW=19.7.7)
[State] JumpingToApp -> Complete
Firmware update completed successfully.
```

Recommended probe shape (for callers, including daqifi-desktop)

```csharp
options.PostReconnectReadinessProbe = async (device, ct) =>
{
if (device is DaqifiStreamingDevice sd)
{
// Wake the post-reset dormant device (TurnDeviceOn + format + echo etc.)
await sd.InitializeAsync().ConfigureAwait(false);
}
if (device is not ILanChipInfoProvider lan) return true;
var info = await lan.GetLanChipInfoAsync(ct).ConfigureAwait(false);
return info != null;
};
```

Tests (suite now at 968/968 passing on net9.0 + net10.0)

  • `UpdateFirmwareAsync_PostReconnectStaleHandleDelay_TriggersExtraDisconnectReconnect` — asserts the dance fires
  • `UpdateFirmwareAsync_PostReconnectStaleHandleDelayZero_SkipsExtraDisconnectReconnect` — asserts opt-out path
  • `CreateFastOptions` test helper sets the delay to Zero so unit tests don't pay the per-test 2s settling cost

After PR #200 hardware validation revealed two ergonomic issues with
the readiness probe contract:

1. PIC32 application firmware boots dormant after the bootloader jumps
   to application (LEDs off, WiFi subsystem unpowered, won't answer
   LAN queries). Callers writing a "natural" probe like
   GetLanChipInfoAsync != null would silently fail for tens of seconds
   because the device needs SYSTem:POWer:STATe 1 (sent by
   InitializeAsync) before it'll respond. That dormant-state knowledge
   shouldn't be required of every probe author.

2. The PostReconnectStaleHandleDelay dance was logged at Debug level
   while the comparable readiness-probe wait is at Information. The
   dance is a significant operation (2+ seconds inside JumpingToApp);
   observers tailing logs should see it.

Changes in JumpToApplicationAndReconnectAsync:
- After the stale-handle dance reopens the port with a clean kernel
  binding, call InitializeAsync on the device if it's a DaqifiDevice
  (skipped for test fakes that don't extend it). Wrapped in try/catch
  with LogWarning so an init failure doesn't mask the eventual probe
  outcome — the probe stays the source of truth for "ready".
- Stale-handle dance logged at Information (was Debug).

Hardware-validated end-to-end: with this change, the example app
probe simplifies to just `GetLanChipInfoAsync != null` (no wake-up
code in the caller). Total post-reconnect time ~9s on Nyquist1
(2.3s race window + 2s stale-handle settle + ~4s init + ~1s probe).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tylerkron
Copy link
Copy Markdown
Contributor

Follow-up: auto-wake + log level (commit 3e4fd88)

Self-review after the previous commit identified two ergonomic gaps. Fixed in 3e4fd88.

1. Auto-wake the dormant device in Core

Previous comment recommended callers do `await sd.InitializeAsync()` in their probe to wake the post-reset device. That works but pushes "device is dormant after PIC32 reset" knowledge onto every probe author — a footgun. Core now handles it automatically after the stale-handle dance:

```csharp
if (device is DaqifiDevice initializableDevice)
{
_logger.LogInformation("Waking post-reset device via InitializeAsync.");
try { await initializableDevice.InitializeAsync().ConfigureAwait(false); }
catch (Exception ex)
{
_logger.LogWarning(ex, "InitializeAsync after reconnect threw; continuing to readiness probe.");
}
}
```

Wrapped in try/catch with LogWarning so an init failure doesn't mask the probe outcome — the probe stays the source of truth for "ready". Skipped for non-DaqifiDevice transports (e.g. test fakes); they're responsible for their own readiness.

Probe pattern simplifies to:

```csharp
options.PostReconnectReadinessProbe = async (device, ct) =>
{
if (device is not ILanChipInfoProvider lan) return true;
var info = await lan.GetLanChipInfoAsync(ct).ConfigureAwait(false);
return info != null;
};
```

2. Stale-handle dance log level: Debug → Information

Matches the readiness probe wait. Dance is a significant operation (2+ seconds inside JumpingToApp); observers tailing logs should see it.

Hardware re-validation

Re-flashed on the Nyquist1 with the simplified "natural" probe:

```
[State] Verifying -> JumpingToApp
[Probe] LAN chip-info OK (ChipId=1377184, FW=19.7.7)
[State] JumpingToApp -> Complete
Firmware update completed successfully.
```

Total post-reconnect time: ~9 seconds (2.3s race window + 2s stale-handle settle + ~4s init + ~1s probe).

Suite status: 968/968 passing (net9.0 + net10.0)

@tylerkron tylerkron merged commit 77c909f into main May 15, 2026
1 check passed
@tylerkron tylerkron deleted the feat/post-reconnect-readiness-probe branch May 15, 2026 04:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FirmwareUpdateService should wait for application readiness after PIC32 reconnect

2 participants