feat(firmware): wait for application readiness after PIC32 reconnect (closes #145) #200
Conversation
…loses #145)
Adds an opt-in readiness probe to FirmwareUpdateService so a PIC32 firmware update doesn't transition to Complete until the device is actually ready to answer normal application commands. The serial transport re-enumerates well before the application firmware is up; without this wait, downstream flows (LAN chip-info queries, WiFi prep) hit a half-started device and either fail or have to reimplement their own retry loop in the calling app, which is exactly the pattern desktop had to work around.
API additions on FirmwareUpdateServiceOptions:
- PostReconnectReadinessProbe: Func<IStreamingDevice, CancellationToken, Task<bool>>?. Returns true when the application is responsive. Null disables the wait (legacy behavior).
- PostReconnectReadinessTimeout (default 30s): wall-clock budget
- PostReconnectReadinessRetryDelay (default 500ms): delay between probe attempts
When the timeout elapses without the probe returning true, the update transitions to Failed with a clear TimeoutException wrapped in FirmwareUpdateException, NOT a silent Complete on a half-ready device.
Test plan: 3 new tests cover (a) probe succeeds on attempt N > 1, holding back Complete until ready, (b) timeout raises a properly wrapped FirmwareUpdateException with FailedState=JumpingToApp and the readiness keyword in the inner message, (c) null probe preserves legacy fast-complete path. Full suite 893/895 (2 skipped require live hardware).
|
/improve |
|
/agentic_review |
Review Summary by Qodo
Add post-reconnect application readiness probe for PIC32 firmware updates
Walkthroughs
Description
• Add optional post-reconnect readiness probe to FirmwareUpdateService
• Probe waits for PIC32 application firmware readiness before completing update
• Timeout transitions update to Failed state instead of silent completion
• Preserve legacy behavior when probe is null (opt-in feature)
Diagram
flowchart LR
A["Serial Reconnect"] --> B{"Readiness Probe<br/>Configured?"}
B -->|Yes| C["Poll Probe<br/>with Retry"]
B -->|No| D["Complete<br/>Legacy Path"]
C -->|Success| E["Complete"]
C -->|Timeout| F["Failed<br/>with TimeoutException"]
File Changes
1. src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs
|
Code Review by Qodo
1. Readiness probe is optional
|
PR Code Suggestions ✨
Latest suggestions up to 26b19fe
Previous suggestions
Suggestions up to commit ea82fd1
✅ Suggestions up to commit 5ac58c0
✅ Suggestions up to commit 881f65a
✅ Suggestions up to commit 7dc8bdf
|
…ll-behaved probes
Defensive cancellation re-check after the await: a probe that ignores its CancellationToken could otherwise return true after the timeout elapsed and slip past the budget. The post-await check forces the TimeoutException path even for that case.
Caller-cancellation semantics preserved: the inner try/catch only reinterprets the timeout-CT case; OperationCanceledException from the caller-CT propagates out unchanged via the outer catch.
|
/improve |
|
/agentic_review |
|
Persistent suggestions updated to latest commit 7dc8bdf |
…sion in timeout messages
Replaced {totalTimeout.TotalSeconds:F0}s with {totalTimeout} across
all 4 readiness-probe TimeoutException messages. The :F0 formatter
rounded sub-second timeouts to "0s" — including the test value of
150ms — which was unhelpful for diagnostics. {TimeSpan} formats as
hh:mm:ss.fffffff and preserves precision.
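The rounding problem described in this commit can be reproduced in isolation. The following is a minimal sketch, not code from the PR: `TimeoutMessageDemo` and its two methods are hypothetical names, and invariant culture is assumed so the output does not depend on the machine's locale.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of Daqifi.Core): contrasts the two message formats.
public static class TimeoutMessageDemo
{
    // Old style: whole seconds only, so sub-second budgets collapse to "0s".
    public static string WholeSeconds(TimeSpan timeout) =>
        string.Format(CultureInfo.InvariantCulture, "{0:F0}s", timeout.TotalSeconds);

    // New style: default TimeSpan formatting keeps the fractional part.
    public static string FullPrecision(TimeSpan timeout) =>
        string.Format(CultureInfo.InvariantCulture, "{0}", timeout);
}
```

For the 150ms test value, `WholeSeconds` yields `0s` while `FullPrecision` yields `00:00:00.1500000`, which is why the interpolated `{totalTimeout}` form is more useful in diagnostics.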
|
/improve |
|
/agentic_review |
|
Persistent review updated to latest commit cff6e8f |
|
Persistent suggestions updated to latest commit cff6e8f |
…t enforcement
Replaced the post-await cancellation re-check from pass 2 with Task.WaitAsync(linkedToken) on the probe call itself. WaitAsync is the stronger primitive: when the timeout fires, it throws OperationCanceledException immediately instead of waiting for the rogue probe to return. The post-await re-check was defensive but still required the probe to complete; WaitAsync short-circuits.
Skipped a redundant inline option-validation suggestion (already covered by Validate() in FirmwareUpdateServiceOptions).
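To illustrate why `WaitAsync` short-circuits a probe that ignores its token, here is a minimal sketch of a polling loop built on a linked timeout token. It is not the Daqifi.Core implementation: `ReadinessWaitSketch` and `WaitUntilReadyAsync` are hypothetical names, the probe takes only a token for brevity, and probe exceptions simply propagate instead of being retried. It assumes .NET 6 or later, where `Task.WaitAsync(CancellationToken)` is available.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the readiness wait; not the actual Daqifi.Core code.
public static class ReadinessWaitSketch
{
    public static async Task WaitUntilReadyAsync(
        Func<CancellationToken, Task<bool>> probe,
        TimeSpan totalTimeout,
        TimeSpan retryDelay,
        CancellationToken callerToken = default)
    {
        using var timeoutCts = new CancellationTokenSource(totalTimeout);
        using var linkedCts =
            CancellationTokenSource.CreateLinkedTokenSource(callerToken, timeoutCts.Token);
        var linkedToken = linkedCts.Token;

        try
        {
            while (true)
            {
                // WaitAsync stops awaiting as soon as linkedToken is cancelled,
                // even if the probe ignores the token and keeps running.
                if (await probe(linkedToken).WaitAsync(linkedToken).ConfigureAwait(false))
                {
                    return;
                }

                await Task.Delay(retryDelay, linkedToken).ConfigureAwait(false);
            }
        }
        catch (OperationCanceledException) when (!callerToken.IsCancellationRequested)
        {
            // Only the timeout token fired: surface it as a timeout, not a cancellation.
            throw new TimeoutException($"Application not ready within {totalTimeout}.");
        }
    }
}
```

With a probe that sleeps for several seconds and a 100ms budget, this sketch throws `TimeoutException` as soon as the budget expires rather than when the probe finally returns; cancellation from the caller's own token still propagates as `OperationCanceledException`.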
if (_options.PostReconnectReadinessProbe is { } probe)
{
    await WaitForApplicationReadyAsync(device, probe, cancellationToken).ConfigureAwait(false);
}
1. Readiness probe is optional 📎 Requirement gap ☼ Reliability
FirmwareUpdateService only waits for application readiness when PostReconnectReadinessProbe is set; otherwise it proceeds immediately after serial reconnect, which can still return a half-started device. This keeps the failure mode the checklist is trying to eliminate and can still force downstream integrations to add their own readiness polling.
Agent Prompt
## Issue description
The post-reconnect readiness wait is currently opt-in (`PostReconnectReadinessProbe` is nullable and gated), so the firmware update flow can still proceed immediately after reconnect without confirming the application is ready.
## Issue Context
Checklist IDs 1 and 2 require Core’s firmware update lifecycle to handle post-reconnect application readiness so reconnect completion isn’t based solely on transport reopening, and downstream callers don’t need their own readiness loops.
## Fix Focus Areas
- src/Daqifi.Core/Firmware/FirmwareUpdateService.cs[810-820]
- src/Daqifi.Core/Firmware/FirmwareUpdateServiceOptions.cs[124-150]
|
/agentic_review |
|
Persistent review updated to latest commit eb28b8d |
…t cause in TimeoutException
Qodo flagged that deterministic probe failures lose their root cause: the catch swallows the probe exception, retries, and eventually throws TimeoutException without the underlying error attached. With Debug logs off the real cause is invisible.
Now captures the most-recent probe exception and attaches it as InnerException to all 3 timeout-path TimeoutExceptions. Steady-state behavior unchanged; observability improves for the failure case.
Skipped 3 other findings:
- "Readiness probe is optional": defensible design per the issue's explicit "callers can provide a readiness probe" wording; making it always-on requires Core to know what "ready" means, which it doesn't.
- "Conflicting timeout budgets": readiness timeout (30s default) fits inside JumpingToApplicationTimeout (45s default) by design; documented in the existing options docstrings.
- "Readiness wait not surfaced": observability nice-to-have; out of scope for this fix.
|
/improve |
|
/agentic_review |
|
Persistent review updated to latest commit 32e6f91 |
|
Persistent suggestions updated to latest commit 32e6f91 |
…licationTimeout interaction (PR #200)
Address Qodo finding 'Conflicting timeout budgets': the readiness probe's budget runs inside the JumpingToApp state, so users who raise the readiness timeout near/above JumpingToApplicationTimeout will see the outer state-timeout fire first. Default values (45s state, 30s readiness) leave headroom for reconnect; documented the interaction explicitly so configuration mistakes are visible at the call site.
Skipping the remaining 2 findings as design choices:
- 'Readiness probe is optional': the issue's own wording specifies caller-provided probe; making it always-on would require Core to know what 'ready' means for an arbitrary device.
- 'Readiness wait not surfaced': observability nice-to-have; out of scope for this fix.
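The budget interaction can be expressed as a simple inequality. The sketch below is purely illustrative: the commit documents the interaction rather than adding a check like this, and `TimeoutBudgetSketch`, `ReadinessFitsInsideState`, and the `reconnectAllowance` parameter are all hypothetical.

```csharp
using System;

// Hypothetical helper; illustrates the documented interaction only.
public static class TimeoutBudgetSketch
{
    // True when the readiness budget, plus the time allowed for the serial
    // reconnect that precedes it, still ends before the state timeout fires.
    public static bool ReadinessFitsInsideState(
        TimeSpan stateTimeout,
        TimeSpan readinessTimeout,
        TimeSpan reconnectAllowance) =>
        readinessTimeout + reconnectAllowance < stateTimeout;
}
```

With the defaults quoted above (45s state, 30s readiness) and an assumed 10s reconnect allowance, the check passes; raising the readiness timeout to 45s makes it fail, which is the configuration mistake the commit warns about.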
|
Convergence summary (Qodo /agentic_review pass 5, after commit 89657ba): Persistent findings + dispositions:
Test gate: 893/895 (2 hardware skips). CI green. Ready for review. |
|
/improve |
|
/agentic_review |
|
Persistent review updated to latest commit ea82fd1 |
|
Persistent suggestions updated to latest commit ea82fd1 |
When the probe throws on attempt N then later attempts return false until the readiness budget expires, the TimeoutException carried the stale exception from attempt N as InnerException — misleading debugging and any handler that inspects InnerException. Clear lastProbeException whenever a probe invocation completes normally (true OR false), so a subsequent timeout only attaches a probe-thrown exception when it's actually the most recent outcome. Locked in with a focused test.
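The bookkeeping described in this commit can be sketched as follows. This is a simplified illustration, not the Daqifi.Core source: `LastProbeExceptionSketch` is a hypothetical name, and an attempt count stands in for the wall-clock readiness budget to keep the example short.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the InnerException bookkeeping only.
public static class LastProbeExceptionSketch
{
    public static async Task WaitUntilReadyAsync(
        Func<CancellationToken, Task<bool>> probe,
        int maxAttempts,
        TimeSpan retryDelay)
    {
        Exception? lastProbeException = null;

        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                var ready = await probe(CancellationToken.None).ConfigureAwait(false);

                // The probe completed normally (true or false), so any
                // exception recorded from an earlier attempt is now stale.
                lastProbeException = null;

                if (ready)
                {
                    return;
                }
            }
            catch (Exception ex)
            {
                // Remember the failure in case it is the final outcome.
                lastProbeException = ex;
            }

            await Task.Delay(retryDelay).ConfigureAwait(false);
        }

        throw new TimeoutException(
            "Application readiness probe did not succeed.", lastProbeException);
    }
}
```

If the probe throws once and then returns false until the attempts run out, the resulting `TimeoutException` has a null `InnerException`; if the probe throws on every attempt, the last thrown exception is attached.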
|
/improve |
|
/agentic_review |
|
Persistent review updated to latest commit b089763 |
PR Code Suggestions ✨
Warning: No code suggestions found for the PR. |
Add LogInformation at WaitForApplicationReadyAsync entry and on success (with elapsed time + attempt count) so observers tailing the log can distinguish a deliberate readiness poll from a stuck flow. Wait can take up to PostReconnectReadinessTimeout (default 30s); previously only Debug-level breadcrumbs existed during that window.
|
/improve |
|
/agentic_review |
|
Persistent review updated to latest commit 26b19fe |
|
Persistent suggestions updated to latest commit 26b19fe |
…t (PR #200 follow-up)
SerialStreamTransport.Stream returns _serialPort.BaseStream, which is a fresh Stream instance after Disconnect() reopens the port. Previously DaqifiDevice.Disconnect kept _messageProducer / _messageConsumer alive with references to the old BaseStream, and Connect's "if (... == null)" guard skipped recreation, so any Send() after a transport reconnect wrote to the disposed stream and silently no-op'd, while the text consumer on the new stream saw zero bytes.
Surfaced by PR #200's post-reconnect readiness probe: GetLanChipInfoAsync returned null on every attempt because the probe's Send went to the dead stream.
Null the producer/consumer in Disconnect so Connect's existing guard triggers fresh construction against the current Stream. Regression test fails without this change (Send post-reconnect lands on the previous, disposed stream rather than the rotated current one).
Note: this does not fully resolve the post-PIC32-reset readiness probe on macOS hardware; a separate USB CDC re-enumeration race appears to leave the new SerialPort handle unable to receive bytes despite IsOpen=true. That's tracked separately.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
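The stale-stream fix can be illustrated with stand-in types. The sketch below is an assumption-laden miniature, not the real classes: `FakeRotatingTransport`, `FakeProducer`, and `FakeDevice` are hypothetical, and a `MemoryStream` plays the role of the serial port's `BaseStream`.

```csharp
#nullable enable
using System;
using System.IO;

// Hypothetical stand-ins; the real types are SerialStreamTransport and DaqifiDevice.
public sealed class FakeRotatingTransport
{
    public MemoryStream? Stream { get; private set; }

    // Every open hands out a brand-new stream instance.
    public void Connect() => Stream = new MemoryStream();

    public void Disconnect()
    {
        Stream?.Dispose();
        Stream = null;
    }
}

public sealed class FakeProducer
{
    private readonly Stream _stream;
    public FakeProducer(Stream stream) => _stream = stream;
    public void Send(byte value) => _stream.WriteByte(value);
}

public sealed class FakeDevice
{
    private readonly FakeRotatingTransport _transport;
    private FakeProducer? _producer;

    public FakeDevice(FakeRotatingTransport transport) => _transport = transport;

    public void Connect()
    {
        _transport.Connect();

        // Existing guard: only builds a producer when none is cached.
        if (_producer == null)
        {
            _producer = new FakeProducer(_transport.Stream!);
        }
    }

    public void Disconnect()
    {
        _transport.Disconnect();

        // The fix: drop the cached producer so the next Connect
        // rebuilds it against the transport's current stream.
        _producer = null;
    }

    public void Send(byte value) => _producer?.Send(value);
}
```

After a connect, disconnect, and second connect, a `Send` in this sketch lands on the transport's current stream. Without the `_producer = null` line, the cached producer would still point at the first, disposed stream.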
Hardware validation + bundled Core fix
Spent a session validating this PR on real hardware (Nq1 /
1. PR mechanism is correct end-to-end
2. Stale-stream bug uncovered in
| | Working --lan-chip-info (fresh process) | Probe post-firmware-reset |
|---|---|---|
| Protobuf consumer stopped at | 500ms (busy reading) | 0-1ms (idle / errored) |
| First response at | 603ms ✓ | (never) |
| Lines collected | 1 | 0 |
The new SerialPort opens (IsOpen == true), my fix gives it a fresh producer/consumer bound to the new BaseStream, but no bytes flow back through that handle. A separate process opening the same /dev/cu.usbmodem101 a few seconds later gets a working handle immediately.
This is consistent with a macOS USB CDC re-enumeration race: after PIC32 jumps to application, the host sees the old device disappear and a new one appear with the same path name, but the in-process SerialPort.Open() may bind too soon to a kernel device-node that isn't fully ready. A fresh process binds cleanly.
Mitigation options for a follow-up:
- Brief grace delay after transport.Connect() succeeds, then re-open the port
- Probe-side: caller's PostReconnectReadinessProbe could include the close/reopen dance
- Document the contract: "probe must tolerate the device's post-reset settling period"
Implication for this PR: mechanism is sound, ship it. The probe's value will be most apparent once the follow-up USB CDC issue is also fixed; for now, callers should be aware that on macOS post-PIC32-reset, a probe based on GetLanChipInfoAsync will likely time out even though the device firmware is actually fine.
After PIC32 firmware update jumps to application, the device's USB CDC interface re-enumerates. On macOS, the first SerialPort.Open() to succeed inside the re-enum window is a "shadow" handle: IsOpen==true, but writes silently drop and reads see zero bytes.
Hardware-validated on a Nyquist1 (/dev/cu.usbmodem101): the first reconnect succeeded at ~2s, after every preceding attempt was rejected with UnauthorizedAccessException. Once that race-winning handle was held, subsequent SCPI exchanges returned 0 bytes for tens of seconds; a fresh process opening the same path got a clean handle immediately.
Add PostReconnectStaleHandleDelay option (default 2s) and a dance in JumpToApplicationAndReconnectAsync that, after WaitForSerialReconnectAsync succeeds, closes the port, sleeps the delay, then reconnects to obtain a clean kernel binding.
Validated end-to-end with PR #200's readiness probe: with the dance + a probe that runs InitializeAsync to wake the post-reset dormant device, the LAN chip-info query returns valid data and the firmware update transitions to Complete.
Opt out by setting the option to TimeSpan.Zero (Windows, where the first open is already clean).
Tests:
- UpdateFirmwareAsync_PostReconnectStaleHandleDelay_TriggersExtraDisconnectReconnect asserts the dance fires (2 disconnects + 2 reconnects vs baseline 1 each).
- UpdateFirmwareAsync_PostReconnectStaleHandleDelayZero_SkipsExtraDisconnectReconnect asserts the opt-out path with TimeSpan.Zero keeps the baseline counts.
- CreateFastOptions sets the delay to Zero so the unit-test suite (which doesn't exercise USB CDC) doesn't pay the per-test 2s settling cost.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
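The close, sleep, reopen sequence can be sketched independently of the serial stack. The following is a hedged illustration rather than the code in JumpToApplicationAndReconnectAsync: `StaleHandleDanceSketch` and `ReopenAfterSettleAsync` are hypothetical names, and the device's disconnect and connect operations are passed in as plain delegates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the stale-handle dance; not the Daqifi.Core source.
public static class StaleHandleDanceSketch
{
    // Discards the first post-reset handle and reopens after a settling delay.
    // A zero (or negative) delay opts out and leaves the existing handle alone.
    public static async Task ReopenAfterSettleAsync(
        Action disconnect,
        Action connect,
        TimeSpan staleHandleDelay,
        CancellationToken cancellationToken = default)
    {
        if (staleHandleDelay <= TimeSpan.Zero)
        {
            return;
        }

        disconnect();
        await Task.Delay(staleHandleDelay, cancellationToken).ConfigureAwait(false);
        connect();
    }
}
```

Counting invocations of the two delegates mirrors what the unit tests above assert: one extra disconnect and one extra connect when the delay is positive, and none when it is `TimeSpan.Zero`.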
Update: full hardware fix landed (commit a5ce969)
Continued hardware debugging revealed a second issue chained behind the first one. Now the readiness probe works end-to-end on macOS.
The complete failure mode
Two-layer fix
Core (commit a5ce969): New `FirmwareUpdateServiceOptions.PostReconnectStaleHandleDelay` (default 2s). After `WaitForSerialReconnectAsync` succeeds, the firmware service now discards that race-winning handle: `device.Disconnect()` → sleep → `device.Connect()` → clean kernel binding. Opt out with `TimeSpan.Zero` (Windows, where the first open is already clean).
Probe-side responsibility (caller pattern): The post-reset device is dormant. The caller's probe should call `device.InitializeAsync()` first (which sends `SYSTem:POWer:STATe 1` etc.) before issuing the readiness query.
Hardware proof
With both fixes in place, the readiness probe (using `GetLanChipInfoAsync` after `InitializeAsync`) succeeds end-to-end on the Nyquist1.
Recommended probe shape (for callers, including daqifi-desktop)
Tests (suite now at 968/968 passing on net9.0 + net10.0)
|
After PR #200 hardware validation revealed two ergonomic issues with the readiness probe contract:
1. PIC32 application firmware boots dormant after the bootloader jumps to application (LEDs off, WiFi subsystem unpowered, won't answer LAN queries). Callers writing a "natural" probe like GetLanChipInfoAsync != null would silently fail for tens of seconds because the device needs SYSTem:POWer:STATe 1 (sent by InitializeAsync) before it'll respond. That dormant-state knowledge shouldn't be required of every probe author.
2. The PostReconnectStaleHandleDelay dance was logged at Debug level while the comparable readiness-probe wait is at Information. The dance is a significant operation (2+ seconds inside JumpingToApp); observers tailing logs should see it.
Changes in JumpToApplicationAndReconnectAsync:
- After the stale-handle dance reopens the port with a clean kernel binding, call InitializeAsync on the device if it's a DaqifiDevice (skipped for test fakes that don't extend it). Wrapped in try/catch with LogWarning so an init failure doesn't mask the eventual probe outcome; the probe stays the source of truth for "ready".
- Stale-handle dance logged at Information (was Debug).
Hardware-validated end-to-end: with this change, the example app probe simplifies to just `GetLanChipInfoAsync != null` (no wake-up code in the caller). Total post-reconnect time ~9s on Nyquist1 (2.3s race window + 2s stale-handle settle + ~4s init + ~1s probe).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up: auto-wake + log level (commit 3e4fd88)
Self-review after the previous commit identified two ergonomic gaps. Fixed in 3e4fd88.
1. Auto-wake the dormant device in Core
Previous comment recommended callers do `await sd.InitializeAsync()` in their probe to wake the post-reset device. That works but pushes "device is dormant after PIC32 reset" knowledge onto every probe author, which is a footgun. Core now handles it automatically after the stale-handle dance. The call is wrapped in try/catch with LogWarning so an init failure doesn't mask the probe outcome; the probe stays the source of truth for "ready". Skipped for non-DaqifiDevice transports (e.g. test fakes); they're responsible for their own readiness. The probe pattern simplifies accordingly.
2. Stale-handle dance log level: Debug → Information
Matches the readiness probe wait. Dance is a significant operation (2+ seconds inside JumpingToApp); observers tailing logs should see it.
Hardware re-validation
Re-flashed on the Nyquist1 with the simplified "natural" probe. Total post-reconnect time: ~9 seconds (2.3s race window + 2s stale-handle settle + ~4s init + ~1s probe).
Suite status: 968/968 passing (net9.0 + net10.0)
Summary
Adds an opt-in readiness probe so a PIC32 firmware update doesn't transition to `Complete` until the device is actually ready to answer normal application commands. Serial transport re-enumeration succeeds well before the application firmware is up; without this wait, downstream flows (LAN chip-info queries, WiFi prep) hit a half-started device, which is exactly the pattern desktop had to work around with its own retry loop.
API additions
`FirmwareUpdateServiceOptions` gains 3 new properties:
- `PostReconnectReadinessProbe`: `Func<IStreamingDevice, CancellationToken, Task<bool>>?`. Returns `true` when the application is responsive. Null disables the wait (legacy behavior preserved).
- `PostReconnectReadinessTimeout` (default 30s): wall-clock budget for the wait.
- `PostReconnectReadinessRetryDelay` (default 500ms): delay between probe attempts.
When the timeout elapses without the probe returning true, the update transitions to `Failed` with a `TimeoutException` wrapped in `FirmwareUpdateException`, NOT a silent `Complete` on a half-ready device, which is the entire point of #145.
Test plan
- … `Complete` until ready (assert `Complete` state + probe call count = 3)
- … `FirmwareUpdateException` with `FailedState=JumpingToApp` and the readiness keyword in the inner message
Closes #145