Skip to content

refactor(web): extract assistant lifecycle state machine into a non-React service#32654

Merged
vex-assistant-bot[bot] merged 7 commits into
mainfrom
ashlee/lum-2062-lifecycle-service
May 30, 2026
Merged

refactor(web): extract assistant lifecycle state machine into a non-React service#32654
vex-assistant-bot[bot] merged 7 commits into
mainfrom
ashlee/lum-2062-lifecycle-service

Conversation

@ashleeradka
Copy link
Copy Markdown
Contributor

Closes LUM-2062. Follow-up to #32641.

Why

#32641 published assistant lifecycle through Zustand and killed the outlet-context plumbing, but left the state machine itself React-shaped. The hook held all the domain state in useRefs and useState, defined the actions inside useCallback closures over React-bound primitives (useQueryClient, useMutation), and registered them into the store via a passive useEffect. That shape produced every Codex P1 we caught during #32641 cleanup:

  • The action-registry pattern (actions can't be defined inline in create() because they close over React state, so they have to be registered via useEffect after mount).
  • The NOOP_CHECK / NOOP_VOID defaults that cover the brief registration race.
  • The cold-mount capture bug — render-time .getState() captures the no-op because the parent's register-actions effect hasn't committed yet.
  • The selector-flip bug — .use.X() re-renders on registration, which cleans up and re-runs any effect listing the action as a dep, which can re-execute side effects (duplicate hatch POST in the cloud path).

The PR description on #32641 explicitly called out the action-registry pattern as the trade we made and pointed at LUM-2062 as the architectural finish line. This is that.

What changed

The state machine moves out of React entirely.

Concern Before (#32641 final) After (this PR)
State machine useAssistantLifecycle (676 lines): useState for assistantState, 6 useRefs for counters/generation/in-flight flags, action useCallbacks, register-actions useEffect, watchdog useEffect lifecycle-service.ts (620 lines): plain TypeScript class, module-level singleton, instantiated at import
Hook 676 lines 99 linesuseQueryClient, useAssistantQuery, two useEffects that push inputs and poll results into the service
Store 110 lines, holds state + 3 action fields + NOOP defaults + registerImperativeActions action 39 lines, holds assistantState only
Consumer call shape useAssistantLifecycleStore.getState().checkAssistant() (sometimes wrapped in useCallback to defeat identity flips) lifecycleService.checkAssistant() — service methods are stable from instant zero

The state machine semantics don't change. Same phases, same retry budgets, same recovery paths, same gateway-auth / local-mode branches, same post-hatch cache seed, same onboarding-redirect coordination. The extraction is purely structural.

Benefits

  • Direct unit tests for the state machine. lifecycle-service.test.ts exercises the server-state projection (active and self-hosted result handling), the auto-hatch cascade (the nonprod / retired / vanilla-managed branches), the logout reset, and the gateway-auth short-circuit — no renderHook, no React tree. Recovery paths and retry budgets are reachable from tests for the first time.
  • Cold-mount race goes away. Service methods are stable from import, so consumers can call them inline without selector subscriptions or wrapping. The duplicate-hatch bug Codex caught in refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641 cleanup is structurally impossible.
  • No more .use.X() vs .getState() decision for lifecycle actions. Either pattern would have worked equally well before — neither does anything special now, because the action isn't on the store. Just import the service and call methods.
  • Single state system. Counters, generation, in-flight guards no longer live in useRef parallel to the store. Service fields are the only place state lives; the store mirrors assistantState for React reactivity.
  • The hook is small enough to read in one sitting. 99 lines including the docstring and the inputs interface.

Why safe

  • Behavioral parity. Every transition in the service runs in the same conditions as the previous hook. Every setSelfHostedConnection(...), every setQueryData cache seed, every invalidateQueries backoff fires under the same predicate.
  • Integration tests pass unmodified. queries.test.ts, onboarding-lifecycle-sync.test.tsx, use-event-bus-init.test.tsx — the existing behavioral coverage of the hook. The only test changes are mock targets (@/assistant/lifecycle-store@/assistant/lifecycle-service in two test files; both tests still assert the same behavior).
  • tsc --noEmit, lint, audit:cross-domain, test:ci all clean. 181/181 test files pass (added one new unit-test file).
  • No new cross-domain imports. Service lives in src/assistant/ next to the existing files; consumers are unchanged in scope.

References

What I considered and did NOT do, and why

Did not switch useAssistantLifecycleStore to a zustand/vanilla store or useSyncExternalStore. Either would work — the service would own the vanilla store as its private state container, and useAssistantLifecycleStore would be a thin React wrapper around it. Kept the create<...> + createSelectors shape because (a) it's consistent with every other store in the codebase, (b) useAssistantLifecycleStore.setState({...}) from the service is the same operation the previous code did via the store's own action, just inline, and (c) the migration was big enough already.

Did not own the polling timer inside the service. TanStack Query's refetchInterval (in useAssistantQuery) still drives the 3-second poll cadence; the service just receives results via the hook's useEffect. Doing in-service polling would lose TanStack Query's dedup with other /assistant/ subscribers (sidebar header, identity panel) and the integration with the setQueryData post-hatch seed. Out of scope for this ticket.

Did not move the logout-clear into a synchronous call at the logout site (the vex-bot non-blocking observation on #32641). Same code as before — service resets when respondToInputs() sees !isLoggedIn. Moving it earlier in the call chain (e.g. into the auth-store logout action calling lifecycleService.respondToInputs() directly) is a one-line follow-up if you ever see flake; not in scope here.

Did not inject the API functions (getAssistant, hatchAssistant, retireAssistantById) through setInputs. The service imports them at module level. Tests mock via mock.module(...) — same pattern as everything else in the codebase. Constructor injection would have been the "more textbook" non-React shape but adds plumbing without solving a real problem for one consumer.

Did not extract a LifecycleStateMachine interface separate from the service. Could have separated "pure transitions" (a state-table-like function) from "side effects" (the cache seeding, the self-hosted connection priming). Considered briefly; the recovery and post-hatch paths are so interleaved with side effects that separating them just spreads the logic across two files. Held off.

Did not delete useAssistantLifecycleStore in favor of subscribing directly to a service emitter. The store is still the React tree's reactive view; the service writes to it. Two-layer separation (service for logic, store for reactivity) is the same trade useEventBusStore makes for the event bus. Consistent across the codebase.


Root cause analysis

How did the code get into this state?

#32641 extracted lifecycle into Zustand stores, which fixed the outlet-context plumbing problem. But the hook kept owning the state machine through React primitives (useRef, useCallback, useEffect) because that was the smallest diff that achieved the outlet-context fix. The trade was explicit in the #32641 PR description, with this ticket as the planned follow-up.

The deeper cause is structural: the lifecycle state machine is domain logic with no React-specific dependencies. React was the wrong shape for it from the start, but it was also the lowest-friction shape when the hook was small. As the machine grew (retry budgets, recovery paths, generation counter, watchdog), the React-bound shape became increasingly costly.

What mistakes or decisions led to it?

  1. Built the state machine inside the hook to begin with. The first version was useState + useEffect — small, simple, idiomatic React. Each new responsibility (retry budgets, recovery, watchdog) was added inside the hook because the hook already existed. No one stopped to ask "should this live outside React?"
  2. Stopped at the Zustand-store refactor in refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641. That was a real architectural improvement, but it left the action-registry pattern in place. The PR called out the trade, but a "stop at the trade" PR will accumulate workarounds (NOOP defaults, selector-vs-getState debates, cold-mount bugs) if not finished.

Were there warning signs we missed?

  • The lifecycle hook needed 6 useRefs just to hold private state. Refs are a code smell for state that's outgrowing the hook shape.
  • hatchMutateRef, retireMutateRef — capturing useMutation's .mutateAsync in refs because the mutation object itself was unstable. That's React fighting React.
  • The isRetiredRef / isNonProductionRef / onRedirectRef / resolveOnboardingRedirectRef pattern — capturing every external value in a ref to keep useCallback identities stable. Each ref was a workaround for the fact that the actions wanted to be stable but the hook couldn't make them so.
  • Three Codex P1s in refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641 cleanup, all variants of "action identity does something unexpected at the React/non-React boundary." The frequency was the signal.

What can we do to prevent this pattern from recurring?

  1. Ask "does this need React?" before building state machines in hooks. If the answer is "the state machine doesn't, but reactivity does," that's the service-with-store-output shape this PR demonstrates. Build that from the start rather than landing in it.
  2. Treat useRef-as-private-state count as a smell. A handful is normal. Six refs is a hook trying to be a class.
  3. Don't stop at "trade" PRs unless the next step is filed. refactor(web): publish assistant lifecycle through Zustand, drop outlet-context plumbing #32641 did this right (LUM-2062 was filed before merge). The systemic issue is when "trade" PRs get filed without the finish line tracked — the workaround becomes permanent.

AGENTS.md changes

No new rule. The outlet-context rule from #32641 stays in place. The lifecycle-store docstring nuance about "selector subscription vs .getState() for the registered actions" is deleted, not preserved — the trap no longer exists, so leaving the warning would mislead.

The pattern this PR demonstrates (hook-as-React-wiring + module-level service-as-state-machine + Zustand-as-reactive-output) is worth a brief mention in STATE_MANAGEMENT.md if a second example appears. For now the in-file docstrings (lifecycle-service.ts header) carry the explanation.


https://claude.ai/code/session_01NjQuUAiXsS4DVEmKPsb1uE


Generated by Claude Code

Closes LUM-2062 — the architectural finish line that #32641 set up.

The lifecycle state machine (hatch + initializing-recovery retry
budgets, generation counter for stale-response handling, 5-minute
stuck-init watchdog, gateway-auth short-circuit, post-hatch cache
seed, onboarding-redirect coordination) now lives in
`apps/web/src/assistant/lifecycle-service.ts` as a plain TypeScript
class — module-level singleton, instantiated at import,
React-independent.

What the extraction unlocks:

- **No action-registry pattern.** Actions are methods on the
  service, stable from instant zero. No `registerImperativeActions`
  effect, no `NOOP_CHECK` / `NOOP_VOID` defaults, no "use selectors
  not getState" docstring nuance, no "use getState not selectors"
  reversal of same. Consumers import `lifecycleService` and call
  methods directly.
- **No private refs alongside store state.** All state (counters,
  generation, hatching flag, in-flight ids) lives as instance
  fields on the service. Single state system, no React/store
  duality.
- **The watchdog moves out of React.** `transition()` arms / clears
  the 5-minute stuck-init timer based on the next state's kind —
  no more `initializingCycle` re-arming via useEffect.
- **Direct unit tests.** `lifecycle-service.test.ts` exercises the
  server-state projection, the auto-hatch cascade with the
  nonprod / retired branches, the logout reset, and the
  gateway-auth short-circuit — all without `renderHook`, all
  asserting on `lifecycleService` + the two Zustand stores
  directly.

The hook (`use-lifecycle.ts`) shrinks from 676 lines to 99 — just
the React-bound wiring: `useQueryClient`, `useAssistantQuery`,
two `useEffect`s that push inputs and poll results into the
service.

The store (`lifecycle-store.ts`) shrinks from 110 lines to 39 —
just the `assistantState` field. No actions, no NOOP defaults.

Consumers (`chat-page`, `hatching-screen`, `pre-chat-flow`,
`use-event-bus-init`) switch from
`useAssistantLifecycleStore.getState().checkAssistant()` to
`lifecycleService.checkAssistant()`. The chat-page wrappers that
existed only to defeat the action-registry identity flip
(`useCallback(() => store.getState().X(), [])`) get to stay tidy
because the service methods are stable from import.

Tests: 181/181 pass, including 7 new direct unit tests for the
service. `tsc --noEmit`, `lint`, and `audit:cross-domain` all
clean. No new cross-domain imports.

https://claude.ai/code/session_01NjQuUAiXsS4DVEmKPsb1uE
@linear
Copy link
Copy Markdown

linear Bot commented May 30, 2026

LUM-2062

@ashleeradka ashleeradka marked this pull request as ready for review May 30, 2026 01:43
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 527f268a45

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +367 to +369
if (generation !== this.generation) return;
if (nextState.kind !== "active") {
this.transition(nextState);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve the stuck-initializing watchdog across polls

When the assistant remains initializing, every 3-second poll reaches this path and calls transition({ kind: "initializing" }) again. transition() now clears and re-arms the initializing watchdog on every call, so continuous initializing poll results keep resetting the 5-minute timer and recoverStuckInitializingAssistant() never runs for a genuinely stuck hatch.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 0 potential issues.

View 3 additional findings in Devin Review.

Open in Devin Review

claude added 2 commits May 30, 2026 01:50
Two changes from the post-open audit:

**Codex P1 — watchdog reset by every poll.** My initial
`syncInitializingWatchdog` cleared + re-armed the 5-minute timer
on every `transition()` call. While the assistant stayed
`initializing`, every 3-second poll re-entered the projection
path and called `transition({ kind: "initializing" })` again,
which rearmed the timer — so the watchdog never expired and
`recoverStuckInitializingAssistant()` would never fire for a
genuinely stuck hatch.

The original React implementation avoided this by keying the
watchdog `useEffect` on `[assistantStateKind, initializingCycle]`;
back-to-back same-kind transitions didn't re-run the effect,
keeping the timer counting down. `initializingCycle` was bumped
only inside `hatchAndCheck`, which is how a fresh hatch attempt
restarted the clock.

Rewriting `transition()` to edge-trigger (arm only on entry into
`initializing`, clear only on exit) and adding an explicit
`armInitializingWatchdog()` call inside `hatchAndCheck` to match
the cycle-bump semantics. Net effect: same timer behavior as the
pre-extraction code.

**Audit DRY** — `applyServerStateUpdate` and
`recoverStuckInitializingAssistant` duplicated the active and
self_hosted projection logic. Extracted `projectActive(result)`
and `projectSelfHosted(result)` private helpers. Preserved the
intentional pre-existing asymmetry where the recovery path
doesn't clear `hatchingVersion` on active landing (each call
site still manages its own counter / flag bookkeeping).

**New tests** for the bug Codex caught and the ticket-mandated
retry-budget exhaustion:

  - `redundant initializing→initializing transitions do not reset the 5-minute clock`
  - `3 recoverable hatch failures in the auto-hatch poll loop surface as error`

181/181 test files pass.

https://claude.ai/code/session_01NjQuUAiXsS4DVEmKPsb1uE
`applyServerResult` takes `GetAssistantResult`, whose `data`
satisfies the full `Assistant` type. My watchdog test passed an
object literal with just `{id, status}` — CI flagged the missing
13 required fields.

Switched the test to call `lifecycleService.checkAssistant()`
twice more instead of constructing the result by hand. The
fake queryClient routes back through `getAssistantMock` (which
has loosely-typed return values per its mock definition), so
the test exercises the same projection path without
constructing the full type.

Also: added a note in `beforeEach` flagging that `mockClear`
doesn't reset implementations — future mocked deps added here
must re-set their baseline in this block or tests will silently
inherit the previous test's stub.
@ashleeradka
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aeaf269525

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


const navigateToChatAfterLifecycleRefresh = useCallback(async () => {
await useAssistantLifecycleStore.getState().checkAssistant();
await lifecycleService.checkAssistant();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard lifecycle calls before root inputs are installed

When this page mounts directly with readOnboardingCompleted() already true, its child useEffect can run before RootLayout's passive effect calls lifecycleService.setInputs(). This direct singleton call then uses the service's default queryClient: null, so checkAssistant() catches a TypeError and publishes a spurious network-error lifecycle state before navigating; the previous store action had a no-op pre-registration default specifically for this window. Keep a no-op/ready guard or initialize the service before child effects can invoke it.

Useful? React with 👍 / 👎.

claude added 2 commits May 30, 2026 02:04
Codex P2: a child route's `useEffect` that calls a lifecycle
action (e.g. `lifecycleService.checkAssistant()`) on a direct
mount can fire BEFORE `RootLayout`'s passive effect has called
`setInputs()` — children's effects commit before parent's in
the same render cycle. The unguarded action tried to use the
default `null` queryClient, threw a TypeError on
`null.fetchQuery(...)`, caught it, and published a spurious
network-error state.

The previous store-based action API papered this over with
NOOP_CHECK / NOOP_VOID defaults. The service had no equivalent.

Adding a `ready` flag that flips true on the first `setInputs()`
call. Public actions (`checkAssistant`, `retryAssistant`,
`hatchVersion`, `respondToInputs`, `applyServerResult`)
early-return when `!ready`. Matches the no-op-default behavior
the old store API used to ship with, but now expressed
explicitly on the service.

New test asserts that calling actions before `setInputs` is a
no-op and leaves the lifecycle store in its initial `loading`
state.

181/181 test files pass.
The docstring said "Mirrors the no-op defaults the pre-service
store action API used to ship with" — references a shape that
no longer exists in the codebase. Rephrased to state the
invariant the guard protects (children's effects committing
before the parent's) without naming a previous implementation.
@ashleeradka
Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

vex-assistant-bot[bot]
vex-assistant-bot Bot previously approved these changes May 30, 2026
Copy link
Copy Markdown
Contributor

@vex-assistant-bot vex-assistant-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE — reviewed at 25a2dcbd4e

Value: Closes the architectural finish line #32641's PR description pointed at. The action-registry pattern, NOOP defaults, render-time capture race, and selector-flip race — all four Codex P1s from #32641 dissolve because the state machine no longer lives inside React's render-and-effect timing. Plus: the machine is now directly testable without renderHook.

What this does:

  • New lifecycle-service.ts (642 lines) — module-level singleton class owning state machine, watchdog, retry budgets, generation counter, gateway-auth short-circuit, cache seeding, onboarding-redirect coordination.
  • use-lifecycle.ts shrinks 676 → 99 lines — pure wiring: pull React signals + push via setInputs, hand poll results via applyServerResult. No state, no callbacks.
  • lifecycle-store.ts shrinks 110 → 39 lines — assistantState only. The store is now strictly an observability surface; only the service writes.
  • Consumer call shape simplifies: useAssistantLifecycleStore.getState().checkAssistant()lifecycleService.checkAssistant() across chat-page, hatching-screen, pre-chat-flow, use-event-bus-init.
  • New direct unit tests in lifecycle-service.test.ts exercise server-state projection (active + self-hosted), the auto-hatch cascade (nonprod / retired / vanilla-managed branches), the logout reset, and the gateway-auth short-circuit. No renderHook, no React tree.
Full analysis

Codex P1 (watchdog reset on every poll) — closed at 4a9aade767

The original transition() cleared+re-armed the watchdog on every call, so back-to-back initializing → initializing poll results reset the 5-minute clock indefinitely and recoverStuckInitializingAssistant() could never fire on a genuinely stuck hatch. Fix is edge-triggered:

if (next.kind !== "initializing") {
  if (this.initializingTimeout) clearTimeout(...);
} else if (prevKind !== "initializing") {
  this.armInitializingWatchdog();
}

A repeat initializing → initializing is a no-op kind-wise — no clear, no re-arm. hatchAndCheck() calls armInitializingWatchdog() explicitly so a fresh hatch attempt still resets the clock. The inline comment spells out the invariant — high-leverage for the next maintainer.

Codex P2 (pre-mount lifecycle call → null queryClient) — closed at 26521f2162

ready: boolean ships false, setInputs() is the only flip-to-true. Every public action (checkAssistant, respondToInputs, applyServerResult) early-returns if (!this.ready). So a child route whose useEffect commits before RootLayout's setInputs lands — the parent/child effect commit-order race — becomes a silent no-op instead of TypeError-on-null.fetchQuery followed by a spurious network-error state being published.

This is the structural equivalent of the NOOP_CHECK defaults the previous version shipped — same invariant, but the guard lives once on the class instead of being scattered across three action fields with NOOP fallbacks the store had to expose. The docstring on ready explicitly cites the race; future maintainers see why it's there.

Test architecture

Bun's mock.module at module scope + lifecycleService.__resetForTesting() in beforeEach/afterEach is the right pattern for a module-level singleton. The comment in beforeEach about why mockClear isn't enough (and why every new mock needs a fresh baseline) is doing real work — the kind of subtle leak that costs 30 minutes of debugging a flaky test.

The fact that auto-hatch cascade, retry-budget exhaustion, and the gateway-auth short-circuit are now reachable from a unit test — no renderHook, no fake timers, no time-travel reasoning about effect commit order — is the headline benefit of the extraction. The state machine being a class with __resetForTesting is the kind of testability boring frontend code rarely gets.

Observations (non-blocking)

  1. useEffect dep array in use-lifecycle.ts:75–86 lists onRedirect and resolveOnboardingRedirect. If the parent constructs either inline per render, the effect re-runs every render → setInputs re-publishes identical values → respondToInputs re-fires. Most calls are early-return no-ops inside the service, so it's not a correctness bug — but worth confirming RootLayout wraps both in useCallback. The kind of thing the next render-perf pass will catch if it's drifted.

  2. getLocalGatewayUrl mock at lifecycle-service.test.ts:41 is set up but I don't see it asserted in any test. Either dead mock or pending coverage for the local-gateway projection path. Drop it or add the assertion.

  3. useCallback wrappers in chat-page.tsx:174–187 — still necessary because lifecycleService.checkAssistant is a normal class method (passing the bare reference loses this). Two cleaner shapes if you ever want to revisit: (a) define the actions as arrow-class-fields so this binds at construction; (b) ship pre-bound exports next to the singleton. Current form is fine — the comment above each wrapper could shrink now that the identity-flip rationale is gone, but not worth a follow-up commit.

Merge gate

  • CI 7/7 green at HEAD
  • Codex review at HEAD: "Didn't find any major issues. Hooray!"
  • Devin review: 0 potential issues
  • Both Codex inline findings (P1 watchdog, P2 input guard) verified closed in code

Mergeable.

Vellum Constitution — Trust-seeking: by lifting the state machine out of React's commit-order choreography, the four classes of failure modes Codex caught in the predecessor PR become structurally impossible to express, and the machine is now reachable from unit tests with no time-travel reasoning about render order.

The comment referenced "actions registered in a passive effect"
and the render-time-capture-vs-selector-subscription debate —
all of which became moot when the state machine moved out of
React. Updated to state the current invariants: stable identity
across renders for prop boundaries, and `this` binding preserved
on call (the underlying reason class methods need a wrapper).
Closes LUM-2067 — the test-coverage gap I'd been planning as a
follow-up. Two new direct unit tests exercise the recovery code
that was the last meaningful uncovered area in the service:

1. **Watchdog firing on a still-initializing assistant retires
   and re-hatches.** Verifies the timer fires, recovery captures
   the stuck assistant id, calls `retireAssistantById`, and
   follows up with `hatchAssistant`. The fix from `4a9aade767`
   that edge-triggers the watchdog is what makes this reachable —
   without it, redundant `initializing→initializing` poll
   transitions would reset the timer indefinitely.

2. **MAX_INITIALIZING_RECOVERIES failed recoveries surface as a
   terminal timeout error.** Drives four watchdog cycles where
   each recovery's rehatch returns initializing (the assistant
   stays stuck). After the budget exhausts, the next watchdog
   firing transitions to the `buildInitializingTimeoutError`
   shape. Confirms `MAX_INITIALIZING_RECOVERIES` actually bounds
   the loop (the kind of off-by-one that would manifest as users
   permanently stuck on the initializing screen).

Pattern: `mock.module("@/assistant/lifecycle", ...)` overrides
`INITIALIZING_TIMEOUT_MS` to 30ms while preserving the real
helpers (`resolveAssistantLifecycleState`,
`buildInitializingTimeoutError`, etc.). The watchdog runs in
real time against the shrunk timeout, no fake-timer machinery,
no time-travel reasoning.

Left for the integration tests: the "force-refresh found active"
and post-hatch active-landing branches both depend on the
React-tree polling loop firing `applyServerResult` from the
next `useAssistantQuery` data tick. The state machine alone
can't reach those outcomes without a poll driver, so
`onboarding-lifecycle-sync.test.tsx` carries that coverage.

181/181 tests pass.
Copy link
Copy Markdown
Contributor

@vex-assistant-bot vex-assistant-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE — reviewed at c9a8b15da0

Two commits since the prior APPROVE at 25a2dcbd4e, both targeted at observations from that review.

Test coverage extension (c9a8b15da0, closing LUM-2067):

12 direct unit tests now, +2 from the prior pass — both exercising the recovery code that was the last meaningful uncovered area in the service:

  1. Watchdog → retire → re-hatch end-to-end. The watchdog is reachable from a test only because transition()'s edge-trigger fix (4a9aade767) stopped initializing → initializing poll repeats from indefinitely resetting the clock. The test captures that explicitly: a getAssistant → initializing projection arms the watchdog, the watchdog fires after TEST_INITIALIZING_TIMEOUT_MS (30ms), retireAssistantById is called with the captured initializingAssistantId, then hatchAssistant follows up. The structural fix made the test possible — which is exactly the right shape.

  2. MAX_INITIALIZING_RECOVERIES budget. Drives four watchdog cycles where each recovery's re-hatch returns initializing (the assistant stays stuck). After the budget exhausts, the next firing transitions to buildInitializingTimeoutError. The kind of off-by-one that would surface as users permanently stuck on the initializing screen — now a regression test catches it.

The mocking pattern is the right call. mock.module("@/assistant/lifecycle", ...) overrides only INITIALIZING_TIMEOUT_MS, preserving the real resolveAssistantLifecycleState / buildInitializingTimeoutError / shouldRecoverFromHatchFailure helpers. The watchdog runs in real time against the shrunk timeout — no fake-timer machinery, no time-travel reasoning about effect commit order. Same testability headline as the original extraction, extended.

Integration boundary articulated. The commit message names the two branches deliberately left to integration coverage — "force-refresh found active" (recovery branch when initializingAssistantId was null at watchdog time) and post-hatch active landing — because both depend on the React-tree polling loop firing applyServerResult from the next useAssistantQuery data tick. The state machine alone can't reach those outcomes without a poll driver, so onboarding-lifecycle-sync.test.tsx carries them. That's a principled unit-vs-integration boundary, not a coverage gap to apologize for.

Comment cleanup (ad191fd7b9):

Picks up Observation #3 from the prior review. The old comment carried forward the action-registry / NOOP-defaults / selector-flip rationale that became moot when the state machine moved out of React. New comment states the current invariants: stable identity for prop boundaries, this binding preserved on call. Crisp.

Observation #2 (unused getLocalGatewayUrl mock at lifecycle-service.test.ts:63) — still present, still unused. Lowest-stakes of the three; non-blocking. Drop on the next pass or annotate as a future-coverage placeholder.

Observation #1 (wiring-effect dep stability) — already verified moot in the prior review: RootLayout passes useNavigate (stable hook output) and a module-imported resolveOnboardingRedirect (program-lifetime constant). The wiring effect doesn't re-fire on renders.

Merge gate

  • CI 7/7 green at HEAD
  • Codex review at HEAD: "Didn't find any major issues" (the lingering P1/P2 inline anchors are auto-migrated from older commits — both verified structurally closed in the prior review)
  • Devin review: 0 potential issues
  • 181/181 tests pass per commit message
  • All three prior-review observations either addressed (#3) or verified non-issues (#1, #2)

Mergeable. Test additions tilt this from "ship and watch" to "ship with confidence" — the recovery code that runs on a genuinely-stuck cold mount now has end-to-end regression coverage.

@vex-assistant-bot vex-assistant-bot Bot merged commit 527ca02 into main May 30, 2026
7 checks passed
@vex-assistant-bot vex-assistant-bot Bot deleted the ashlee/lum-2062-lifecycle-service branch May 30, 2026 02:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants